Solr Query Performance
Status: In-Progress
Summary
The current Solr Provider implementation stores and indexes data within the same Solr core/collection (catalog). This core/collection also contains UI storage such as Workspaces, Queries, etc. Storing UI elements in the same collection has forced us to implement inefficient workarounds with commit times in order to support that use case. As a result, resources that change infrequently also lose some caching ability. Additionally, stored data is persisted in the same core/collection as the index. Storage and indexing are separate capabilities that should be independent of one another.
Stakeholders
DDF installations with large numbers of records and defined performance requirements.
Use Cases
List of key use cases to consider for the design. Could be in the form of user stories for user facing features.
Use case 1 title
Use case 1 description if necessary
Assumptions
- The implementation needs to work with existing methods in the Catalog Framework, and should allow for a single catalog core to exist for the unzip and run use case.
Functional Requirements
N/A
Non-functional Requirements
- Query performance. In order to provide a Google-like experience, queries need to be extremely performant and take advantage of caching whenever possible.
- Ingest performance should not suffer under the new implementation. A secondary consideration is that, as long as data storage and the index are separated, indexing strategies can be updated over time and the index rebuilt from stored data.
Qualities
Constraints
- The solution should work within the existing SolrProvider in DDF and not require CatalogFramework changes. If the existing SolrCatalogProvider cannot be easily extended, the solution should appear to the CatalogFramework as a CatalogProvider.
Candidate Solutions
Update Solr Catalog Provider
The Solr Catalog Provider will be updated.
- Separate the storage and indexing capabilities
  - New interfaces:
    - SolrStorageProvider
    - SolrIndexProvider
  - New implementations:
    - SimpleStorageProvider
    - SimpleIndexProvider
- Create SolrClientFactory
This factory will be responsible for returning a Solr client that is configured to communicate with the correct Solr core/collection. For the unzip and run use case, this will always return catalog as it is currently implemented.
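The routing half of such a factory could look roughly like the sketch below. It only resolves a core/collection name; constructing the actual Solr client from that name is left out so the example stays self-contained. `RecordKind`, `CoreResolver`, both implementations, and any core name other than `catalog` are hypothetical names chosen for illustration, not part of the existing code base.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical record categories; the ADR mentions resource and UI data as examples. */
enum RecordKind { RESOURCE, WORKSPACE, QUERY }

/** Hypothetical routing contract: which core/collection holds a given kind of record. */
interface CoreResolver {
    String coreFor(RecordKind kind);
}

/** Unzip-and-run behaviour: every record kind resolves to the single "catalog" core. */
final class SingleCoreResolver implements CoreResolver {
    @Override
    public String coreFor(RecordKind kind) {
        Objects.requireNonNull(kind, "kind");
        return "catalog";
    }
}

/** Split behaviour: a fixed lookup table keeps routing deterministic. */
final class RoutingCoreResolver implements CoreResolver {
    private final Map<RecordKind, String> routes;
    private final String defaultCore;

    RoutingCoreResolver(Map<RecordKind, String> routes, String defaultCore) {
        this.routes = Map.copyOf(routes); // defensive, immutable copy
        this.defaultCore = Objects.requireNonNull(defaultCore, "defaultCore");
    }

    @Override
    public String coreFor(RecordKind kind) {
        Objects.requireNonNull(kind, "kind");
        // Kinds without an explicit route fall back to the default core.
        return routes.getOrDefault(kind, defaultCore);
    }
}
```

Because the same kind always resolves to the same core, writes and later reads agree on where a record lives, which is the property the split design depends on.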
Basic Class Diagram: solr-provider.pptx
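As a rough illustration of the storage/index separation, the sketch below defines two narrow contracts and a facade that writes to both. All type names and method signatures here (`RecordStorage`, `RecordIndex`, `MapStorage`, `TermIndex`, `SplitProvider`) are assumptions made for the example; the real interfaces are expected to follow the class diagram and the existing CatalogProvider API. Records are reduced to plain strings and the index to an in-memory term map.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical storage contract: persists and returns full records by ID. */
interface RecordStorage {
    void store(String id, String body);
    Optional<String> read(String id);
}

/** Hypothetical index contract: knows only search terms and record IDs. */
interface RecordIndex {
    void index(String id, String text);
    Set<String> search(String term);
}

/** In-memory stand-in for a storage backend. */
final class MapStorage implements RecordStorage {
    private final Map<String, String> records = new HashMap<>();

    @Override public void store(String id, String body) { records.put(id, body); }

    @Override public Optional<String> read(String id) {
        return Optional.ofNullable(records.get(id));
    }
}

/** In-memory stand-in for an index: lower-cased word tokens mapped to record IDs. */
final class TermIndex implements RecordIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    @Override public void index(String id, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        }
    }

    @Override public Set<String> search(String term) {
        return new TreeSet<>(postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of()));
    }
}

/** Facade the framework would see as one provider; a create fans out to both halves. */
final class SplitProvider {
    private final RecordStorage storage;
    private final RecordIndex index;

    SplitProvider(RecordStorage storage, RecordIndex index) {
        this.storage = storage;
        this.index = index;
    }

    void create(String id, String body) {
        storage.store(id, body); // persist first so the index never points at a missing record
        index.index(id, body);
    }
}
```

Keeping the two contracts this small is what would allow either side to be swapped, for example a relational storage backend behind `RecordStorage`, without touching the other.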
Analysis
Security Impact / Considerations
Threat Model
N/A
Risk Analysis
N/A - The approach has no additional security implications beyond the current implementation. If a non-Solr storage provider is used (e.g. MySQL, MariaDB, PostgreSQL), the incorporation of that tool will need to be analyzed individually.
Strengths
- Ability to optimize indexing and storage independently.
- Storage provider can be anything including traditional RDBMS.
- Built-in ability to reindex from existing stored data.
- Ability to optimize index collections more appropriately based on data being stored (UI data, resource data, archive vs current, resource content, etc).
- Opportunities for different commit settings (Near Real Time vs Seconds/Minutes)
- Opportunities to optimize index fields based on the particular data being indexed
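The reindexing strength listed above can be sketched as follows: because the stored records survive independently of the index, a changed analysis strategy only requires replaying storage through the new strategy. `Reindexer` and both tokenizer methods are hypothetical, and the index is modelled as a plain term-to-IDs map rather than a Solr collection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical helper that rebuilds an index purely from stored records. */
final class Reindexer {
    private Reindexer() {}

    /** Builds a fresh term -> record IDs map using the supplied analysis strategy. */
    static Map<String, Set<String>> rebuild(Map<String, String> stored,
                                            Function<String, List<String>> analyzer) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> record : stored.entrySet()) {
            for (String term : analyzer.apply(record.getValue())) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(record.getKey());
            }
        }
        return index;
    }

    /** Example original strategy: whitespace-separated tokens, case preserved. */
    static List<String> exactTokens(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    /** Example revised strategy: the same tokens, lower-cased. */
    static List<String> lowercaseTokens(String text) {
        return exactTokens(text.toLowerCase(Locale.ROOT));
    }
}
```

Calling `Reindexer.rebuild(stored, Reindexer::lowercaseTokens)` on the same stored map produces the new index without re-ingesting any source data; the old index can be discarded once the new one is in place.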
Weaknesses
- Complexity
- Storing and retrieving data will be more complex, as the routing rules need to be deterministic for splitting cores/collections.
- Retrieving data is more complex, as it requires querying the index for matches and then using the returned IDs to get the individual records.
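The two-step retrieval described in the last point can be sketched as below. `TwoPhaseQuery` is a hypothetical class; the index and storage are modelled as in-memory maps, and records are fetched one ID at a time, whereas a real provider would presumably fetch them in a single batch request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical two-phase read path: ask the index for IDs, then load records from storage. */
final class TwoPhaseQuery {
    private final Map<String, List<String>> index; // term -> matching record IDs, in rank order
    private final Map<String, String> storage;     // record ID -> stored record

    TwoPhaseQuery(Map<String, List<String>> index, Map<String, String> storage) {
        this.index = index;
        this.storage = storage;
    }

    List<String> query(String term) {
        // Phase 1: the index returns only IDs.
        List<String> ids = index.getOrDefault(term, List.of());

        // Phase 2: resolve each ID against storage, preserving the index ordering.
        List<String> records = new ArrayList<>();
        for (String id : ids) {
            String record = storage.get(id);
            if (record != null) { // skip IDs whose record is no longer stored
                records.add(record);
            }
        }
        return records;
    }
}
```

The null check shows one consequence of the split: the index and storage can briefly disagree, so the read path has to decide what to do with an ID that no longer resolves.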
Risks
list of potential risks for approach
Recommendation
Tickets to perform this work:
Create new CatalogProvider: https://github.com/codice/ddf/issues/4999
Create API to separate StorageProvider and IndexProvider: https://github.com/codice/ddf/issues/5000
Create Solr StorageProvider: https://github.com/codice/ddf/issues/5001
Create Solr IndexProvider: https://github.com/codice/ddf/issues/5002
Create PostgreSQL StorageProvider: https://github.com/codice/ddf/issues/5003
Decision
Final decision on selected solution. Can be overturned with a new ADR if context significantly changes/invalidated or better solutions become available.
Consequences
Impacts discovered during development or post deployment can be recorded here.