Solr Query Performance

Status: In-Progress

Summary

The current Solr Provider implementation stores and indexes data within the same Solr core/collection (catalog). This same core/collection also contains UI storage such as Workspaces, Queries, etc. Storing UI elements in this collection has forced us to implement inefficient workarounds with commit times in order to support that use case. As a result, resources that do not change as often also lose some caching ability. Additionally, stored data is persisted in the same core/collection as the index. These are separate capabilities that should be independent of one another.

Stakeholders

DDF installations with large numbers of records and defined performance requirements.

Use Cases

List of key use cases to consider for the design.  Could be in the form of user stories for user facing features.

Use case 1 title

Use case 1 description if necessary

Assumptions

  1. The implementation needs to work with existing methods in the Catalog Framework, and should allow a single catalog core to exist for the unzip-and-run use case.

Functional Requirements

N/A

Non-functional Requirements

  1. Query performance. In order to provide a Google-like experience, queries need to be extremely performant and take advantage of caching whenever possible. 
  2. Ingest performance should not suffer under the new implementation. A secondary consideration is that as long as the data storage and index are separated, it is possible to update indexing strategies over time and re-index based on stored data.

Qualities


Constraints

  1. Solution should work within the existing SolrProvider in DDF, and not require CatalogFramework changes. The solution should appear to the CatalogFramework as a CatalogProvider if the existing SolrCatalogProvider cannot be easily extended.

Candidate Solutions

Update Solr Catalog Provider

The Solr Catalog Provider will be updated.

  1. Separate the storage and indexing capabilities
    1. New interfaces: 
      SolrStorageProvider
      SolrIndexProvider
    2. New Implementation
      SimpleStorageProvider
      SimpleIndexProvider
  2. Create SolrClientFactory
    This factory will be responsible for returning a Solr client that is configured to communicate with the correct Solr core/collection. For the unzip-and-run use case, it will always return a client for the catalog core, as currently implemented.
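
The split described above can be sketched roughly as follows. This is a minimal illustration, not the DDF API: the interface and factory names come from this document, but every method signature, the route/coreFor helpers, and the "ui" core name are assumptions. To keep the sketch free of a SolrJ dependency, the factory resolves a core name where a real one would return a configured Solr client.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ProviderSketch {

    /** Persists and returns full records by ID. Method names are assumptions. */
    interface SolrStorageProvider {
        void store(String id, Map<String, String> record);

        Optional<Map<String, String>> retrieve(String id);
    }

    /** Indexes searchable fields and answers queries with matching IDs only. */
    interface SolrIndexProvider {
        void index(String id, Map<String, String> fields);

        List<String> queryIds(String field, String value);
    }

    /**
     * Hypothetical stand-in for SolrClientFactory. It maps a kind of data to a
     * core/collection name; a real factory would hand back a Solr client.
     */
    static class SolrClientFactory {
        static final String DEFAULT_CORE = "catalog";

        private final Map<String, String> coreByDataType = new HashMap<>();

        /** Registers a dedicated core for one kind of data. */
        void route(String dataType, String core) {
            coreByDataType.put(dataType, core);
        }

        /** Falls back to the single catalog core when nothing is registered. */
        String coreFor(String dataType) {
            return coreByDataType.getOrDefault(dataType, DEFAULT_CORE);
        }
    }

    public static void main(String[] args) {
        SolrClientFactory factory = new SolrClientFactory();
        // Unzip-and-run: no routes configured, so everything uses "catalog".
        System.out.println(factory.coreFor("workspace"));
        // A tuned installation could move UI data to its own core.
        factory.route("workspace", "ui");
        System.out.println(factory.coreFor("workspace"));
    }
}
```

With no routes registered the factory behaves like the current single-core setup, which is how the unzip-and-run case would stay unchanged.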

Basic Class Diagram: solr-provider.pptx

Analysis

Security Impact / Considerations 

Threat Model

N/A

Risk Analysis

N/A - The approach has no additional security implications beyond the current implementation. If a non-Solr storage provider is used (e.g. MySQL, MariaDB, PostgreSQL), the incorporation of that tool will need to be analyzed individually.

Strengths

  1. Ability to optimize indexing and storage independently.
    1. Storage provider can be anything including traditional RDBMS.
    2. Will have internal abilities to reindex data from existing stored data.
  2. Ability to optimize index collections more appropriately based on data being stored (UI data, resource data, archive vs current, resource content, etc). 
    1. Opportunities for different commit settings (Near Real Time vs Seconds/Minutes)
    2. Opportunities to optimize index fields based on the particular data being indexed
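
The reindexing strength can be illustrated with a small sketch, assuming stored records are available as a simple ID-to-text map. The reindex method, the analyzer parameter, and the in-memory inverted index are all hypothetical stand-ins for whatever the storage and index providers end up exposing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

public class ReindexSketch {

    /**
     * Rebuilds an inverted index (term -> record IDs) purely from stored data.
     * The analyzer is the "indexing strategy": swapping it and calling this
     * method again yields a new index without re-ingesting anything.
     */
    static Map<String, Set<String>> reindex(
            Map<String, String> storedTextById,
            Function<String, List<String>> analyzer) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> entry : storedTextById.entrySet()) {
            for (String term : analyzer.apply(entry.getValue())) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("1", "Solr Query");
        stored.put("2", "Query Performance");

        // Strategy A: index the whole value as one lower-cased term.
        Function<String, List<String>> exact =
                s -> Collections.singletonList(s.toLowerCase());
        // Strategy B: split on whitespace so individual words are searchable.
        Function<String, List<String>> tokenized =
                s -> Arrays.asList(s.toLowerCase().split("\\s+"));

        System.out.println(reindex(stored, exact));
        System.out.println(reindex(stored, tokenized));
    }
}
```

Because the stored records are the source of truth, the index can be treated as disposable and rebuilt whenever the strategy changes.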

Weaknesses

  1. Complexity
    1. Storing and Retrieving data will be more complex as the routing rules need to be deterministic for splitting cores/collections.
    2. Retrieving data is more complex as it requires querying the index for matches, and then using the returned IDs to fetch the individual records.
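
Both points can be seen in a small in-memory sketch, under the assumption that records are routed by a type attribute. The class, the Hit type, the coreFor rule, and the "ui" core name are hypothetical; maps stand in for the Solr index and the storage cores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class TwoStepQuerySketch {

    /** An index hit carries just enough to locate the stored record. */
    static final class Hit {
        final String id;
        final String type;

        Hit(String id, String type) {
            this.id = id;
            this.type = type;
        }
    }

    /** Routing rule: must give the same answer at store time and read time. */
    static String coreFor(String type) {
        return ("workspace".equals(type) || "query".equals(type)) ? "ui" : "catalog";
    }

    private final Map<String, List<Hit>> index = new HashMap<>();
    // core name -> (record ID -> stored title)
    private final Map<String, Map<String, String>> storageByCore = new HashMap<>();

    void ingest(String id, String type, String title) {
        storageByCore.computeIfAbsent(coreFor(type), c -> new HashMap<>()).put(id, title);
        // De-duplicate terms so a record is indexed once per term.
        for (String term : new LinkedHashSet<>(Arrays.asList(title.toLowerCase().split("\\s+")))) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new ArrayList<>()).add(new Hit(id, type));
            }
        }
    }

    List<String> query(String term) {
        List<String> titles = new ArrayList<>();
        // Step 1: ask the index for matching hits.
        List<Hit> hits = index.getOrDefault(term.toLowerCase(), Collections.emptyList());
        // Step 2: route each hit to its core and fetch the stored record.
        for (Hit hit : hits) {
            Map<String, String> core = storageByCore.get(coreFor(hit.type));
            if (core != null && core.containsKey(hit.id)) {
                titles.add(core.get(hit.id));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        TwoStepQuerySketch catalog = new TwoStepQuerySketch();
        catalog.ingest("1", "resource", "Solr Query Performance");
        catalog.ingest("2", "workspace", "My Query Workspace");
        System.out.println(catalog.query("query"));
    }
}
```

If coreFor returned a different core at read time than at store time, step 2 would silently miss the record, which is why the routing rules need to be deterministic.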

Risks

list of potential risks for approach

Recommendation

Tickets to perform this work:

Create new CatalogProvider: https://github.com/codice/ddf/issues/4999

Create API to separate StorageProvider and IndexProvider: https://github.com/codice/ddf/issues/5000

Create Solr StorageProvider: https://github.com/codice/ddf/issues/5001

Create Solr IndexProvider: https://github.com/codice/ddf/issues/5002

Create PostgreSQL StorageProvider: https://github.com/codice/ddf/issues/5003


Decision

Final decision on selected solution.  Can be overturned with a new ADR if context significantly changes/invalidated or better solutions become available.

Consequences

Impacts discovered during development or post deployment can be recorded here.