Solr Catalog Provider Apps

Description

The Solr Catalog Provider (SCP) is an implementation of the CatalogProvider interface that uses Apache Solr as a data store. Some notable features of the SCP are:

  • Supports Extensible Metacards
  • Fast, simple contextual searching
  • Indexes XML attributes as well as CDATA sections and XML text elements
  • Simple XPath support for relative (//element) and absolute (/root/element) paths
  • Works with an embedded, local Solr Server (all-in-one Catalog)
    • No configuration necessary on a single-node distribution
    • The data directory for the Solr indexes is configurable
  • Works with a standalone Solr Server

Usage

The Solr Catalog Provider is used in conjunction with an Apache Solr Server data store. It can work with an embedded, local Solr Server instance or with an external Solr Server. The embedded, local instance is a lightweight solution that works out of the box without any configuration. However, it does not provide a Solr Admin GUI or a "REST-like HTTP/XML and JSON API." If either is necessary, see Standalone Solr Server App.

Two different apps exist:

  1. catalog-solr-app - includes the Solr Catalog Provider and an embedded Solr Server all-in-one. 
  2. catalog-solr-external-app - includes only the Solr Catalog Provider and is meant to be used only with the Standalone Solr Server App (catalog-solr-server-app).

App Comparison / Usage Chart

Each feature below lists the pros and cons of Embedded Solr and Standalone Solr.

Feature: Scalability
  Embedded Solr, Pro:
    (none listed)
  Embedded Solr, Con:
    • Does not scale. Only runs one single server instance.
    • Does not allow the middle tier to be scaled.
  Standalone Solr, Pro:
    • Allows the middle tier to be scaled by pointing various middle-tier instances to one server facade.
    • Possible data tier scalability with Solr Cloud. Solr Cloud allows for "high scale, fault tolerant, distributed indexing and search capabilities."
  Standalone Solr, Con:
    • Solr Cloud Catalog Provider not implemented yet.

Feature: Flexibility
  Embedded Solr, Pro:
    • Can be embedded in Java easily.
    • Requires no HTTP connection.
    • Uses the same interface as the Standalone Solr Server uses under the covers.
    • Allows for full control over the Solr Server. No synchronization issues on startup, i.e. the Solr Server will start up synchronously with the Solr Catalog Provider.
    • Runs within the same JVM.
    • Setup and installation is simple: "Unzip and run."
  Embedded Solr, Con:
    • Can only be interfaced using Java.
  Standalone Solr, Pro:
    • REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.
    • Ability to run in a separate JVM or in the same JVM as the middle tier.
  Standalone Solr, Con:
    (none listed)

Feature: (Administrative) Tools
  Embedded Solr, Pro:
    • External open source tools like Luke will work to allow admins to check index contents.
    • JMX metrics reporting is enabled.
  Embedded Solr, Con:
    • No Admin Console.
    • No easy way to natively access (out of the box) what is in the index files or the health of the server at the data store level.
  Standalone Solr, Pro:
    • Contains the Solr Admin GUI, which allows admins to query, check health, see metrics, see configuration files and preferences, etc.
    • External open source tools like Luke will work to allow admins to check index contents.
    • JMX metrics reporting is enabled.
  Standalone Solr, Con:
    (none listed)

Feature: Security
  Embedded Solr, Pro:
    • Does not open any ports, which means no ports have to be secured.
  Embedded Solr, Con:
    (none listed)
  Standalone Solr, Pro:
    • Inherits app server security.
  Standalone Solr, Con:
    • Admin console must be secured and is openly accessible.
    • REST-like HTTP/XML and JSON APIs must be secured.
    • Current Catalog Provider implementation requires sending unsecured messages to Solr. Without a coded solution, network or firewall restrictions are required in order to secure it.

Feature: Performance
  Embedded Solr, Pro:
    • Requires no HTTP or network overhead.
    • Near real-time indexing.
    • Can understand complex queries.
  Embedded Solr, Con:
    (none listed)
  Standalone Solr, Pro:
    • If scaled, high performance.
    • Near real-time indexing.
  Standalone Solr, Con:
    • Possible network latency impact.
    • Extra overhead when sent over HTTP. Extra parsing for XML, JSON, or other interface formats.
    • Possible limitations upon requests and queries dependent on HTTP server settings.

Feature: Backup/Recovery
  Embedded Solr, Pro:
    • Can back up the indexes manually or through custom scripts.
  Embedded Solr, Con:
    • Must copy files when the server is shut down.
  Standalone Solr, Pro:
    • Built-in recovery tools that allow in-place backups (does not require server shutdown).
    • Backup of Solr indexes can be scripted.
    • Recovery is done as an HTTP request.
  Standalone Solr, Con:
    (none listed)

When to Use

Use the local, embedded Solr Catalog Provider when only one DDF instance is necessary and scalability is not an issue. The local, embedded Solr Catalog Provider requires no installation and little to no configuration since it is ready out of the box. It is well suited for demonstrations, training exercises, or sparse querying and ingesting. For heavy querying and ingest processing, use the Standalone Solr Server on a separate machine. See the Standalone Solr Server Recommended Configuration. Both apps can store the same amount of data and indexes.

Installation and Uninstallation

The Solr Catalog Provider can be installed and uninstalled using the normal processes described in the Configuration section. Ensure that no other Catalog Provider is installed before installing this Catalog Provider.

Embedded Solr Server and Solr Catalog Provider

Users can use the Solr Catalog Provider with an embedded Solr Server by installing the catalog-solr-provider feature (if it is not already installed). Installing this feature installs a Solr Catalog Provider and starts an instance of an embedded Java Solr Server within the distribution. Optional configurations are available. See the Configuration section for more information.

Solr Catalog Provider for External Solr

If the Solr Server is not embedded within the current distribution, a user will need to install the external Solr Catalog Provider by installing the feature catalog-solr-external-provider. This will not install any Solr Servers. Installing the feature provides the user with an "unconfigured" Solr Catalog Provider. See the Configuration section for how to configure this Solr Catalog Provider to connect to an external Solr Server.

Configuration

Embedded Solr Server and Solr Catalog Provider

No configuration is necessary for the embedded Solr Server and the Solr Catalog Provider to work out of the box. The standard installation described above is sufficient. When the catalog-solr-provider feature is installed, it stores the Solr index files in <DISTRIBUTION_INSTALLATION_DIRECTORY>/data/solr by default. A user does not have to specify any parameters. In addition, the catalog-solr-provider feature contains all files necessary for Solr to start the server.

However, this component can be configured to specify the directory to use for data storage using the normal processes described in the Configuration section. 

The configurable properties for the SCP are accessed from the Catalog Embedded Solr Catalog Provider Configurations in the Admin Console.

Handy Tip

The Embedded (Local) Solr Catalog Provider works on startup without any configuration because a local embedded Solr Server is automatically started and pre-configured.

Configurable Properties
Title: Data Directory File Path
  Property: dataDirectoryPath
  Type: String
  Default Value: (blank)
  Required: No
  Description:
    Specifies the directory to use for data storage. A shutdown of the server is necessary for this property to take effect. If a filepath is provided with directories that don't exist, SCP will attempt to create those directories. Out of the box (without configuration), the SCP writes to <DISTRIBUTION_INSTALLATION_DIRECTORY>/data/solr

    If dataDirectoryPath is left blank (empty string), it will default to <DISTRIBUTION_INSTALLATION_DIRECTORY>/data/solr.

    If Data Directory File Path is a relative string, the SCP will write the data files starting at the installation directory. For instance if the string scp/solr_data is provided, then the data directory would be at <DISTRIBUTION_INSTALLATION_DIRECTORY>/scp/solr_data

    If Data Directory File Path is /solr_data in Windows, the Solr Catalog Provider will write the data files starting at the beginning of the drive such as C:/solr_data.

    It is recommended to use an absolute filepath to minimize confusion such as /opt/solr_data in Linux or C:/solr_data in Windows. Permissions are necessary to write to the directory.

Title: Force Auto Commit
  Property: forceAutoCommit
  Type: Boolean / Checkbox
  Default Value: (not listed)
  Required: No
  Description:
    WARNING: Performance Impact. Only in special cases should auto-commit be forced. Forcing auto-commit makes the search results visible immediately.

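
The path rules described for Data Directory File Path can be illustrated with a small sketch. This is not the SCP's actual code: the function name resolve_data_directory and the constant DEFAULT_SUBDIR are hypothetical, and the sketch uses POSIX path semantics only, so the Windows drive behavior described above is not modeled.

```python
from pathlib import PurePosixPath

# Hypothetical constant: the documented default location, relative to the
# distribution installation directory.
DEFAULT_SUBDIR = PurePosixPath("data/solr")


def resolve_data_directory(install_dir, data_directory_path):
    """Return the directory that would hold the Solr data files.

    Assumed rules, taken from the property description:
      * blank value    -> <install_dir>/data/solr
      * relative value -> resolved against <install_dir>
      * absolute value -> used as given
    """
    install = PurePosixPath(install_dir)
    value = (data_directory_path or "").strip()
    if not value:
        return install / DEFAULT_SUBDIR
    candidate = PurePosixPath(value)
    if candidate.is_absolute():
        return candidate
    return install / candidate


if __name__ == "__main__":
    for configured in ("", "scp/solr_data", "/opt/solr_data"):
        print(repr(configured), "->", resolve_data_directory("/opt/ddf", configured))
```

An absolute path resolves the same way no matter where the distribution is installed, which is the reason the description recommends one.
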
Solr Configuration Files

The Apache Solr product has configuration files that customize the behavior of the Solr Server. These files can be found at <DISTRIBUTION_INSTALLATION_DIRECTORY>/etc/solr. Take care when editing these files because they directly affect the functionality and performance of the Solr Catalog Provider. A restart of the distribution is necessary for changes to take effect.

Note on Solr Configuration File Changes

Solr Configuration files should not be changed in most cases. Changes to the schema.xml will most likely need code changes within the Solr Catalog Provider.

Moving Solr Data to a New Location

If SCP has been installed for the first time, then (1) changing the Data Directory File Path property and (2) restarting the distribution is all that is necessary, because no data has been written into Solr yet. However, if the location must change after data has already been ingested into a previous location, the following steps are required:

  1. Change the Data Directory File Path property within the Catalog Embedded Solr Catalog Provider Configuration in the Admin Console to the desired future location of the Solr data files.
  2. Shut down the distribution.
  3. Find the future location on the drive. If the future location does not exist, create the directories.
  4. Find the location of the current Solr data files and copy all the directories in that location to the future location. For instance, if the previous Solr data files existed at C:/solr_data and it is necessary to move them to C:/solr_data_new, then copy all directories within C:/solr_data into C:/solr_data_new. Usually this consists of copying the index and tlog directories into the new data directory.
  5. Start the distribution. SCP should recognize the index files and be able to query them as it could before.
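
The file-copy portion of the steps above (creating the future location and copying the directories into it) could be scripted roughly as follows. This is a minimal sketch to be run only while the distribution is shut down; the function name copy_solr_data is hypothetical, and changing the property in the Admin Console and restarting the distribution remain manual steps.

```python
import shutil
from pathlib import Path


def copy_solr_data(current_dir, future_dir):
    """Copy every directory under current_dir into future_dir.

    Returns the sorted names of the directories that were copied
    (typically "index" and "tlog").
    """
    current = Path(current_dir)
    future = Path(future_dir)

    # Create the future location if it does not exist yet.
    future.mkdir(parents=True, exist_ok=True)

    copied = []
    for entry in sorted(current.iterdir()):
        if entry.is_dir():
            # dirs_exist_ok (Python 3.8+) allows re-running the copy.
            shutil.copytree(entry, future / entry.name, dirs_exist_ok=True)
            copied.append(entry.name)
    return copied


if __name__ == "__main__":
    import sys

    # Usage: python copy_solr_data.py <current location> <future location>
    if len(sys.argv) == 3:
        print(copy_solr_data(sys.argv[1], sys.argv[2]))
```

The sketch copies rather than moves, so the previous location stays intact until the restarted SCP has been verified against the new one.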

Note: Changes Require a Distribution Restart

If the Data Directory File Path property is changed, no changes will occur to the SCP until the distribution has been restarted.

Handy Tip

If the Data Directory File Path property is changed to a new directory and the previous data is not moved into that directory, then no data will be in Solr; Solr will create an empty index. It is therefore possible to keep Solr files in multiple places and toggle between those locations for different sets of data.

Solr Catalog Provider for External Solr

In order for the external Solr Catalog Provider to work, it must be pointed at the external Solr Server. When the catalog-solr-external-provider feature is installed, it is in an unconfigured state until the user provides an HTTP URL to the external Solr Server. The configurable properties for this SCP are accessed from the Catalog External Solr Catalog Provider Configurations in the Admin Console.

Configurable Properties
Title: HTTP URL
  Property: url
  Type: String
  Default Value: http://localhost:8181/solr
  Required: Yes
  Description:
    HTTP URL of the standalone, preconfigured Solr 4.x Server.

Title: Force Auto Commit
  Property: forceAutoCommit
  Type: Boolean / Checkbox
  Default Value: Unchecked/False
  Required: No
  Description:
    WARNING: Performance Impact. Only in special cases should auto-commit be forced. Forcing auto-commit makes the search results visible immediately.
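
Before entering the HTTP URL in the Admin Console, it can help to confirm that the value is well formed. The sketch below is illustrative only: the function names normalize_solr_url and ping_url are hypothetical, and the /admin/ping path assumes the external Solr Server has a ping request handler configured at that location, which this document does not state.

```python
from urllib.parse import urlsplit


def normalize_solr_url(url):
    """Validate an HTTP(S) URL and strip any trailing slash."""
    cleaned = url.strip()
    parts = urlsplit(cleaned)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("expected an http(s) URL, got: %r" % url)
    return cleaned.rstrip("/")


def ping_url(base_url):
    """Build a health-check URL (assumes a ping handler at /admin/ping)."""
    return normalize_solr_url(base_url) + "/admin/ping?wt=json"


if __name__ == "__main__":
    # The default value from the table above.
    print(ping_url("http://localhost:8181/solr"))
```

Opening the resulting URL in a browser or an HTTP client is one way to check that the server is reachable from the machine running the distribution, subject to the assumption above.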

Implementation Details

Indexing Text

When storing fields, the Solr Catalog Provider analyzes and tokenizes the text values of STRING_TYPE and XML_TYPE AttributeTypes. Fields of these types are indexed in at least three ways: in raw form, analyzed with case sensitivity, and analyzed without regard to case. For XML, the Solr Catalog Provider analyzes and tokenizes XML CDATA sections, XML element text values, and XML attribute values.
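
The three index forms can be pictured with a simplified sketch. It is an illustration, not the SCP's implementation: the real analysis is performed by Solr analyzers defined in the Solr configuration files, the regular-expression tokenizer here is a stand-in, and the function names and dictionary keys are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def index_forms(value):
    """Return the three illustrative forms of one text value."""
    tokens = re.findall(r"\w+", value)  # stand-in for a real Solr tokenizer
    return {
        "raw": value,
        "tokenized_case_sensitive": tokens,
        "tokenized_case_insensitive": [token.lower() for token in tokens],
    }


def xml_text_values(xml_string):
    """Collect attribute values, element text, and CDATA text from XML."""
    values = []
    for element in ET.fromstring(xml_string).iter():
        values.extend(element.attrib.values())
        # ElementTree exposes CDATA content as ordinary element text.
        for text in (element.text, element.tail):
            if text and text.strip():
                values.append(text.strip())
    return values


if __name__ == "__main__":
    doc = '<doc id="A1"><title>Hello World</title><body><![CDATA[Raw <b>text</b>]]></body></doc>'
    for value in xml_text_values(doc):
        print(index_forms(value))
```

The case-insensitive form is what lets a search for "hello" match "Hello", while the raw form keeps the exact original value available.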

Known Issues

  • When searching with the ANY_TEXT field, SCP does not search all text fields within the Catalog Provider. Instead, it searches the METADATA field.
  • SCP does not fully support spatial capabilities.
  • SCP does not support ingesting or querying GeometryCollection WKT.
  • SCP does not support crossing the International Date Line or pole wrapping.
  • SCP ignores the following AttributeDescriptor methods: isIndexed, isTokenized, isMultivalued, isStored. Instead, SCP indexes, tokenizes, and stores data based on the AttributeFormat. For example, it stores but does not index all fields labeled as AttributeFormat.BINARY, regardless of user instruction. SCP currently has no multivalue support, even though Solr supports it.
  • SCP has a 1000 nautical mile limit for nearest neighbor queries. If a point is not provided, then the centroid of the shape will be used for distance calculations.
  • SCP does not support full TextPath. Attributes and equality expressions are not currently supported.
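
To get a feel for the 1000 nautical mile nearest neighbor limit, the sketch below checks whether two points fall within that distance. It assumes a great-circle (haversine) distance on a spherical Earth, which may differ from how SCP and Solr compute distance internally. The function names are hypothetical, and centroid calculation for shapes is not covered.

```python
import math

EARTH_RADIUS_KM = 6371.0088         # mean Earth radius (spherical model)
KM_PER_NAUTICAL_MILE = 1.852        # exact by definition
NEAREST_NEIGHBOR_LIMIT_NMI = 1000.0  # limit stated in the known issues


def great_circle_nmi(lat1, lon1, lat2, lon2):
    """Haversine distance between two lat/lon points, in nautical miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    distance_km = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
    return distance_km / KM_PER_NAUTICAL_MILE


def within_nearest_neighbor_limit(lat1, lon1, lat2, lon2):
    """True if the two points are at most 1000 nautical miles apart."""
    return great_circle_nmi(lat1, lon1, lat2, lon2) <= NEAREST_NEIGHBOR_LIMIT_NMI


if __name__ == "__main__":
    # Ten degrees of longitude along the equator is roughly 600 nautical miles.
    print(round(great_circle_nmi(0.0, 0.0, 0.0, 10.0), 1))
    print(within_nearest_neighbor_limit(0.0, 0.0, 0.0, 90.0))
```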