Solr Cross DC is a simple cross-data-center fail-over solution for Apache Solr. It has three key components: the CrossDC Producer, the CrossDC Consumer, and Apache Kafka. The Producer is a Solr UpdateProcessor plugin that forwards updates from the primary data center, while the Consumer is an update request consumer application that receives updates in the backup data center. Kafka is the distributed queue that connects the two.
Apache Solr CrossDC is a robust fail-over solution for Apache Solr, facilitating seamless replication of Solr updates across multiple data centers. Consisting of three integral components (the CrossDC Producer, the CrossDC Consumer, and Apache Kafka), it provides high availability and disaster recovery for your Solr clusters.
- CrossDC Producer: a suite of Solr components responsible for intercepting Solr requests in the source Solr instance, and then sending them to a distributed queue:
  - `MirroringUpdateProcessor` plugin intercepts indexing updates,
  - `MirroringCollectionsHandler` plugin intercepts collection admin requests,
  - `MirroringConfigSetsHandler` plugin intercepts ConfigSet management requests.
- CrossDC Consumer: a request consumer application that pulls mirrored requests from the distributed queue and forwards them to a Solr cluster in the backup data center.
- Apache Kafka: A distributed queue system that links the Producer and Consumer.
Implementing the Solr CrossDC involves the following steps:
- Apache Kafka Cluster: Ensure the availability of an Apache Kafka cluster. This acts as the distributed queue interconnecting data centers.
- CrossDC Solr Plugin: Install this plugin on each node in your Solr cluster (in both primary and backup data centers).
- Configure `solrconfig.xml` to reference the new `MirroringUpdateProcessor` and set it up for the Kafka cluster.
- Optionally configure `solr.xml` to use `MirroringCollectionsHandler` and `MirroringConfigSetsHandler` if necessary.
- CrossDC Consumer Application: Install this application in the backup data center, then configure it for both the Kafka and Solr clusters.
- Define the sharedLib directory in solr.xml and place the CrossDC Producer plugin jar file in this directory.

  solr.xml:
  ```xml
  <solr>
    <str name="sharedLib">${solr.sharedLib:}</str>
    ...
  </solr>
  ```
- Add the new UpdateProcessor in solrconfig.xml:

  ```xml
  <updateRequestProcessorChain name="mirrorUpdateChain" default="true">
    <processor class="org.apache.solr.update.processor.MirroringUpdateRequestProcessorFactory">
      <str name="bootstrapServers">${bootstrapServers:}</str>
      <str name="topicName">${topicName:}</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
  ```
- Add an external version constraint UpdateProcessor to the update chain added to solrconfig.xml to accept user-provided update versions (see the sketch just after this list). See https://solr.apache.org/guide/8_11/update-request-processors.html#general-use-updateprocessorfactories and https://solr.apache.org/docs/8_1_1/solr-core/org/apache/solr/update/processor/DocBasedVersionConstraintsProcessor.html
- Start or restart your Solr clusters.
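As an illustration of the version constraint step above, here is a minimal sketch that adds `DocBasedVersionConstraintsProcessorFactory` ahead of the mirroring processor; the `my_version_l` field name and the exact position in the chain are assumptions to adapt to your schema and setup:

```xml
<updateRequestProcessorChain name="mirrorUpdateChain" default="true">
  <!-- Assumed version field: my_version_l; rejects updates older than what is already indexed -->
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <str name="versionField">my_version_l</str>
  </processor>
  <processor class="org.apache.solr.update.processor.MirroringUpdateRequestProcessorFactory">
    <str name="bootstrapServers">${bootstrapServers:}</str>
    <str name="topicName">${topicName:}</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```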
To mirror collection admin requests, add the following line to solr.xml:
```xml
<solr>
  <str name="collectionsHandler">org.apache.solr.handler.admin.MirroringCollectionsHandler</str>
  ...
</solr>
```
In addition to the general properties that determine distributed queue parameters, this handler supports the following properties:

- `mirror.collections`: comma-separated list of collections for which the admin commands will be mirrored. If this list is empty or the property is not set then admin commands for all collections will be mirrored.
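For illustration, assuming this property is supplied the same way as the other CrossDC properties (for example via the shared `crossdc.properties` file described later), a sketch restricting mirroring to two hypothetical collections:

```properties
# Hypothetical collections; only admin commands for these are mirrored
mirror.collections=collection1,collection2
```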
To mirror ConfigSet management requests, add the following line to solr.xml:
```xml
<solr>
  <str name="configSetsHandler">org.apache.solr.handler.admin.MirroringConfigSetsHandler</str>
  ...
</solr>
```
The required configuration properties are:

- `bootstrapServers`: list of servers used to connect to the Kafka cluster.
- `topicName`: Kafka topic name used to indicate which Kafka queue the Solr updates will be pushed on. This topic must already exist.
Optional configuration properties:

- `batchSizeBytes`: (integer) maximum batch size in bytes for the Kafka queue.
- `bufferMemoryBytes`: (integer) memory allocated by the Producer in total for buffering.
- `lingerMs`: (integer) amount of time that the Producer will wait to add to a batch.
- `requestTimeout`: (integer) request timeout for the Producer.
- `indexUnmirrorableDocs`: (boolean) if set to true, updates that are too large for the Kafka queue will still be indexed on the primary.
- `enableDataCompression`: whether to use compression for data sent over the Kafka queue - can be none (default), gzip, snappy, lz4, or zstd.
- `numRetries`: (integer) setting a value greater than zero will cause the Producer to resend any record whose send fails with a potentially transient error.
- `retryBackoffMs`: (integer) the amount of time to wait before attempting to retry a failed request to a given topic partition.
- `deliveryTimeoutMS`: (integer) updates sent to the Kafka queue will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first.
- `maxRequestSizeBytes`: (integer) the maximum size of a Kafka queue request in bytes - limits the number of requests that will be sent over the queue in a single batch.
- `dlqTopicName`: (string) if not empty then requests that failed processing `maxAttempts` times will be sent to a "dead letter queue" topic in Kafka (must exist if configured).
- `mirrorCommits`: (boolean) if "true" then standalone commit requests will be mirrored, otherwise they will be processed only locally.
- `expandDbq`: (enum) if set to `expand` (default) then Delete-By-Query will be expanded before mirroring into a series of Delete-By-Id, which may help with correct processing of out-of-order requests on the consumer side. If set to `none` then Delete-By-Query requests will be mirrored as-is.
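Putting the Producer-side settings together, a minimal sketch of a properties file (broker addresses and topic name are placeholders; the optional values are illustrative, not recommendations) that could be supplied via the central `crossdc.properties` mechanism described later or as Java system properties on the Solr nodes:

```properties
# Required
bootstrapServers=kafka1:9092,kafka2:9092
topicName=solr-crossdc

# Optional, illustrative values
enableDataCompression=gzip
indexUnmirrorableDocs=false
mirrorCommits=true
```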
- Extract the CrossDC Consumer distribution file into an appropriate location in the backup data center.
- Start the Consumer process using the included start script at bin/crossdc-consumer.
- Configure the CrossDC Consumer with Java system properties using the CROSSDC_CONSUMER_OPTS environment variable.
Currently the following endpoints are exposed (on the local port configured using the `port` property, default is 8090):

- `/metrics` - (GET) returns JSON-formatted metrics describing various aspects of document processing in the Consumer.
- `/threads` - (GET) returns a plain-text thread dump of the JVM running the Consumer application.
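For example, assuming the default port, the endpoints can be queried with a plain HTTP client:

```sh
# JSON metrics about document processing in the Consumer
curl http://localhost:8090/metrics

# Plain-text thread dump of the Consumer JVM
curl http://localhost:8090/threads
```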
The required configuration properties are:

- `bootstrapServers`: list of Kafka bootstrap servers.
- `topicName`: Kafka topic name used to indicate which Kafka queue the Solr updates will be pushed to. This can be a comma separated list for the Consumer if you would like to consume multiple topics.
- `zkConnectString`: Zookeeper connection string used to connect to Solr.
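As a concrete illustration of supplying these via the CROSSDC_CONSUMER_OPTS environment variable mentioned above (broker, topic, and Zookeeper addresses are placeholders):

```sh
# Placeholders: substitute your Kafka brokers, topic, and the backup cluster's Zookeeper
export CROSSDC_CONSUMER_OPTS="-DbootstrapServers=kafka1:9092,kafka2:9092 \
  -DtopicName=solr-crossdc \
  -DzkConnectString=zk1:2181,zk2:2181,zk3:2181/solr"
bin/crossdc-consumer
```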
Optional configuration properties:

- `consumerProcessingThreads`: the number of threads used by the Consumer to concurrently process updates from the Kafka queue.
- `port`: local port for the API endpoints. Default is `8090`.
- `collapseUpdates`: (enum) when set to `all`, all incoming update requests (up to `maxCollapseRecords`) will be collapsed into a single UpdateRequest, as long as their parameters are identical. When set to `partial` (default), only requests without deletions will be collapsed - requests with any delete ops will be sent individually in order to preserve the ordering of updates. When set to `none`, the incoming update requests will be sent individually without any collapsing. NOTE: requests of types other than UPDATE are never collapsed.
- `maxCollapseRecords`: maximum number of incoming update requests to collapse into a single outgoing request. Default is `500`.
Optional configuration properties used when the Consumer must retry by putting updates back on the Kafka queue:

- `batchSizeBytes`: maximum batch size in bytes for the Kafka queue.
- `bufferMemoryBytes`: memory allocated by the Producer in total for buffering.
- `lingerMs`: amount of time that the Producer will wait to add to a batch.
- `requestTimeout`: request timeout for the Producer.
- `maxPollIntervalMs`: the maximum delay between invocations of poll() when using consumer group management.
Manage configuration centrally in Solr's Zookeeper cluster by placing a properties file called `crossdc.properties` in the root Solr Zookeeper znode, e.g. `/solr/crossdc.properties`. The `bootstrapServers` and `topicName` properties can be included in this file. For the Producer plugin, all of the CrossDC configuration properties can be used here. For the CrossDC Consumer application you can also configure all of the CrossDC properties here; however, you will need to set the `zkConnectString` as a system property to allow retrieving the rest of the configuration from Zookeeper.
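As a sketch, one way to upload the file with the Solr CLI (the local path, Zookeeper addresses, and the `/solr` chroot are placeholders to adapt to your cluster):

```sh
# Place crossdc.properties in the root Solr znode (assumed here to be /solr)
bin/solr zk cp file:/path/to/crossdc.properties zk:/crossdc.properties -z zk1:2181,zk2:2181,zk3:2181/solr
```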
To make the CrossDC UpdateProcessor optional in a common `solrconfig.xml`, use the `enabled` attribute. Setting it to `false` turns the processor into a NOOP in the chain.
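A minimal sketch of this, assuming the flag is passed as a boolean init argument named `enabled` and driven by a hypothetical `enableCrossDC` system property:

```xml
<processor class="org.apache.solr.update.processor.MirroringUpdateRequestProcessorFactory">
  <!-- Assumption: 'enabled' is read as an init arg; false turns the processor into a NOOP -->
  <bool name="enabled">${enableCrossDC:false}</bool>
  <str name="bootstrapServers">${bootstrapServers:}</str>
  <str name="topicName">${topicName:}</str>
</processor>
```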
- When the `expandDbq` property is set to `expand` (default), Delete-By-Query is converted to a series of Delete-By-Id, which can be much less efficient for queries matching large numbers of documents. Setting this property to `none` results in forwarding a real Delete-By-Query - this reduces the amount of data to mirror but may cause different results due to the potential re-ordering of failed & re-submitted requests between the Consumer and the target Solr.
- When `collapseUpdates` is set to `all`, multiple requests containing a mix of add and delete ops will be collapsed into a single outgoing request. This causes the original ordering of add / delete ops to be lost (because Solr always processes all add ops in an update request first, and only then the delete ops), which may affect the final outcome when some of the ops refer to the same document ids.