This project includes tools to build a Hive SerDe to index Hive tables to Solr.
-
Index Hive table data to Solr.
-
Read Solr index data to a Hive table.
-
Kerberos support for securing communication between Hive and Solr.
-
As of v2.2.4 of the SerDe, integration with Lucidworks Fusion is supported.
-
Fusion’s index pipelines can be used to index data to Fusion.
-
Fusion’s query pipelines can be used to query Fusion’s Solr instance for data to insert into a Hive table.
-
Note
|
This SerDe does not yet officially support Hive 2.x. Additionally, it should only be used with Solr 5.0 and higher. |
This project has a dependency on the solr-hadoop-common
submodule (contained in a separate GitHub repository, https://github.com/lucidworks/solr-hadoop-common). This submodule must be initialized before building the SerDe .jar.
Warning
|
You must use Java 8 or higher to build the .jar. |
To initialize the submodule, pull this repo, then:
git submodule init
Once the submodule has been initialized, the command git submodule update
will fetch all the data from that project and check out the appropriate commit listed in the superproject. You must initialize and update the submodule before attempting to build the SerDe jar.
-
if a build is happening from a branch, please make sure that
solr-hadoop-common
is pointing to the correct SHA. (see https://github.com/blog/2104-working-with-submodules for more details)
hive-solr $ git checkout <branch-name> hive-solr $ cd solr-hadoop-common hive-solr/solr-hadoop-common $ git checkout <SHA> hive-solr/solr-hadoop-common $ cd ..
The build uses Gradle. However, you do not need to have Gradle already installed before attempting to build.
To build the jar files, run this command:
./gradlew clean shadowJar --info
This will build a single .jar file, solr-hive-serde/build/libs/solr-hive-serde-2.6.3.jar
, which can be used with Hive v0.14, v0.15, and v1.2.1. Other Hive versions may work with this jar, but have not been tested.
If GitHub and SSH are not configured the following exception will be thrown:
Cloning into 'solr-hadoop-common'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of '[email protected]:LucidWorks/solr-hadoop-common.git' into submodule path 'solr-hadoop-common' failed
To fix this error, use the generating an SSH key tutorial.
In order for Hive to work with Solr, the Hive SerDe jar must be added as a plugin to Hive.
From a Hive prompt, use the ADD JAR
command and reference the path and filename of the SerDe jar for your Hive version.
hive> ADD JAR solr-hive-serde-2.6.3.jar;
This can also be done in your Hive command to create the table, as in the example below.
Indexing data to Solr or Fusion requires creating a Hive external table. An external table allows the data in the table to be used (read or write) by another system or application outside of Hive.
For integration with Solr, the external table allows you to have Solr read from and write to Hive.
To create an external table for Solr, you can use a command similar to below. The properties available are described after the example.
hive> CREATE EXTERNAL TABLE solr (id string, field1 string, field2 int)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.server.url' = 'http://localhost:8888/solr',
'solr.collection' = 'collection1',
'solr.query' = '*:*');
In this example, we have created an external table named "solr", and defined a custom storage handler (STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
).
The LOCATION indicates the location in HDFS where the table data will be stored. In this example, we have chosen to use /tmp/solr
.
In the section TBLPROPERTIES, we define several properties for Solr so the data can be indexed to the right Solr installation and collection:
solr.zkhost
-
The location of the ZooKeeper quorum if using LucidWorks in SolrCloud mode. If this property is set along with the
solr.server.url
property, thesolr.server.url
property will take precedence. solr.server.url
-
The location of the Solr instance if not using LucidWorks in SolrCloud mode. If this property is set along with the
solr.zkhost
property, this property will take precedence. solr.collection
-
The Solr collection for this table. If not defined, an exception will be thrown.
solr.query
-
The specific Solr query to execute to read this table. If not defined, a default of
*:*
will be used. This property is not needed when loading data to a table, but is needed when defining the table so Hive can later read the table. lww.commit.on.close
-
If true, inserts will be automatically committed when the connection is closed. True is the default.
lww.jaas.file
-
Used only when indexing to or reading from a Solr cluster secured with Kerberos.
This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Solr and Hive.
The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:
Client { --(1) com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="/data/solr-indexer.keytab" --(2) storeKey=true useTicketCache=true debug=true principal="[email protected]"; --(3) };
-
The name of this section of the JAAS file. This name will be used with the
lww.jaas.appname
parameter. -
The location of the keytab file.
-
The service principal name. This should be a different principal than the one used for Solr, but must have access to both Solr and Hive.
-
lww.jaas.appname
-
Used only when indexing to or reading from a Solr cluster secured with Kerberos.
This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
If the table needs to be dropped at a later time, you can use the DROP TABLE command in Hive. This will remove the metadata stored in the table in Hive, but will not modify the underlying data (in this case, the Solr index).
If you use Lucidworks Fusion, you can index data from Hive to Solr via Fusion’s index pipelines. These pipelines allow you several options for further transforming your data.
Tip
|
If you are using Fusion v3.0.x, you already have the Hive SerDe in Fusion’s If you are using Fusion 3.1.x, you will need to download the Hive SerDe from http://lucidworks.com/connectors/. Choose the proper Hadoop distribution and the resulting .zip file will include the Hive SerDe. A 2.2.4 or higher jar built from this repository will also work with Fusion 2.4.x releases. |
This is an example Hive command to create an external table to index documents in Fusion and to query the table later.
hive> CREATE EXTERNAL TABLE fusion (id string, field1 string, field2 int)
STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'
LOCATION '/tmp/fusion'
TBLPROPERTIES('fusion.endpoints' = 'http://localhost:8764/api/apollo/index-pipelines/<pipeline>/collections/<collection>/index',
'fusion.fail.on.error' = 'false',
'fusion.buffer.timeoutms' = '1000',
'fusion.batchSize' = '500',
'fusion.realm' = 'KERBEROS',
'fusion.user' = '[email protected]',
'java.security.auth.login.config' = '/path/to/JAAS/file',
'fusion.jaas.appname' = 'FusionClient',
'fusion.query.endpoints' = 'http://localhost:8764/api/apollo/query-pipelines/pipeline-id/collections/collection-id',
'fusion.query' = '*:*');
In this example, we have created an external table named "fusion", and defined a custom storage handler (STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'
).
The LOCATION indicates the location in HDFS where the table data will be stored. In this example, we have chosen to use /tmp/fusion
.
In the section TBLPROPERTIES, we define several properties for Fusion so the data can be indexed to the right Fusion installation and collection:
fusion.endpoints
-
The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection data will be indexed to.
fusion.fail.on.error
-
If
true
, when an error is encountered, such as if a row could not be parsed, indexing will stop. This isfalse
by default. fusion.buffer.timeoutms
-
The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents will be sent to Fusion when either this value or
fusion.batchSize
is met. fusion.batchSize
-
The number of documents to batch before sending the batch to Fusion. The default is 500. Documents will be sent to Fusion when either this value or
fusion.buffer.timeoutms
is met. fusion.realm
-
This is used with
fusion.user
andfusion.password
to authenticate to Fusion for indexing data. Two options are supported,KERBEROS
orNATIVE
.Kerberos authentication is supported with the additional definition of a JAAS file. The properties
java.security.auth.login.config
andfusion.jaas.appname
are used to define the location of the JAAS file and the section of the file to use.Native authentication uses a Fusion-defined username and password. This user must exist in Fusion, and have the proper permissions to index documents.
fusion.user
-
The Fusion username or Kerberos principal to use for authentication to Fusion. If a Fusion username is used (
'fusion.realm' = 'NATIVE'
), thefusion.password
must also be supplied. fusion.password
-
This property is not shown in the example above. The password for the
fusion.user
when thefusion.realm
isNATIVE
. java.security.auth.login.config
-
This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Fusion and Hive.
The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:
Client { --(1) com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="/data/fusion-indexer.keytab" --(2) storeKey=true useTicketCache=true debug=true principal="[email protected]"; --(3) };
-
The name of this section of the JAAS file. This name will be used with the
fusion.jaas.appname
parameter. -
The location of the keytab file.
-
The service principal name. This should be a different principal than the one used for Fusion, but must have access to both Fusion and Hive. This name is used with the
fusion.user
parameter described above.
-
fusion.jaas.appname
-
Used only when indexing to or reading from Fusion when it is secured with Kerberos.
This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
fusion.query.endpoints
-
The full URL to a query pipeline in Fusion. The URL should include the pipeline name and the collection data will be read from. You should also specify the request handler to be used.
If you do not intend to query your Fusion data from Hive, you can skip this parameter.
fusion.query
-
The query to run in Fusion to select records to be read into Hive. This is
*:*
by default, which selects all records in the index.If you do not intend to query your Fusion data from Hive, you can skip this parameter.
Once the table is configured, any syntactically correct Hive query will be able to query the index.
For example, to select three fields named "id", "field1", and "field2" from the "solr" table, you would use a query such as:
hive> SELECT id, field1, field2 FROM solr;
Replace the table name as appropriate to use this example with your data.
To join data from tables, you can make a request such as:
hive> SELECT id, field1, field2 FROM solr left
JOIN sometable right
WHERE left.id = right.id;
And finally, to insert data to a table, simply use the Solr table as the target for the Hive INSERT statement, such as:
hive> INSERT INTO solr
SELECT id, field1, field2 FROM sometable;
Solr includes a small number of sample documents for use when getting started. One of these is a CSV file containing book metadata. This file is found in your Solr installation, at $SOLR_HOME/example/exampledocs/books.csv
.
Using the sample books.csv
file, we can see a detailed example of creating a table, loading data to it, and indexing that data to Solr.
CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; --(1)
LOAD DATA LOCAL INPATH '/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books; --(2)
ADD JAR solr-hive-serde-2.6.3.jar; --(3)
CREATE EXTERNAL TABLE solr (id STRING, cat_s STRING, title_s STRING, price_f FLOAT, in_stock_b BOOLEAN, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) --(4)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' --(5)
LOCATION '/tmp/solr' --(6)
TBLPROPERTIES('solr.zkhost' = 'zknode1:2181,zknode2:2181,zknode3:2181/solr',
'solr.collection' = 'gettingstarted',
'solr.query' = '*:*'), --(7)
'lww.jaas.file' = '/data/jaas-client.conf'; --(8)
INSERT OVERWRITE TABLE solr SELECT b.* FROM books b;
-
Define the table
books
, and provide the field names and field types that will make up the table. -
Load the data from the
books.csv
file. -
Add the
solr-hive-serde-2.6.3.jar
file to Hive. Note the jar name shown here omits the version information which will be included in the jar file you have. If you are using Hive 0.13, you must also use a jar specifically built for 0.13. -
Create an external table named
solr
, and provide the field names and field types that will make up the table. These will be the same field names as in your local Hive table, so we can index all of the same data to Solr. -
Define the custom storage handler provided by the
solr-hive-serde-2.6.3.jar
. -
Define storage location in HDFS.
-
The query to run in Solr to read records from Solr for use in Hive.
-
Define the location of Solr (or ZooKeeper if using SolrCloud), the collection in Solr to index the data to, and the query to use when reading the table. This example also refers to a JAAS configuration file that will be used to authenticate to the Kerberized Solr cluster.
-
Fork this repo i.e. <username|organization>/hadoop-solr, following the fork a repo/ tutorial. Then, clone the forked repo on your local machine:
$ git clone https://github.com/<username|organization>/hadoop-solr.git
-
Configure remotes with the configuring remotes tutorial.
-
Create a new branch:
$ git checkout -b new_branch $ git push origin new_branch
Use the creating branches tutorial to create the branch from GitHub UI if you prefer.
-
Develop on
new_branch
branch only, do not mergenew_branch
to your master. Commit changes tonew_branch
as often as you like:$ git add <filename> $ git commit -m 'commit message'
-
Push your changes to GitHub.
$ git push origin new_branch
-
Repeat the commit & push steps until your development is complete.
-
Before submitting a pull request, fetch upstream changes that were done by other contributors:
$ git fetch upstream
-
And update master locally:
$ git checkout master $ git pull upstream master
-
Merge master branch into
new_branch
in order to avoid conflicts:$ git checkout new_branch $ git merge master
-
If conflicts happen, use the resolving merge conflicts tutorial to fix them:
-
Push master changes to
new_branch
branch$ git push origin new_branch
-
Add jUnits, as appropriate, to test your changes.
-
When all testing is done, use the create a pull request tutorial to submit your change to the repo.
Note
|
Please be sure that your pull request sends only your changes, and no others. Check it using the command:
|