Releases: jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore
Run Spark 3.3.0 locally with remote AWS S3 using Glue Metastore
What and Why?
The only purpose of this fork is to create this release and share the pre-built JARs with the community.
It allowed me to run a local SparkSession connected to AWS Glue, with AWS S3 as the backend storage and Iceberg tables. In essence, I can locally debug jobs that would normally run within an AWS EMR cluster.
How to use it?
I assume you already have the relevant Spark version installed and the SPARK_HOME environment variable set up.
For me, it was Spark 3.3.0, since this is what is used in AWS EMR and I wanted to stay as close to it as possible.
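Before copying anything, it may be worth confirming that SPARK_HOME really points at a Spark installation. The following is only a minimal sketch: `check_spark_home` is a hypothetical helper name, and it assumes the usual layout of a Spark binary distribution (a `jars` directory and `bin/spark-submit`).

```shell
# Hypothetical helper: confirm SPARK_HOME points at a usable Spark installation.
# Assumes the standard binary distribution layout (jars/ and bin/spark-submit).
check_spark_home() {
    if [ -z "${SPARK_HOME:-}" ]; then
        echo "SPARK_HOME is not set" >&2
        return 1
    fi
    if [ ! -d "$SPARK_HOME/jars" ]; then
        echo "no jars directory in $SPARK_HOME" >&2
        return 1
    fi
    if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
        echo "spark-submit not found in $SPARK_HOME/bin" >&2
        return 1
    fi
    echo "SPARK_HOME looks usable: $SPARK_HOME"
}
```

If the check passes, `"$SPARK_HOME/bin/spark-submit" --version` prints the installed Spark version, which you can compare against the version of the JARs you are about to download.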
- Download & unpack the built JARs & copy them to the jars directory in $SPARK_HOME.
cd /tmp
wget https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz
sha512sum -c <(curl -sL https://github.com/jirislav/aws-glue-data-catalog-client-for-apache-hive-metastore/releases/download/spark-3.3.0/spark-3.3.0-jars.tgz.sha512)
cd "$SPARK_HOME/jars"
tar -xf /tmp/spark-3.3.0-jars.tgz
- Make sure to use appropriate settings for Hive and Spark. I suggest you keep this configuration in ~/.config/spark/ and export that path as SPARK_CONF_DIR.
First, let's set up a clean Spark configuration directory.
# Put this into your .bashrc / .zshrc, or export it every time you run Spark
export SPARK_CONF_DIR=~/.config/spark
# Create the directory if it does not exist yet
mkdir -p "$SPARK_CONF_DIR"
cd "$SPARK_CONF_DIR"
Next, put the configuration there:
spark-defaults.conf
cat <<EOF > "$SPARK_CONF_DIR/spark-defaults.conf"
spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.hadoop.hive.metastore.warehouse.dir s3://YOUR_S3_BUCKET/default
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider, com.amazonaws.auth.EnvironmentVariableCredentialsProvider, org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalogImplementation hive
spark.sql.catalog.iceberg org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.iceberg.io-impl org.apache.iceberg.aws.s3.S3FileIO
# You need IAM role to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:YOUR_REGION:1234567890:table/IcebergLockTable
#spark.sql.catalog.iceberg.lock-impl org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
#spark.sql.catalog.iceberg.lock.table IcebergLockTable
spark.sql.catalog.iceberg.warehouse s3://YOUR_S3_BUCKET/iceberg
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.sql.parquet.output.committer.class com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
spark.sql.sources.partitionOverwriteMode dynamic
spark.sql.thriftserver.scheduler.pool fair
spark.sql.ui.explainMode extended
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
EOF
hive-site.xml
cat <<EOF > "$SPARK_CONF_DIR/hive-site.xml"
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>aws.glue.endpoint</name>
<value>https://glue.YOUR_REGION.amazonaws.com</value>
</property>
<property>
<name>aws.glue.region</name>
<value>YOUR_REGION</value>
</property>
<property>
<name>aws.glue.connection-timeout</name>
<value>30000</value>
</property>
<property>
<name>aws.glue.socket-timeout</name>
<value>30000</value>
</property>
<!--
<property>
<name>aws.glue.proxy.host</name>
<value>YOUR_GLUE_PROXY_HOST</value>
</property>
<property>
<name>aws.glue.proxy.port</name>
<value>8888</value>
</property>
-->
<property>
<!-- Setting for Hive2. See https://github.com/awslabs/aws-glue-catalog-sync-agent-for-hive/issues/3 -->
<name>hive.imetastoreclient.factory.class</name>
<value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
<property>
<!-- Setting for Hive3. See https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore -->
<name>hive.metastore.client.factory.class</name>
<value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
<!-- Hive Metastore connection settings -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://FOR_YOUR_METASTORE_URI_SEE_SPARK_UI_ENVIRONMENT_TAB:9083</value>
<description>URI for client to connect to metastore server</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>s3://YOUR_S3_BUCKET/default</value>
<description>Location of default database for the warehouse</description>
</property>
<property>
<name>hive.metastore.connect.retries</name>
<value>15</value>
</property>
<property>
<name>aws.glue.cache.table.enable</name>
<value>true</value>
</property>
<property>
<name>aws.glue.cache.table.size</name>
<value>1000</value>
</property>
<property>
<name>aws.glue.cache.table.ttl-mins</name>
<value>30</value>
</property>
<property>
<name>aws.glue.cache.db.enable</name>
<value>true</value>
</property>
<property>
<name>aws.glue.cache.db.size</name>
<value>1000</value>
</property>
<property>
<name>aws.glue.cache.db.ttl-mins</name>
<value>30</value>
</property>
</configuration>
EOF
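Both files above still contain the YOUR_S3_BUCKET and YOUR_REGION placeholders. One way to fill them in is sketched below. `fill_placeholders` is a hypothetical helper, not part of this release, and it assumes the bucket and region values contain no `|` or `&` characters (true for valid bucket names and region codes). The remaining placeholders, such as the metastore URI and the optional proxy host, still have to be edited by hand.

```shell
# Hypothetical helper: substitute YOUR_S3_BUCKET and YOUR_REGION in the two
# configuration files. Usage: fill_placeholders BUCKET REGION [CONF_DIR]
fill_placeholders() {
    bucket=$1
    region=$2
    conf_dir=${3:-$SPARK_CONF_DIR}
    for f in spark-defaults.conf hive-site.xml; do
        # Skip files that have not been created yet.
        [ -f "$conf_dir/$f" ] || continue
        # Write to a temporary file first; in-place sed flags differ between platforms.
        sed -e "s|YOUR_S3_BUCKET|$bucket|g" -e "s|YOUR_REGION|$region|g" \
            "$conf_dir/$f" > "$conf_dir/$f.tmp" || return 1
        mv "$conf_dir/$f.tmp" "$conf_dir/$f" || return 1
    done
}
```

For example, `fill_placeholders my-bucket eu-west-1` rewrites the files in $SPARK_CONF_DIR, where `my-bucket` and `eu-west-1` are made-up values.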
And you're done! Now, given you have the SPARK_HOME & SPARK_CONF_DIR environment variables set, you can launch Spark locally, with a remote connection to the data on S3. Enjoy! 🎉
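A quick check before the first launch can catch a missing file or a forgotten placeholder. This is only a sketch: `check_spark_conf` is a hypothetical helper, and the launch command in the trailing comment assumes that AWS credentials are available to the process (for instance through environment variables) and that the `spark-sql` CLI shipped with your Spark distribution is usable. Matches inside the commented-out proxy block of hive-site.xml are harmless, so leftover placeholders only produce a warning.

```shell
# Hypothetical helper: check that the configuration directory is complete.
# Usage: check_spark_conf [CONF_DIR]   (defaults to $SPARK_CONF_DIR)
check_spark_conf() {
    conf_dir=${1:-${SPARK_CONF_DIR:-}}
    if [ -z "$conf_dir" ]; then
        echo "SPARK_CONF_DIR is not set" >&2
        return 1
    fi
    for f in spark-defaults.conf hive-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing $conf_dir/$f" >&2
            return 1
        fi
    done
    # List any lines that still contain a YOUR_... placeholder (warning only).
    if grep -n 'YOUR_' "$conf_dir/spark-defaults.conf" "$conf_dir/hive-site.xml" >&2; then
        echo "warning: the placeholders listed above may still need real values" >&2
    fi
    echo "configuration files found in $conf_dir"
}

# Example launch (not executed here); with a working setup this should list
# the databases known to the Glue Data Catalog:
#   check_spark_conf && "$SPARK_HOME/bin/spark-sql" -e 'SHOW DATABASES;'
```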
How to build it yourself?
Anyone can build this release. Here is a step-by-step account of how I built the JARs (I assume you already have git and mvn installed):
- Build the Hive JARs with the Spark patch.
cd /tmp
wget https://issues.apache.org/jira/secure/attachment/12958418/HIVE-12679.branch-2.3.patch
git clone https://github.com/apache/hive.git
cd hive
git checkout branch-2.3
patch -p0 </tmp/HIVE-12679.branch-2.3.patch
mvn clean install -DskipTests
- Build the AWS Glue Catalog Client for the patched Hive Metastore
cd /tmp  # leave the Hive checkout so the clone does not end up nested inside it
git clone https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore.git
cd aws-glue-data-catalog-client-for-apache-hive-metastore/
mvn clean install -DskipTests # This actually failed on Hive3 build, but that's okay since I'm only interested in the build of Spark libraries
- Gather all the (Spark-relevant!) JARs in one place (to later copy them to $SPARK_HOME/jars)
mkdir /tmp/hive-jars
find ~/.m2/repository/org/apache/hive/ -type f -name "*.jar" | grep /2.3 | grep -v -- '-tests' | xargs -I{} cp '{}' /tmp/hive-jars/
find ~/.m2/repository/com/amazonaws/glue/ -type f -name "*.jar" | grep -vE 'shim|-tests' | xargs -I{} cp '{}' /tmp/hive-jars/
find ~/.m2/repository/org/apache/thrift -type f -name "libthrift-0.1*.jar" | xargs -I{} cp '{}' /tmp/hive-jars/
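The packaging commands themselves are not listed here. One possible way to turn the gathered JARs into an archive plus a checksum file that works with `sha512sum -c` is sketched below. `package_jars` is a hypothetical helper and not necessarily how the published archive was produced. It assumes GNU coreutils `sha512sum` is available.

```shell
# Hypothetical helper: pack all JARs from a directory into a .tgz and write a
# matching .sha512 file next to it. Usage: package_jars SRC_DIR OUT_TGZ
package_jars() {
    src_dir=$1
    out=$2
    out_dir=$(dirname "$out")
    out_name=$(basename "$out")
    # Archive the JARs without a parent directory so they unpack straight
    # into whatever directory the archive is extracted in.
    if ! ( cd "$src_dir" && tar -czf - ./*.jar ) > "$out"; then
        rm -f "$out"
        return 1
    fi
    # Record the checksum with a relative file name, so that `sha512sum -c`
    # works from the directory the archive was downloaded to.
    ( cd "$out_dir" && sha512sum "$out_name" > "$out_name.sha512" )
}
```

With the directory from the previous step, this would be called as `package_jars /tmp/hive-jars /tmp/spark-3.3.0-jars.tgz`.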
I have created the release file at this point, which you can download below ⬇.