Base Docker image with just the essentials: Hadoop, Hive and Spark.

- Hadoop 3.2.0 in Fully Distributed (Multi-node) Mode
- Hive 3.1.2 with HiveServer2 exposed to the host
- Spark 2.4.5 in YARN mode (Spark Scala, PySpark and SparkR)

Take a look at this repo to see how I use it as part of a Docker Compose cluster.
Hive JDBC port is exposed to the host:

- URI: `jdbc:hive2://localhost:10000`
- Driver: `org.apache.hive.jdbc.HiveDriver` (org.apache.hive:hive-jdbc:3.1.2)
- User and password: unused.
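As a sanity check, here is a minimal Scala sketch of connecting from the host over plain JDBC; the query is just a smoke test, and any JDBC-capable client can use the same URI and driver:

```scala
import java.sql.DriverManager

object HiveJdbcExample {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (org.apache.hive:hive-jdbc:3.1.2 must be on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // User and password are unused by this image, so empty strings are fine
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
    try {
      val stmt = conn.createStatement()
      // SHOW DATABASES is a safe smoke test against a fresh metastore
      val rs = stmt.executeQuery("SHOW DATABASES")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}
```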
Known issues:

- Hadoop 3.2.1 and Hive 3.1.2 are incompatible due to a Guava version mismatch (Hadoop: Guava 27.0, Hive: Guava 19.0). Hive fails with `java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)`.
- Spark 2.4.4 cannot use Hive versions higher than 1.2.2 as a SparkSQL engine because of this bug: "Spark need to support reading data from Hive 2.0.0 metastore" and the associated issue "Dealing with TimeVars removed in Hive 2.x". Trying to make it happen results in this exception: `java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT`. When this is fixed in Spark 3.0, it will be able to use Hive as a backend for SparkSQL. Alternatively, you can try to downgrade Hive :)
- Dockerfile linting: `docker run --rm -i hadolint/hadolint < Dockerfile`
TODO:

- Trim the fat from the Docker image
- Upgrade Spark to 3.0
- When upgraded, enable Spark-Hive integration (see the sketch below).
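Until then, this is a minimal sketch of what that integration would look like from Spark Scala, assuming a Spark build with Hive support and `hive-site.xml` on Spark's classpath; the table name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SparkHiveExample {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() wires SparkSQL to the Hive metastore configured in
    // hive-site.xml instead of the default in-memory catalog
    val spark = SparkSession.builder()
      .appName("spark-hive-integration")
      .enableHiveSupport()
      .getOrCreate()

    // Tables created through HiveServer2 become visible to SparkSQL and vice versa
    spark.sql("SHOW DATABASES").show()
    spark.sql("SELECT * FROM default.some_table LIMIT 10").show() // hypothetical table

    spark.stop()
  }
}
```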