Base Docker image with just the essentials: Hadoop, Hive and Spark.

- Hadoop 3.2.0 in Fully Distributed (Multi-node) Mode
- Hive 3.1.2 with HiveServer2 exposed to the host
- Spark 2.4.5 in YARN mode (Spark Scala, PySpark and SparkR)

Take a look at this repo to see how I use it as part of a Docker Compose cluster.
Hive JDBC port is exposed to the host:

- URI: `jdbc:hive2://localhost:10000`
- Driver: `org.apache.hive.jdbc.HiveDriver` (org.apache.hive:hive-jdbc:3.1.2)
- User and password: unused.
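As a sanity check, here is a minimal Scala sketch of connecting from the host over plain JDBC; the query is just a smoke test, and any JDBC-capable client can use the same URI and driver:

```scala
import java.sql.DriverManager

object HiveJdbcExample {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (org.apache.hive:hive-jdbc:3.1.2 must be on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // User and password are unused by this image, so empty strings are fine
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
    try {
      val stmt = conn.createStatement()
      // SHOW DATABASES is a safe smoke test against a fresh metastore
      val rs = stmt.executeQuery("SHOW DATABASES")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}
```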
Known issues:

- Hadoop 3.2.1 and Hive 3.1.2 are incompatible due to a Guava version mismatch (Hadoop: Guava 27.0, Hive: Guava 19.0). Hive fails with `java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)`.
- Spark 2.4.4 cannot use Hive versions higher than 1.2.2 as a SparkSQL engine because of this bug: "Spark need to support reading data from Hive 2.0.0 metastore" and the associated issue "Dealing with TimeVars removed in Hive 2.x". Trying to make it happen results in this exception: `java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT`. When this is fixed in Spark 3.0, it will be able to use Hive as a backend for SparkSQL. Alternatively, you can try to downgrade Hive :)
- Dockerfile linting: `docker run --rm -i hadolint/hadolint < Dockerfile`
TODO:

- Trim the fat from the Docker image
- Upgrade Spark to 3.0
- When upgraded, enable Spark-Hive integration (see the sketch below).
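Until then, this is a minimal sketch of what that integration would look like from Spark Scala, assuming a Spark build with Hive support and `hive-site.xml` on Spark's classpath; the table name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SparkHiveExample {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() wires SparkSQL to the Hive metastore configured in
    // hive-site.xml instead of the default in-memory catalog
    val spark = SparkSession.builder()
      .appName("spark-hive-integration")
      .enableHiveSupport()
      .getOrCreate()

    // Tables created through HiveServer2 become visible to SparkSQL and vice versa
    spark.sql("SHOW DATABASES").show()
    spark.sql("SELECT * FROM default.some_table LIMIT 10").show() // hypothetical table

    spark.stop()
  }
}
```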