Docker container for Kafka - Spark streaming

This Dockerfile sets up a complete streaming environment for experimenting with Kafka, Spark Streaming (PySpark) and Jupyter. It installs:

- Kafka
- Spark 2.1.1 for Scala 2.11

It additionally installs:

- Anaconda distribution 4.4.0 for Python 3.6
- Jupyter notebook for Python

Quick start-up guide

Note that any changes you make in the notebooks will be lost once you exit the container. To keep them, put your notebooks in a folder on your host that you share with the container, using the '-v' option of 'docker run' (the full command is given in the Run section below).

Note:

The "-v `pwd`:/home/guest/host" option shares the current local folder on your computer (the 'host'), i.e. the folder containing the Dockerfile, ipynb files, etc., with the container, where it appears as '/home/guest/host'.
Ports are shared as follows:
- 4040 bridges to Spark UI
- 8888 bridges to the Jupyter Notebook
- 23 bridges to SSH
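
To make the list above concrete: each mapping is passed to 'docker run' as a '-p HOST_PORT:CONTAINER_PORT' option, as in the run command at the end of this README. The sketch below splits such a mapping into its two halves. 'describe_mapping' is a hypothetical helper name used for illustration only; it is not part of this repository.

```shell
# Describe a docker -p mapping of the form HOST_PORT:CONTAINER_PORT.
describe_mapping() {
  host_port="${1%%:*}"        # text before the colon
  container_port="${1##*:}"   # text after the colon
  printf 'host port %s -> container port %s\n' "$host_port" "$container_port"
}

# The three mappings used by the run command in this README.
for mapping in 4040:4040 8888:8888 23:22; do
  describe_mapping "$mapping"
done
```

The last mapping is the one to watch: SSH listens on port 22 inside the container but is reached through port 23 on the host.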

SSH lets you open a connection to the container:

ssh -p 23 guest@containerIP

where 'containerIP' is the IP of the container (127.0.0.1 on Linux). The password is 'guest'.

Start services

Once the container is running, you are logged in as root. Run startup_script.sh (in /usr/bin) to start:

- SSH server. You can connect to the container using user 'guest' and password 'guest'
- Zookeeper server
- Kafka server

startup_script.sh

Connect, open notebook and start streaming

Connect as user 'guest' and go to the 'host' folder (shared with the host)

su guest

Start Jupyter notebook

notebook

and connect from your browser at host:8888, where 'host' is the IP of your host. If run locally on your computer, this should be 127.0.0.1 or 192.168.99.100 (check the Docker documentation).
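
Which of the two addresses applies depends on how Docker runs on your machine. The following is a minimal sketch, assuming that a Docker Machine setup exports DOCKER_HOST in the form tcp://IP:PORT and that a local Docker daemon leaves it unset. 'docker_host_ip' is a hypothetical helper name, not something provided by this repository.

```shell
# Print the address at which published container ports are reachable.
# Assumption: with Docker Machine, DOCKER_HOST looks like tcp://192.168.99.100:2376;
# in every other case a local Docker daemon on 127.0.0.1 is assumed.
docker_host_ip() {
  case "${DOCKER_HOST:-}" in
    tcp://*)
      addr="${DOCKER_HOST#tcp://}"   # drop the scheme
      printf '%s\n' "${addr%%:*}"    # drop the :port suffix
      ;;
    *)
      printf '127.0.0.1\n'
      ;;
  esac
}

echo "Jupyter:  http://$(docker_host_ip):8888"
echo "Spark UI: http://$(docker_host_ip):4040"
```

If neither address works, the Docker documentation for your platform remains the reference.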

Start Kafka producer

Open kafkaSendDataPy.ipynb and run all cells.

Start Kafka receiver

Open kafkaReceiveAndSaveToCassandraPy.ipynb and run cells up to start streaming. Check in subsequent cells that Cassandra collects data properly.

Connect to Spark UI

The Spark UI is available in your browser at port 4040 of the host.

Container configuration details

The container is based on the CentOS 6 Linux distribution. The main steps of the build process are:

- Install some common Linux tools (wget, unzip, tar, ssh tools, ...) and Java (1.8)
- Create a guest user (the UID is important for sharing folders with the host, see below), and install Spark and sbt, Kafka, Anaconda and Jupyter notebooks for the guest user
- Go back to the root user, and install the startup script (for starting SSH, Zookeeper and Kafka), the setenv.sh script to set up environment variables (JAVA, Kafka, Spark, ...), and spark-defaults.conf

User UID

In the Dockerfile, the line

RUN useradd guest -u 1000

creates the guest user used inside the container. The username is 'guest', with password 'guest', and the '-u' parameter sets the Linux UID for that user.

In order to make sharing of folders easier between the container and your host, make sure this UID matches your user UID on the host. You can see what your host UID is with

echo $UID
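
The comparison can also be scripted. The following is a minimal sketch, assuming the Dockerfile contains a line of exactly the form 'RUN useradd guest -u 1000' shown above. The function names 'dockerfile_uid' and 'uid_matches' are hypothetical and not part of this repository. It uses 'id -u', which reports the same value as $UID and also works in shells that do not define that variable.

```shell
# Print the UID passed to "useradd guest -u <UID>" in a Dockerfile.
# Assumes the line has the form shown above; prints nothing if it is absent.
dockerfile_uid() {
  sed -n 's/^RUN useradd guest -u \([0-9][0-9]*\).*$/\1/p' "$1" | head -n 1
}

# Succeed when the UID in the Dockerfile equals the given host UID.
uid_matches() {
  [ "$(dockerfile_uid "$1")" = "$2" ]
}

# Example: compare against the current user when a Dockerfile is present.
if [ -f Dockerfile ]; then
  if uid_matches Dockerfile "$(id -u)"; then
    echo "UIDs match"
  else
    echo "Set the -u value in the Dockerfile to $(id -u) before building"
  fi
fi
```

If the UIDs differ, edit the '-u' value and rebuild the image, since the user is created at build time.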

Building and running the container from scratch

Clone this repository

git clone https://github.com/Yannael/kafka-sparkStreaming-jupyter-notebook

Build

From the folder containing the Dockerfile, run

docker build -t jupyter_spark_kafka .

'jupyter_spark_kafka' is the tag name given to the image. You can use any other name in lieu of 'jupyter_spark_kafka', as long as you use the same tag in both the build and the run commands.

It may take about 30 minutes to complete.

Run

docker run -v `pwd`:/home/guest/host -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged jupyter_spark_kafka
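
One way to keep the tag identical in both commands is to hold it in a single variable. The following is a minimal sketch: IMAGE_TAG, 'build_cmd' and 'run_cmd' are hypothetical names chosen for illustration, while the docker options are copied from the commands above.

```shell
# Image tag shared by build and run; any name works as long as both agree.
IMAGE_TAG="${IMAGE_TAG:-jupyter_spark_kafka}"

# Print the build command for the given tag.
build_cmd() {
  printf 'docker build -t %s .\n' "$1"
}

# Print the run command for the given tag ($1), sharing the given host folder ($2).
run_cmd() {
  printf 'docker run -v %s:/home/guest/host -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged %s\n' "$2" "$1"
}

# Print both commands for the current folder without executing them.
build_cmd "$IMAGE_TAG"
run_cmd "$IMAGE_TAG" "$(pwd)"
```

Each function only prints its command, so nothing runs until you paste the output into a shell. A host path containing spaces would need quoting before doing so.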
