# data-playground


Set up a cluster of Debian servers with Hadoop and Spark.

## Local cluster setup

To set up a cluster on your dev machine, first install Vagrant and Ansible.

Then, in the project root, run `vagrant up`.

On the first run, this will download and install everything needed for a local 3-node cluster.

Afterwards, running `vagrant up` simply starts the cluster. If you want to run through the setup process again, include the provisioning flag, i.e. `vagrant up --provision`, or run `vagrant provision` after the cluster is up.
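In short, from the project root on the host:

```sh
vagrant up              # create/start the cluster; provisions everything on the first run
vagrant up --provision  # start the cluster and re-run the provisioning
vagrant provision       # re-run the provisioning while the cluster is up
```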

After provisioning, the HDFS web UI will be available at http://192.168.56.10:9870.
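To quickly check from the host that the UI is up (a convenience check, not part of the setup itself):

```sh
curl -sI http://192.168.56.10:9870   # an HTTP response here means the NameNode web UI is serving
```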

You can then SSH into the master node with `vagrant ssh master` and start the various services as follows.
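A minimal sketch of starting the services from the master node's home directory; the `./hadoop` install path is an assumption, while `./spark/sbin` matches the stop command mentioned below:

```sh
# On the master node, after `vagrant ssh master`:
./hadoop/sbin/start-dfs.sh    # start HDFS (NameNode and DataNodes); install path assumed
./hadoop/sbin/start-yarn.sh   # start YARN; only needed when Spark runs on YARN
./spark/sbin/start-all.sh     # start the Spark standalone master and workers
```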

Note that Spark is configured in standalone mode by default, so it does not depend on YARN. To make Spark run with YARN as the resource manager, change the value of `spark_master` in `vars.yml` to `yarn`, and shut down Spark standalone if it is already running with `./spark/sbin/stop-all.sh`.
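For example (re-provisioning to apply the change is an assumption about how this Ansible setup picks up `vars.yml`):

```sh
# On the master node: stop the standalone Spark daemons if they are running
./spark/sbin/stop-all.sh

# On the host: set `spark_master: yarn` in vars.yml, then re-apply the configuration
vagrant provision
```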

To shut down the local cluster, run `vagrant halt`. To delete all the resources associated with the cluster, run `vagrant destroy`.
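That is:

```sh
vagrant halt     # stop the VMs; their state is kept on disk
vagrant destroy  # delete the VMs and all associated resources
```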

## Syncing files to local master node

By default, Vagrant will sync the project root (where the `Vagrantfile` is located) to `/vagrant` within each cluster VM.

Additionally, a directory named `projects` will be created in the project's root directory, and its contents will be synced to the directory `/projects` on the master node. Place any code and data that you want to access from within the master node there; after you SSH into master via `vagrant ssh master`, you'll find it under `/projects`.
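For example (`wordcount.py` is a hypothetical file name):

```sh
# On the host, from the project root:
mkdir -p projects
cp wordcount.py projects/   # hypothetical script; any code or data goes here
vagrant ssh master

# Inside the master node:
ls /projects                # wordcount.py appears here
```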

Any files in the `projects` folder will be ignored by git.

The `projects` sync can be disabled by removing the relevant section in the `Vagrantfile`; search for 'disable syncing'.

## Access URLs via hostnames

The local cluster uses internally configured hostnames for communication amongst the nodes, and job URLs may contain these hostnames. To access such URLs, you would have to replace the hostname with the corresponding IP as configured in the `Vagrantfile`, for example replacing `master` in a URL with `192.168.56.10`.

However, you can configure your host machine to recognize these hostnames by adding the following lines to the end of your `/etc/hosts` file.

```
# For use with data-playground cluster
192.168.56.10 master
192.168.56.11 worker1
192.168.56.12 worker2
```
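With these entries in place, URLs such as http://master:9870 resolve directly from the host, with no manual IP substitution needed.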