Management Scripts for Standalone Workers

The sbin/start-slave.sh script starts a Spark worker (aka slave) on the machine on which it is executed. It launches SPARK_WORKER_INSTANCES worker instances.

./sbin/start-slave.sh [masterURL]

The mandatory masterURL parameter has the form spark://hostname:port, e.g. spark://localhost:7077. You can also specify a comma-separated list of masters in the form spark://hostname1:port1,hostname2:port2,…, where each element is a hostname:port pair.
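The multi-master form described above can be assembled from individual hostname:port pairs. The following is a minimal sketch; build_master_url is a hypothetical helper that is not part of Spark, and the host names in the usage comment are placeholders.

```shell
# Hypothetical helper (not part of Spark): join hostname:port pairs
# into a single spark:// master URL.
build_master_url() {
  # "$*" joins the arguments with the first character of IFS (a comma here).
  local IFS=,
  printf 'spark://%s\n' "$*"
}

# Usage with placeholder host names:
#   ./sbin/start-slave.sh "$(build_master_url master1:7077 master2:7077)"
```

The spark:// prefix appears only once, in front of the first pair, which matches the form given above.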

Internally, the script starts the sparkWorker RPC environment.

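The order of importance of Spark configuration settings is as follows (from the least to the most important): system environment variables, command-line options, and Spark properties. Each is described in its own section below.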

System environment variables

The script uses the following system environment variables (directly or indirectly):

  • SPARK_WORKER_INSTANCES (default: 1) - the number of worker instances to run on this slave.

  • SPARK_WORKER_PORT - the base port number to listen on for the first worker. If set, subsequent workers will increment this number. If unset, Spark will pick a random port.

  • SPARK_WORKER_WEBUI_PORT (default: 8081) - the base port for the web UI of the first worker. Subsequent workers will increment this number. If the port is used, the successive ports are tried until a free one is found.

  • SPARK_WORKER_CORES - the number of cores a single worker instance can use

  • SPARK_WORKER_MEMORY (default: 1G) - the amount of memory to use, e.g. 1000M, 2G

  • SPARK_WORKER_DIR (default: $SPARK_HOME/work) - the directory to run apps in
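The port arithmetic described for SPARK_WORKER_PORT and SPARK_WORKER_WEBUI_PORT can be illustrated with a short sketch. The worker_ports function is a hypothetical helper written for this illustration and is not the script's actual code. It prints only the first candidate web UI port for each instance and does not model the retry on a busy port.

```shell
# Hypothetical helper (not part of Spark): print the ports each worker
# instance would start with, given the environment variables above.
worker_ports() {
  local instances="${SPARK_WORKER_INSTANCES:-1}"
  local webui_base="${SPARK_WORKER_WEBUI_PORT:-8081}"
  local i=0 port
  while [ "$i" -lt "$instances" ]; do
    if [ -n "${SPARK_WORKER_PORT:-}" ]; then
      # Base port is set: every further instance increments it by one.
      port=$((SPARK_WORKER_PORT + i))
    else
      # Base port is unset: Spark picks a random port.
      port=random
    fi
    echo "worker $((i + 1)): port=$port webui=$((webui_base + i))"
    i=$((i + 1))
  done
}
```

With SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_PORT=7078, the sketch reports ports 7078 and 7079 with web UI ports 8081 and 8082.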

The script uses the following helper scripts:

  • sbin/spark-config.sh

  • bin/load-spark-env.sh

Command-line Options

You can use the following command-line options:

  • --host or -h - the hostname under which the worker is available.

  • --port or -p - command-line version of SPARK_WORKER_PORT environment variable.

  • --cores or -c (default: the number of processors available to the JVM) - command-line version of SPARK_WORKER_CORES environment variable.

  • --memory or -m - command-line version of SPARK_WORKER_MEMORY environment variable.

  • --work-dir or -d - command-line version of SPARK_WORKER_DIR environment variable.

  • --webui-port - command-line version of SPARK_WORKER_WEBUI_PORT environment variable.

  • --properties-file (default: conf/spark-defaults.conf) - the path to a custom Spark properties file

  • --help - prints the usage message and exits.
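The correspondence between the environment variables and the command-line options listed above can be sketched as a helper that prints the equivalent options for whatever variables are currently set. The worker_options function is hypothetical and exists only for this illustration. It uses the long option names from the list above.

```shell
# Hypothetical helper (not part of Spark): print the command-line options
# that correspond to the SPARK_WORKER_* variables currently set.
worker_options() {
  local opts=""
  if [ -n "${SPARK_WORKER_PORT:-}" ]; then opts="$opts --port $SPARK_WORKER_PORT"; fi
  if [ -n "${SPARK_WORKER_CORES:-}" ]; then opts="$opts --cores $SPARK_WORKER_CORES"; fi
  if [ -n "${SPARK_WORKER_MEMORY:-}" ]; then opts="$opts --memory $SPARK_WORKER_MEMORY"; fi
  if [ -n "${SPARK_WORKER_DIR:-}" ]; then opts="$opts --work-dir $SPARK_WORKER_DIR"; fi
  if [ -n "${SPARK_WORKER_WEBUI_PORT:-}" ]; then opts="$opts --webui-port $SPARK_WORKER_WEBUI_PORT"; fi
  # Drop the leading space left by the first appended option.
  echo "${opts# }"
}

# Usage (the master URL is a placeholder):
#   ./sbin/start-slave.sh spark://localhost:7077 $(worker_options)
```

Unset variables produce no option, so the worker falls back to the defaults noted in the lists above.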

Spark properties

After loading the default SparkConf, if --properties-file or SPARK_WORKER_OPTS define spark.worker.ui.port, the value of the property is used as the port of the worker’s web UI.

SPARK_WORKER_OPTS=-Dspark.worker.ui.port=21212 ./sbin/start-slave.sh spark://localhost:7077

or

$ cat worker.properties
spark.worker.ui.port=33333

$ ./sbin/start-slave.sh spark://localhost:7077 --properties-file worker.properties

sbin/spark-daemon.sh

Ultimately, the script calls sbin/spark-daemon.sh start to kick off org.apache.spark.deploy.worker.Worker with --webui-port, --port and the master URL.
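The shape of that final call can be previewed with a dry-run sketch that prints a command line instead of executing it. The print_worker_command function is a hypothetical helper. The instance number placed after the class name is an assumption based on the usage of spark-daemon.sh and is not stated in the text above.

```shell
# Hypothetical dry-run helper (not part of Spark): print the command line
# rather than running it.
# Arguments: instance number, web UI port, worker port (may be empty), master URL.
print_worker_command() {
  local num="$1" webui_port="$2" port="$3" master="$4"
  local cmd="sbin/spark-daemon.sh start org.apache.spark.deploy.worker.Worker"
  # Assumption: spark-daemon.sh takes an instance number after the class name.
  cmd="$cmd $num --webui-port $webui_port"
  if [ -n "$port" ]; then
    # --port is passed only when a worker port was configured.
    cmd="$cmd --port $port"
  fi
  echo "$cmd $master"
}
```

Calling it with an empty third argument leaves out --port, which corresponds to letting Spark choose a random worker port.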

Internals of org.apache.spark.deploy.worker.Worker

Upon starting, a Spark worker creates the default SparkConf.

It parses command-line arguments for the worker using the WorkerArguments class. The following environment variables affect the host name the worker uses:

  • SPARK_LOCAL_HOSTNAME - custom host name

  • SPARK_LOCAL_IP - custom IP to use (when SPARK_LOCAL_HOSTNAME is not set or hostname resolves to incorrect IP)

It starts the sparkWorker RPC environment and waits until the RpcEnv terminates.

RPC environment

The org.apache.spark.deploy.worker.Worker class starts its own sparkWorker RPC environment with a Worker endpoint.

sbin/start-slaves.sh script starts slave instances

The ./sbin/start-slaves.sh script starts slave instances on each machine specified in the conf/slaves file.
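To see which machines would be targeted, the host list can be read from the file directly. The following sketch assumes one host per line, with # starting a comment. list_slave_hosts is a hypothetical helper and does not start anything on the listed machines.

```shell
# Hypothetical helper (not part of Spark): print one host per line from a
# slaves file, dropping comments, trailing whitespace and blank lines.
# The file path defaults to conf/slaves.
list_slave_hosts() {
  sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "${1:-conf/slaves}"
}

# Usage:
#   list_slave_hosts conf/slaves
```

Checking the output before running ./sbin/start-slaves.sh is a cheap way to catch a stray entry in the file.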

It supports starting Tachyon with the --with-tachyon command-line option. It assumes that the tachyon/bin/tachyon command is available in Spark’s home directory.

The script uses the following helper scripts:

  • sbin/spark-config.sh

  • bin/load-spark-env.sh

  • conf/spark-env.sh

The script uses the following environment variables (and sets them when unavailable):

  • SPARK_PREFIX

  • SPARK_HOME

  • SPARK_CONF_DIR

  • SPARK_MASTER_PORT

  • SPARK_MASTER_IP

The following command will launch 3 worker instances on each node. Each worker instance will use two cores.

SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./sbin/start-slaves.sh