ramonamezquita/pyspark-boilerplate

Write, build and package modular and scalable PySpark jobs

This repo provides a systematic and structured data-processing flow through a single entrypoint (main.py) for all stored jobs.

Running a PySpark Job

Every job module must be located inside src/jobs and can be run via

make build
cd dist 
spark-submit --py-files jobs.zip main.py --job <jobName>
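
Note that the --job flag is consumed by main.py itself, not by Spark. As a minimal sketch, the entrypoint could parse it like this (the argparse handling shown is an assumption about main.py's internals, not necessarily the repo's exact code):

import argparse

# Everything after main.py in the spark-submit call is forwarded to main.py.
parser = argparse.ArgumentParser()
parser.add_argument("--job", required=True, help="name of the job module inside jobs.zip")
args = parser.parse_args()
jobName = args.job  # later used to import jobs.<jobName>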

What does make build do?

...

Third-party dependencies

PySpark provides multiple ways to manage Python package dependencies and make them available inside jobs. Please visit the Python Package Management page for more details.

However, I find the virtualenv approach the most straightforward. Say you have a virtualenv my-env created with python3 -m venv my-env; you can package it and upload it to HDFS with

source my-env/bin/activate
pip3 install venv-pack
venv-pack -o my-env.tar.gz
hdfs dfs -put -f my-env.tar.gz <destination>

where <destination> can be, for example, /shared/python-envs. Then, use the --archives option of spark-submit to make your virtual environment available within your jobs:

spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--archives hdfs:///shared/python-envs/my-env.tar.gz#environment \
--master yarn \
--deploy-mode cluster \
main.py --job <jobName>

Note

Alternatively, --conf spark.yarn.dist.archives=hdfs:///shared/python-envs/my-env.tar.gz#environment can be used instead of the --archives option.

Writing a PySpark Job

PySpark jobs must be Python modules exposing a run(spark: SparkSession, **kwargs) function. The main.py module imports the specified job module with the importlib library and calls its run function. This logic is depicted in the following code:

import importlib

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jobArgs = {...}  # arguments forwarded to the job, elided here
jobName = "my-job"

# Import jobs.<jobName> and delegate to its run() function.
jobModule = importlib.import_module(f"jobs.{jobName}")
jobModule.run(spark=spark, **jobArgs)
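
For reference, a job module satisfying this contract can be as small as the following sketch (the module name and its body are hypothetical):

# src/jobs/my_job.py -- a hypothetical minimal job module
from pyspark.sql import SparkSession


def run(spark: SparkSession, **kwargs) -> None:
    # Receives the shared SparkSession plus any job arguments.
    df = spark.range(10)  # placeholder for a real data source
    df.show()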
