Spark ETL Sample Project

The goal of this project is to perform an extract, transform and load (ETL) process to migrate data into a local Apache Spark cluster.

Language: Python
Technologies: Spark, GPG Encryption

ETL Process

Decrypt local GPG-encrypted CSV files
Load CSV tabular data into Spark DataFrames
Save DataFrame data to Parquet files
Write query to determine average age
Write query to determine age at the 75th percentile
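
The steps above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's reference solution: it assumes the `gpg` command-line tool is installed and the private key from the repo has already been imported, that the CSV has a header row with a numeric column named `age`, and that the file names, the `people` view name, and the helper functions are hypothetical. PySpark is imported lazily so the helpers can be inspected without a Spark install.

```python
import subprocess
from pathlib import Path

# Assumption: the decrypted CSV exposes a numeric column called `age`.
AVG_AGE_SQL = "SELECT AVG(age) AS avg_age FROM people"
# percentile_approx is Spark SQL's approximate percentile aggregate.
P75_AGE_SQL = "SELECT percentile_approx(age, 0.75) AS age_p75 FROM people"


def gpg_decrypt_command(encrypted: Path, decrypted: Path) -> list:
    """Build the gpg CLI call that decrypts one file (key already imported)."""
    return ["gpg", "--batch", "--yes",
            "--output", str(decrypted), "--decrypt", str(encrypted)]


def decrypt_file(encrypted: Path, decrypted: Path) -> Path:
    """Step 1: decrypt a GPG-encrypted CSV; raises if gpg exits non-zero."""
    subprocess.run(gpg_decrypt_command(encrypted, decrypted), check=True)
    return decrypted


def run_etl(encrypted_csv: Path, work_dir: Path) -> tuple:
    """Run steps 1-5 and return (average age, age at the 75th percentile)."""
    from pyspark.sql import SparkSession  # lazy import: needs pyspark installed

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("spark-etl-sample")
             .getOrCreate())
    try:
        csv_path = decrypt_file(encrypted_csv, work_dir / "people.csv")

        # Step 2: load the CSV into a DataFrame, inferring column types.
        df = spark.read.csv(str(csv_path), header=True, inferSchema=True)

        # Step 3: save the DataFrame as Parquet.
        parquet_path = str(work_dir / "people.parquet")
        df.write.mode("overwrite").parquet(parquet_path)

        # Steps 4-5: query the Parquet data through a temporary view.
        spark.read.parquet(parquet_path).createOrReplaceTempView("people")
        avg_age = spark.sql(AVG_AGE_SQL).first()["avg_age"]
        age_p75 = spark.sql(P75_AGE_SQL).first()["age_p75"]
        return avg_age, age_p75
    finally:
        spark.stop()
```

A call such as `run_etl(Path("data/people.csv.gpg"), Path("out"))` would then return both figures, where both paths are placeholders. If an exact rather than approximate percentile is wanted, Spark SQL's `percentile(age, 0.75)` is one alternative.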

Approach

Ask questions to get clarification
Install Apache Spark (note: if setting up Spark locally proves too troublesome, you may use DuckDB instead)
Write code in Python using data files and GPG keys stored in this repo
Commit code to your repo and share the link
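
If the DuckDB fallback mentioned in the approach is used, the two age queries might look something like the sketch below. It assumes the `duckdb` Python package is installed, that the CSV has already been decrypted, and that it contains a numeric `age` column; the function names and the example path are hypothetical.

```python
def duckdb_age_queries(csv_path: str) -> dict:
    """Build the two age queries against a decrypted CSV file."""
    # Escape single quotes so the path is a valid SQL string literal.
    source = "read_csv_auto('" + csv_path.replace("'", "''") + "')"
    return {
        "avg_age": f"SELECT avg(age) FROM {source}",
        # quantile_cont interpolates between values for the 75th percentile.
        "age_p75": f"SELECT quantile_cont(age, 0.75) FROM {source}",
    }


def run_with_duckdb(csv_path: str) -> dict:
    """Execute both queries in an in-memory DuckDB database."""
    import duckdb  # lazy import: needs the duckdb package installed

    con = duckdb.connect()
    try:
        return {name: con.execute(sql).fetchone()[0]
                for name, sql in duckdb_age_queries(csv_path).items()}
    finally:
        con.close()
```

For example, `run_with_duckdb("out/people.csv")` would return a dictionary with the keys `avg_age` and `age_p75`.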