This project creates data pipelines using Airflow both locally and on AWS.
The Airflow pipeline runs all the Spark jobs defined in the "src/jobs" directory in parallel.
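A minimal sketch of such a DAG is shown below; the jobs directory path, task ids, schedule, and spark-submit options are assumptions for illustration, not the repo's actual DAG:

```python
# Hypothetical sketch of a DAG that submits every Spark job in src/jobs in parallel.
# The jobs directory path, schedule, and spark-submit options are assumptions.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

JOBS_DIR = "/opt/airflow/src/jobs"  # assumed mount point inside the Airflow container

with DAG(
    dag_id="spark_jobs_parallel",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for job_file in sorted(os.listdir(JOBS_DIR)):
        if job_file.endswith(".py"):
            # Independent tasks with no dependencies between them are scheduled in parallel.
            BashOperator(
                task_id=f"run_{os.path.splitext(job_file)[0]}",
                bash_command=f"spark-submit {os.path.join(JOBS_DIR, job_file)}",
            )
```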
To create the AWS environment, run the ".sh" scripts inside the "aws" directory. These scripts create the four stacks needed for Airflow.
Script "cicd_deploy.sh" creates the CI/CD part of the project. For this, AWS Codepipeline is used that sources from AWS Codecommit and deployes on S3. There is also an event rule that triggers the pipeline every time there's a commit.
Script "code_deploy.sh" creates the AWS Codecommit repository to store the code.
Script "data_deploy.sh" creates the S3 bucket with versioning and with all public access blocked.
Finally, script "template_deploy.sh" creates the core services such as Airflow, Redshift, and different networking services that are needed such as VPC.
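Under the hood, each script boils down to creating a CloudFormation stack from its template and parameter file. For reference, a minimal boto3 equivalent is sketched below; the stack name is an assumption and the actual ".sh" scripts may differ:

```python
# Illustrative boto3 equivalent of what a deploy script does:
# create a CloudFormation stack from a template plus a parameters file.
import json
import boto3

cf = boto3.client("cloudformation")

with open("aws/template.yaml") as f:
    template_body = f.read()
with open("aws/parameters/parameters.json") as f:
    # Assumed format: [{"ParameterKey": ..., "ParameterValue": ...}, ...]
    parameters = json.load(f)

cf.create_stack(
    StackName="AirflowTemplate",            # assumed stack name
    TemplateBody=template_body,
    Parameters=parameters,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # needed when the template creates IAM resources
)
cf.get_waiter("stack_create_complete").wait(StackName="AirflowTemplate")
```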
Once all the stacks have been deployed successfully, a managed Airflow environment will be available on AWS. Open the Airflow UI and trigger the DAG; running the DAG creates an EMR cluster that runs the jobs.
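The EMR provisioning step is typically handled with the EMR operators from the Amazon provider package for Airflow 2.x. The trimmed-down sketch below shows the pattern; the cluster settings, step definition, and S3 job path are placeholders, not the repo's actual configuration:

```python
# Sketch: a DAG that creates an EMR cluster, submits a Spark step, and tears the cluster down.
# Cluster settings and the S3 job path are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

JOB_FLOW_OVERRIDES = {
    "Name": "etl-cluster",
    "ReleaseLabel": "emr-6.2.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

SPARK_STEPS = [
    {
        "Name": "run_etl_job",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://<bucket>/jobs/etl_job.py"],  # placeholder job path
        },
    }
]

with DAG("emr_etl", start_date=datetime(2021, 1, 1), schedule_interval=None, catchup=False) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    add_steps = EmrAddStepsOperator(
        task_id="add_spark_steps",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
    )
    terminate = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
        trigger_rule="all_done",  # clean up even if a step fails
    )
    create_cluster >> add_steps >> terminate
```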
The I94 Immigration dataset contains statistics on international visitor arrivals, split by world region and country, type of visa, mode of transportation, age group, states visited, and port of entry.
The dataset can be found here.
The dataset files in the repo can be found inside the "src/data/sas_data/" directory.
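If the files are in the original sas7bdat format, they can be read in PySpark via the spark-sas7bdat package; this is an assumption about how the jobs read the data, and the package coordinates below may need adjusting for your Spark/Scala version:

```python
# Assumption: the files under src/data/sas_data/ are raw sas7bdat files.
# Reading them requires the saurfang spark-sas7bdat package on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read_i94")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:3.0.0-s_2.12")
    .getOrCreate()
)

i94_df = (
    spark.read
    .format("com.github.saurfang.sas.spark")
    .load("src/data/sas_data/<month>.sas7bdat")  # placeholder file name
)
i94_df.printSchema()
```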
The World Temperature dataset contains data regarding the monthly average temperature by country. This dataset is taken from Kaggle.
This data has to be downloaded and placed inside the "src/data/" directory.
The U.S. City Demographic dataset contains demographic information of US cities and census-designated places with a population greater than or equal to 65,000.
More about the dataset can be found here.
The dataset in the repo can be found at "src/data/us-cities-demographics.csv".
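Both CSV-based datasets can be loaded with the standard PySpark CSV reader. In the sketch below, the temperature file name and the semicolon delimiter for the demographics file are assumptions; adjust them against the actual files in "src/data/":

```python
# Sketch: load the temperature and demographics CSVs with PySpark.
# The temperature file name and the ';' delimiter are assumptions; adjust as needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_csv_datasets").getOrCreate()

# World Temperature data downloaded from Kaggle (file name assumed).
temperature_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("src/data/GlobalLandTemperaturesByCountry.csv")
)

# U.S. City Demographics; the upstream file is commonly semicolon-delimited.
demographics_df = (
    spark.read
    .option("header", True)
    .option("sep", ";")
    .option("inferSchema", True)
    .csv("src/data/us-cities-demographics.csv")
)

demographics_df.show(5)
```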
A simple table of Airport codes and corresponding cities. More about the dataset can be found here.
The dataset in the repo can be found at: "src/data/I94_SAS_Labels_Descriptions.SAS".
To start Airflow locally, run:
bash run-local.sh
The Airflow UI can be accessed at localhost:8080.
To delete all created containers, run:
docker-compose down --volumes --rmi all
The AWS part of this project is split into two parts.
- AWS CloudFormation stacks
The services are created using AWS CloudFormation. There are four templates available that create the necessary stacks. The first three are related to repositories and services that do not cost much and can remain active. The "template.yaml" contains the core services, which should be deleted when experimentation is over.
- Airflow
The Airflow DAG for AWS can be found at "src/aws_dags/". This directory contains the main DAG as well as the helper functions that hold the configuration and the SQL scripts required for Redshift.
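To give an idea of what the Redshift side involves, results written by EMR to S3 are typically loaded with a COPY statement that references the IAM role and bucket configured in "src/aws_dags/helpers/conf.py" (see the steps below). The table name, S3 prefix, file format, and operator used here are assumptions, not the repo's actual scripts, and the Postgres provider package is assumed to be installed:

```python
# Illustrative only: load EMR output from S3 into Redshift from an Airflow task.
# Table name, S3 prefix, file format, and operator choice are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

BUCKET_NAME = "<your-bucket-name>"  # same value as S3BucketName
IAM_REDSHIFT_ROLE = "arn:aws:iam::<AccountNumber>:role/AirflowTemplate-AirflowRole-<id>"

COPY_SQL = f"""
    COPY public.immigration
    FROM 's3://{BUCKET_NAME}/output/immigration/'
    IAM_ROLE '{IAM_REDSHIFT_ROLE}'
    FORMAT AS PARQUET;
"""

with DAG(
    dag_id="redshift_load_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    load_immigration = PostgresOperator(
        task_id="load_immigration",
        postgres_conn_id="redshift_default",  # the connection configured below
        sql=COPY_SQL,
    )
```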
Some mandatory steps are needed to run it on AWS.
- Fill the "S3BucketName" and "RedshiftMasterUserPassword" inside the "aws/parameters/parameters.json" file.
The "S3BucketName" is the unique bucket name that will be created and the "RedshiftMasterUserPassword" is the Redshift database password. - Fill the "IAM_REDSHIFT_ROLE" and "BUCKET_NAME" inside the "src/aws_dags/helpers/conf.py" file.
The "IAM_REDSHIFT_ROLE" is the IAM role ARN required to access Redshift. This can be found using the AWS console and accessing IAM Roles. The name should be in the format "arn:aws:iam<AccountNumber>:role/AirflowTemplate-AirflowRole-<id>". The "BUCKET_NAME" is the Bucket name used in "S3BucketName". - Fill the "BUCKET_NAME" inside the "bootstrap.sh" file.
- When the "template.yaml" stack has been deployed, go to the Airflow UI at Admin -> Connections and create a new connection by pressing the "+" button in the upper left.
Add the following values:
- Connection Id: redshift_default
- Connection Type: Amazon Redshift
- Host: The "Endpoint" value taken from the Redshift page in AWS console without the port and schema. Has the following format: <clusterIdentifier>.<id>.<region>.redshift.amazonaws.com.
- Schema: The database name. Should be the same as the RedshiftDBName value in "aws/parameters/parameters.json".
- Login: The Admin user name. The default used is "master".
Can also be found inside the Redshift page in AWS console at Properties -> Database configurations -> Admin user name.
- Password: The database password. Should be the same as the "RedshiftMasterUserPassword" value in "aws/parameters/parameters.json".
- Port: Database port. The default is 5439. Can also be found inside the Redshift page in AWS console at Properties -> Database configurations -> Port.
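If you prefer not to click through the UI, the same connection can be created programmatically. The sketch below assumes Airflow 2.x; the placeholder values must be replaced with your own cluster's endpoint and credentials:

```python
# Sketch: create the redshift_default connection programmatically (Airflow 2.x).
# Replace the placeholder values with your own cluster's endpoint and credentials.
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id="redshift_default",
    conn_type="redshift",
    host="<clusterIdentifier>.<id>.<region>.redshift.amazonaws.com",
    schema="<RedshiftDBName>",
    login="master",
    password="<RedshiftMasterUserPassword>",
    port=5439,
)

session = settings.Session()
# Only add the connection if it does not already exist.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()
```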
To remove the created services, just delete the stack from the CloudFormation page on the AWS console.
Docker must be installed to run the Docker setup locally. The AWS CLI must be installed to run the AWS scripts.
- Set local Airflow ☑
- Install PySpark in Airflow image ☑
- Create local Airflow jobs ☑
- Create Airflow environment on AWS ☑
- Create Airflow Pyspark jobs on AWS EMR ☑
- Save EMR results on AWS Redshift ☑