Implement an ETL process that will run on a regular basis and will be responsible for extracting, cleaning, and loading the data for later use in business analysis. The final data model can be used to verify the correlation between (an example analytics query is sketched after this list):
- destination temperature and immigration statistics
- destination in the U.S. and the source country
- destination in the U.S. and the source country's climate
- arrival month and number of immigrants
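For illustration, here is a minimal sketch of the kind of analytics query the final data model could answer, covering the temperature/immigration correlation. The table and column names (`fact_immigration`, `dim_city`, `dim_temperature`, and their columns) are hypothetical stand-ins, not the project's actual schema:

```python
# Hypothetical Redshift analytics query: monthly immigrant counts per
# destination city alongside that city's average temperature.
# All table/column names below are illustrative assumptions.
CORRELATION_QUERY = """
SELECT f.arrival_month,
       d.city,
       t.avg_temperature,
       COUNT(*) AS num_immigrants
FROM fact_immigration AS f
JOIN dim_city        AS d ON f.city_id = d.city_id
JOIN dim_temperature AS t ON d.city_id = t.city_id
GROUP BY f.arrival_month, d.city, t.avg_temperature
ORDER BY f.arrival_month;
"""
```

The resulting rows could then be fed into any standard correlation measure (e.g. Pearson's r) between `avg_temperature` and `num_immigrants`.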
The project is based on the immigration dataset as the primary dataset, with supplementary datasets covering demographics, temperatures, and airport codes.
The end solution uses the Apache Airflow workflow system to run all the ETL stages on a monthly basis. Apache Spark is used to process (clean/transform) the immigration data, and the Spark output is saved to S3 buckets. Finally, the saved data is loaded into a Redshift cluster for business analytics queries.
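A minimal sketch of how such a monthly DAG might be wired together, assuming Airflow 2 import paths and hypothetical task callables (`process_immigration_data`, `load_to_redshift`); the real project's operators, connections, and parameters may differ:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_immigration_data(**context):
    """Hypothetical step: run the Spark cleaning/transform job and write to S3."""
    ...


def load_to_redshift(**context):
    """Hypothetical step: COPY the staged S3 output into the Redshift cluster."""
    ...


with DAG(
    dag_id="immigration_etl",
    start_date=datetime(2016, 1, 1),   # illustrative start date
    schedule_interval="@monthly",      # one run per month of immigration data
    catchup=False,
) as dag:
    process = PythonOperator(
        task_id="process_immigration_data",
        python_callable=process_immigration_data,
    )
    load = PythonOperator(
        task_id="load_to_redshift",
        python_callable=load_to_redshift,
    )

    process >> load  # the Spark transform must finish before the Redshift load
```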
There are four datasets:
- Immigration to the United States (source)
- U.S. city demographics (source)
- Airport codes (source)
- Temperatures (source)
The main dataset is the U.S. immigration dataset; the rest are supplementary.
A simplified diagram showing the main features of each dataset.
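Since the immigration data is the primary dataset, here is a rough sketch of what its Spark cleaning pass might look like. The input/output paths, the Parquet format, and the column names (`cicid`, `i94port`, `arrdate`) are assumptions about an I94-style immigration data layout, not confirmed by this project:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("immigration_cleaning")
    .getOrCreate()
)

# Assumed input location and format; the real project may read SAS files
# or CSVs from a different bucket.
immigration = spark.read.parquet("s3a://example-bucket/raw/immigration/")

cleaned = (
    immigration
    .dropDuplicates(["cicid"])              # assumed unique record id column
    .dropna(subset=["i94port", "arrdate"])  # drop rows missing key fields
)

# Staged output for the later Redshift COPY step.
cleaned.write.mode("overwrite").parquet("s3a://example-bucket/clean/immigration/")
```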
The data definition and EDA (exploratory data analysis) are placed in the /jupyter folder.
The ETL process implementation is placed in the /airflow folder.