Implement an ETL process that will run on a regular basis and will be responsible for extracting, cleaning, and loading the data for later use in business analysis. The final data model can be used to verify the correlation between (an example analytics query is sketched after this list):
- destination temperature and immigration statistics
- destination in the U.S. and the source country
- destination in the U.S. and the source country's climate
- arrival month and number of immigrants
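For illustration, here is a minimal sketch of the kind of analytics query the final data model could answer, covering the temperature/immigration correlation. The table and column names (`fact_immigration`, `dim_city`, `dim_temperature`, and their columns) are hypothetical stand-ins, not the project's actual schema:

```python
# Hypothetical Redshift analytics query: monthly immigrant counts per
# destination city alongside that city's average temperature.
# All table/column names below are illustrative assumptions.
CORRELATION_QUERY = """
SELECT f.arrival_month,
       d.city,
       t.avg_temperature,
       COUNT(*) AS num_immigrants
FROM fact_immigration AS f
JOIN dim_city        AS d ON f.city_id = d.city_id
JOIN dim_temperature AS t ON d.city_id = t.city_id
GROUP BY f.arrival_month, d.city, t.avg_temperature
ORDER BY f.arrival_month;
"""
```

The resulting rows could then be fed into any standard correlation measure (e.g. Pearson's r) between `avg_temperature` and `num_immigrants`.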
The project is based on the immigration dataset as the primary dataset, with supplementary datasets covering demographics, temperatures, and airport codes.
The end solution uses the Apache Airflow workflow system to run all the ETL stages on a monthly basis. Apache Spark is used to process (clean/transform) the immigration data, and the Spark output is saved to S3 buckets. Finally, the saved data is loaded into a Redshift cluster for business analytics queries.
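A minimal sketch of how such a monthly DAG might be wired together, assuming Airflow 2 import paths and hypothetical task callables (`process_immigration_data`, `load_to_redshift`); the real project's operators, connections, and parameters may differ:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_immigration_data(**context):
    """Hypothetical step: run the Spark cleaning/transform job and write to S3."""
    ...


def load_to_redshift(**context):
    """Hypothetical step: COPY the staged S3 output into the Redshift cluster."""
    ...


with DAG(
    dag_id="immigration_etl",
    start_date=datetime(2016, 1, 1),   # illustrative start date
    schedule_interval="@monthly",      # one run per month of immigration data
    catchup=False,
) as dag:
    process = PythonOperator(
        task_id="process_immigration_data",
        python_callable=process_immigration_data,
    )
    load = PythonOperator(
        task_id="load_to_redshift",
        python_callable=load_to_redshift,
    )

    process >> load  # the Spark transform must finish before the Redshift load
```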
There are four datasets:
- Immigration to the United States (source)
- U.S. city demographics (source)
- Airport codes (source)
- Temperatures (source)
The main dataset is the U.S. immigration dataset; the rest are supplementary.
A simplified diagram showing the main features of each dataset.
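Since the immigration data is the primary dataset, here is a rough sketch of what its Spark cleaning pass might look like. The input/output paths, the Parquet format, and the column names (`cicid`, `i94port`, `arrdate`) are assumptions about an I94-style immigration data layout, not confirmed by this project:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("immigration_cleaning")
    .getOrCreate()
)

# Assumed input location and format; the real project may read SAS files
# or CSVs from a different bucket.
immigration = spark.read.parquet("s3a://example-bucket/raw/immigration/")

cleaned = (
    immigration
    .dropDuplicates(["cicid"])              # assumed unique record id column
    .dropna(subset=["i94port", "arrdate"])  # drop rows missing key fields
)

# Staged output for the later Redshift COPY step.
cleaned.write.mode("overwrite").parquet("s3a://example-bucket/clean/immigration/")
```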
The data definition and EDA (exploratory data analysis) are placed in the /jupyter folder.
The ETL process implementation is placed in the /airflow folder.