The main challenge is to ingest data from a CSV file and an API using only serverless services from AWS.
Our solution makes use of six main technologies:
- ECS: as the platform to run our containers
- Glue: as the platform to run our ETL code
- Spark: as the main framework to process the data
- Metabase: as the data visualization tool
- Athena: as the query engine that retrieves data from S3 and delivers it to Metabase to build the dashboards
- Apache Airflow: as the scheduling and orchestration tool
In Apache Airflow, a DAG called star_schema.py was implemented to perform the ETL. It uses a single type of operator: the PythonOperator, which runs a function that triggers the Glue job, passing the required arguments. The same function also handles the job run, waiting until the job succeeds or fails.
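A minimal sketch of that pattern, assuming the Glue job is started directly with boto3; the task id, job name, and arguments below are illustrative, not the project's actual values:

```python
# Hypothetical sketch: a PythonOperator task that starts a Glue job
# and polls until it finishes (succeeds or fails).
import time

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def run_glue_job(job_name: str, arguments: dict) -> None:
    """Start a Glue job run and block until it succeeds or fails."""
    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name, Arguments=arguments)["JobRunId"]

    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "STOPPED", "TIMEOUT"):
            raise RuntimeError(f"Glue job {job_name} ended with state {state}")
        time.sleep(30)  # poll every 30 seconds


with DAG("star_schema", start_date=days_ago(1), schedule_interval=None) as dag:
    # Example task; the real DAG defines one such task per ETL step.
    extract_task = PythonOperator(
        task_id="run_extract_job",
        python_callable=run_glue_job,
        op_kwargs={
            "job_name": "example-glue-job",      # illustrative name
            "arguments": {"--entity": "banks"},  # illustrative arguments
        },
    )
```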
There is only one file being executed by all jobs on Glue: glue/job_template.py. This job calls the ETLFactory class to build each class that will be executed in the Glue job; a sketch of this wiring follows the list below. In the image below you can see a diagram of how the packages directory is structured.
- Logger: A class that implements and configures a logger for the project
- RestApiHook: A hook for REST APIs that implements the request methods
- DatabaseManager: A class specialized in performing actions on Postgres. Implements SQLAlchemy
- ETLBase: A base abstraction for the Extractor, Transformer and Loader classes. It simply uses the Logger class via composition and sets a class-level attribute with the 'root' path of the 'filesystem'
- RestApiExtractor: An abstract class for Extractor classes specialized in extracting from REST APIs
- Transformer: An abstraction that defines the structure of a Transformer class, which is responsible for transforming the extracted data
- Loader: The Loader class is responsible for loading/writing the extracted and transformed data to the database or the filesystem
- ETLFactory: An implementation of the Factory pattern, responsible for building the Extractor, Transformer and Loader classes
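For illustration, a simplified sketch of how glue/job_template.py might wire these classes together; the import path and the factory's build method signature are assumptions, not the project's actual interface:

```python
# Illustrative sketch of glue/job_template.py, assuming a simplified
# ETLFactory interface; the real signatures live in the packages directory.
import sys

from awsglue.utils import getResolvedOptions

from packages.etl_factory import ETLFactory  # hypothetical import path


def main() -> None:
    # Glue passes job parameters as command-line arguments.
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "entity"])

    # The factory builds the concrete Extractor, Transformer and Loader
    # for the requested entity.
    factory = ETLFactory()
    extractor, transformer, loader = factory.build(args["entity"])  # assumed method

    raw = extractor.extract()
    transformed = transformer.transform(raw)
    loader.load(transformed)


if __name__ == "__main__":
    main()
```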
Using Metabase connected to Athena, the three following graphs were generated:
- Relation between the total number of services provided by a bank and the number of complaints/issues
- Top banks with the most complaints/issues
- Top banks with free services (no fees)
- Local Environment
- Development Environment
In your terminal, execute the following command:
$ docker-compose up -d --build
- Airflow
url: http://localhost:8080/
- Metabase
url: http://localhost:3000/
In your terminal, execute the following command:
$ source deploy.sh
The deploy.sh script is located in the root of the project-4 folder. It basically performs the following actions:
- Asking for some required arguments (such as AWS_KEY and PASS info) if they are not set as environment variables
- Executing Terraform
- Sending some required files to the emr bucket on S3
- Building and pushing Airflow's Docker image (check the Dockerfile) to ECR
In your terminal, execute the following command:
$ python architecture_diagram/architecture.py
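A minimal sketch of what such a script could look like, assuming it uses the diagrams package; the node labels and layout below are illustrative, not the project's actual diagram code:

```python
# Hypothetical sketch of architecture_diagram/architecture.py, assuming the
# `diagrams` package is used to render the architecture as an image.
from diagrams import Cluster, Diagram
from diagrams.aws.analytics import Athena, Glue
from diagrams.aws.compute import ECS
from diagrams.aws.storage import S3
from diagrams.onprem.workflow import Airflow

with Diagram("Architecture", filename="architecture", show=False):
    airflow = Airflow("Airflow (ECS)")

    with Cluster("ETL"):
        glue = Glue("Glue Jobs")
        s3 = S3("Data Lake")

    athena = Athena("Athena")
    metabase = ECS("Metabase (ECS)")

    # Airflow triggers Glue, which writes to S3; Athena queries S3
    # and serves the results to Metabase.
    airflow >> glue >> s3 >> athena >> metabase
```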