
Project 5 - ETL with Serverless Services (AWS)

1. Context

The main challenge is to ingest data from a CSV file and an API using only serverless services from AWS.

2. Project

Our solution uses six main technologies:

  • ECS: the platform to run our containers
  • Glue: the platform to run our ETL code
  • Spark: the main engine used to process the data
  • Metabase: the data visualization tool
  • Athena: the query engine that retrieves data from S3 and delivers it to Metabase to build the dashboards
  • Apache Airflow: the scheduling and orchestration tool

[Image: solution architecture diagram]

In Apache Airflow, a DAG called star_schema.py was implemented to perform the ETL. It has the following operators:

[Image: star_schema DAG graph]

There is only one type of operator used in this DAG, the PythonOperator, which is responsible for running a function that triggers the Glue Job with the required arguments. The same function also handles the job execution, waiting until the job succeeds or fails.
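As a rough illustration of that trigger-and-wait logic, here is a minimal sketch assuming boto3 and illustrative job/task names (it is not the DAG's actual code):

```python
# Hedged sketch: triggers a Glue job via boto3 and polls until it finishes.
# Job name, arguments, task id and polling interval are illustrative assumptions.
import time
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_glue_job(job_name: str, arguments: dict) -> None:
    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name, Arguments=arguments)["JobRunId"]

    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "STOPPED", "TIMEOUT"):
            raise RuntimeError(f"Glue job {job_name} finished with state {state}")
        time.sleep(30)  # wait before polling the job state again


with DAG("star_schema", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    extract_banks = PythonOperator(
        task_id="extract_banks",  # hypothetical task id
        python_callable=run_glue_job,
        op_kwargs={"job_name": "job_template", "arguments": {"--step": "extract_banks"}},
    )
```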

There is only one file executed by all jobs on Glue: glue/job_template.py. Basically, this job calls the ETLFactory class to build each class that will be executed in the Glue Job. The image below shows a diagram of how the packages directory is structured.

[Image: packages directory structure diagram]

  • Logger: A class that implements/configures a logger for the project
  • RestApiHook: A hook for REST APIs which implements the request methods
  • DatabaseManager: A class specialized in performing actions on Postgres. Implements SQLAlchemy
  • ETLBase: A base abstraction for the Extractor, Transformer and Loader classes. It simply composes the Logger class and sets a class-level attribute with the 'root' path of the 'filesystem'.
  • RestApiExtractor: An abstract class for Extractor classes specialized in extracting from REST APIs
  • Transformer: An abstraction that sets the structure for a Transformer class, which is responsible for transforming the extracted data
  • Loader: The Loader class is responsible for loading/writing the (extracted, transformed) data to the database or the filesystem.
  • ETLFactory: An implementation of the Factory pattern, responsible for building the Extractor, Transformer and Loader classes (a minimal sketch of this pattern appears after this list).
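The factory idea is easier to see in code. The sketch below only illustrates the pattern, with hypothetical class names (BanksApiExtractor, StarSchemaTransformer, S3Loader) and a registry keyed by job name; it is not the repository's actual implementation:

```python
# Illustrative sketch of the ETLFactory pattern described above.
# All concrete class names and the registry key are assumptions.
from abc import ABC, abstractmethod


class ETLBase(ABC):
    """In the real project this composes the Logger and holds the filesystem root."""
    root_path = "s3://data-lake"  # assumption: class-level 'root' of the 'filesystem'


class Extractor(ETLBase):
    @abstractmethod
    def extract(self): ...


class Transformer(ETLBase):
    @abstractmethod
    def transform(self, data): ...


class Loader(ETLBase):
    @abstractmethod
    def load(self, data): ...


class BanksApiExtractor(Extractor):        # hypothetical concrete extractor
    def extract(self):
        return [{"bank": "Bank A", "complaints": 10}]


class StarSchemaTransformer(Transformer):  # hypothetical concrete transformer
    def transform(self, data):
        return data


class S3Loader(Loader):                    # hypothetical concrete loader
    def load(self, data):
        print(f"writing {len(data)} rows under {self.root_path}")


class ETLFactory:
    """Builds the Extractor/Transformer/Loader trio for a given job name."""
    _registry = {"banks_star_schema": (BanksApiExtractor, StarSchemaTransformer, S3Loader)}

    @classmethod
    def build(cls, job_name):
        extractor_cls, transformer_cls, loader_cls = cls._registry[job_name]
        return extractor_cls(), transformer_cls(), loader_cls()


# glue/job_template.py can then stay generic:
if __name__ == "__main__":
    extractor, transformer, loader = ETLFactory.build("banks_star_schema")
    loader.load(transformer.transform(extractor.extract()))
```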

Data Visualization

Using Metabase connected to Athena, the following three graphs were generated:

  • Relation between the total number of services provided by a bank and the number of complaints/issues.

[Image: services vs. complaints graph]

  • Top banks with the most complaints/issues.

[Image: top banks by complaints/issues]

  • Top banks with free services (no fee).

[Image: top banks by free services]

3. How to Run

3.1 Airflow + Metabase

3.1.1 Requirements

3.1.2 Executing the project - Local Environment

In your terminal, execute the following command:

$ docker-compose up -d --build

3.1.3 Accessing the services:

  1. Airflow
url: http://localhost:8080/
  2. Metabase
url: http://localhost:3000/

3.1.4 Executing the project - Development Environment

In your terminal, execute the following command:

$ source deploy.sh

The deploy.sh script is located in the root of the project-4 folder. It basically performs the following actions:

  • Asking for some required arguments (such as AWS_KEY and PASS info, if they are not set as environment variables)
  • Executing Terraform
  • Sending some required files to the emr bucket on S3
  • Building and pushing Airflow's Docker image (check the Dockerfile) to ECR

3.2 Diagrams

3.2.1 Requirements

3.2.2 Generating the Diagram

In your terminal, execute the following command:

$ python architecture_diagram/architecture.py
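The script presumably relies on the diagrams Python package (pip install diagrams, requires Graphviz) to render the architecture image. A minimal sketch under that assumption, with illustrative nodes and labels rather than the actual contents of architecture_diagram/architecture.py:

```python
# Minimal sketch assuming the 'diagrams' package; nodes and edges are illustrative.
from diagrams import Diagram
from diagrams.aws.analytics import Athena, Glue
from diagrams.aws.compute import ECS
from diagrams.aws.storage import S3

with Diagram("ETL with Serverless Services (AWS)", filename="architecture", show=False):
    airflow = ECS("Airflow on ECS")
    glue_jobs = Glue("Glue Jobs (Spark)")
    data_lake = S3("Data Lake")
    athena = Athena("Athena")

    airflow >> glue_jobs >> data_lake >> athena
```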