The goal of this project is to demonstrate my ability to consume data from an API, transform it, and persist it into a data lake following the medallion architecture with three layers: raw data, curated data partitioned by location, and an analytical aggregated layer.
- API: Use the Open Brewery DB API to fetch data about breweries.
- Orchestration Tool: The pipeline is built using Apache Airflow to handle scheduling, retries, and error handling.
- Language: The project is implemented in Python. Test cases are included to ensure the code works as expected.
- Containerization: Docker is used to modularize and run the application.
- Data Lake Architecture:
- Bronze Layer: Raw data fetched from the API is persisted in its native format.
- Silver Layer: Data is transformed into a columnar storage format (parquet) and partitioned by brewery location.
- Gold Layer: An aggregated view is created with the quantity of breweries per type and location.
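The silver-to-gold step boils down to a count of breweries grouped by type and location. Below is a minimal sketch of that aggregation, assuming the silver layer is a Parquet dataset readable by pandas; the column names `state` and `brewery_type` follow the Open Brewery DB schema, while the file paths are illustrative rather than the project's actual layout.

```python
# Illustrative sketch of the silver -> gold aggregation (paths are placeholders,
# column names assume the Open Brewery DB schema).
import pandas as pd


def build_gold_view(
    silver_path: str = "data/silver",
    gold_path: str = "data/gold/breweries_per_type_and_location.parquet",
) -> pd.DataFrame:
    """Aggregate the quantity of breweries per type and location."""
    silver = pd.read_parquet(silver_path)  # partitioned Parquet dataset
    gold = (
        silver.groupby(["state", "brewery_type"], as_index=False)
              .agg(brewery_count=("id", "count"))
    )
    gold.to_parquet(gold_path, index=False)
    return gold


if __name__ == "__main__":
    print(build_gold_view().head())
```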
```
AB_InBev/
├── dags/
│   └── brewery_pipeline.py
├── scripts/
│   ├── fetch_data.py
│   ├── transform_data.py
│   └── aggregate_data.py
├── data/
│   ├── bronze/
│   ├── silver/
│   └── gold/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
```
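As an example of what the bronze step in `scripts/fetch_data.py` might look like, here is a minimal sketch that pages through the Open Brewery DB list endpoint and saves the raw JSON payload. The endpoint and `per_page`/`page` parameters are from the public API documentation; the function name and output path are assumptions for illustration, not necessarily the repository's exact code.

```python
# Illustrative sketch of the bronze-layer fetch step (not the project's exact script).
import json

import requests

API_URL = "https://api.openbrewerydb.org/v1/breweries"  # Open Brewery DB list endpoint


def fetch_breweries(per_page: int = 200) -> list[dict]:
    """Page through the API until an empty page is returned."""
    breweries, page = [], 1
    while True:
        resp = requests.get(API_URL, params={"per_page": per_page, "page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        breweries.extend(batch)
        page += 1
    return breweries


if __name__ == "__main__":
    data = fetch_breweries()
    # Persist the raw payload in its native JSON format (bronze layer).
    with open("data/bronze/breweries.json", "w") as f:
        json.dump(data, f)
```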
- Docker
- Docker Compose
- AWS account with access to S3
- Clone the repository:

  ```bash
  git clone https://github.com/ste-aina/brewery-data.git
  cd AB_InBev
  ```

- Create a `.env` file in the root of the project with the following content:

  ```
  AWS_ACCESS_KEY_ID='your_aws_access_key_id'
  AWS_SECRET_ACCESS_KEY='your_aws_secret_access_key'
  ```

- Build and start the containers:

  ```bash
  docker-compose up --build
  ```

- Access the Airflow web interface at http://localhost:8080 and trigger the `brewery_pipeline` DAG.
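The credentials from `.env` are what the pipeline would use to write each layer to S3. As a rough sketch under assumptions (the bucket name and helper function below are hypothetical, not taken from the repository), an upload step could look like this:

```python
# Illustrative sketch of pushing a local file to S3 with credentials from .env;
# the bucket name and helper are hypothetical, not from the repo.
import os

import boto3


def upload_file_to_s3(local_path: str, bucket: str = "brewery-data-lake", key: str | None = None) -> None:
    """Upload a single file to S3 using credentials taken from the environment."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    s3.upload_file(local_path, bucket, key or os.path.basename(local_path))


if __name__ == "__main__":
    upload_file_to_s3("data/bronze/breweries.json", key="bronze/breweries.json")
```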
Ensure your AWS credentials are correctly set up in the `.env` file. Do not commit these credentials to the repository.
To implement monitoring and alerting:
- Use Airflow's built-in logging and email alerting features.
- Set up alerts for task failures and data quality issues.
- Monitor the S3 bucket for data consistency and completeness.
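Airflow's built-in alerting is configured through the DAG's `default_args`. The sketch below shows how failure emails, retries, and a custom failure callback could be wired up; the email address and callback body are placeholders, and Airflow's SMTP connection must be configured separately for email alerts to be delivered.

```python
# Illustrative sketch of task-failure alerting via default_args
# (email address is a placeholder; SMTP must be configured in Airflow).
from datetime import timedelta


def notify_failure(context):
    """Custom failure hook; could forward to Slack, PagerDuty, etc."""
    print(f"Task {context['task_instance'].task_id} failed.")


default_args = {
    "owner": "data-engineering",
    "email": ["alerts@example.com"],
    "email_on_failure": True,           # built-in email alerting on task failure
    "retries": 2,                       # retry transient API/S3 errors
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,
}
```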
- DAGs: The `brewery_pipeline.py` file defines the Airflow DAG and tasks for fetching, transforming, and aggregating the data (a minimal sketch follows this list).
- Scripts: Individual Python scripts for each step in the data pipeline.
- Docker: Dockerfile and docker-compose.yml are used for containerization and orchestration.
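For orientation, here is a minimal sketch of how `brewery_pipeline.py` could chain the three scripts with `PythonOperator`. The imported callables, module paths, and schedule are assumptions for illustration; the actual DAG in the repository may differ.

```python
# Illustrative sketch of the DAG wiring (callables, module paths, and schedule are assumptions).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from scripts.fetch_data import fetch_breweries          # assumed entry points
from scripts.transform_data import transform_to_silver  # exposed by the scripts/ modules
from scripts.aggregate_data import build_gold_view

with DAG(
    dag_id="brewery_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_bronze", python_callable=fetch_breweries)
    transform = PythonOperator(task_id="transform_silver", python_callable=transform_to_silver)
    aggregate = PythonOperator(task_id="aggregate_gold", python_callable=build_gold_view)

    fetch >> transform >> aggregate  # bronze -> silver -> gold
```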
This project demonstrates the ability to build a robust data pipeline using Airflow, Docker, and AWS S3, following best practices for data engineering.
Feel free to reach out if you have any questions or need further assistance.