This project provides a detailed guide to creating a complete data engineering pipeline. It walks you through each phase from data ingestion to processing and storage, using a robust technology stack including Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. The entire setup is containerized with Docker for simplified deployment and scalability.
The architecture of the project includes:
- Data Source: Uses the randomuser.me API to generate random user data for the pipeline (see the ingestion sketch after this list).
- Apache Airflow: Orchestrates the pipeline and stores fetched data in a PostgreSQL database.
- Apache Kafka and Zookeeper: Facilitate data streaming from PostgreSQL to the processing engine.
- Control Center and Schema Registry: Monitor Kafka streams and manage their schemas.
- Apache Spark: Handles data processing with a master-worker setup.
- Cassandra: Stores the processed data.
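
As a rough illustration of the ingestion path described above, the sketch below shows how a Python task might fetch one record from the randomuser.me API and publish it to Kafka. It is a minimal example rather than the project's actual code: the broker address (`broker:29092`), the topic name (`users_created`), and the use of the kafka-python client are assumptions made for illustration.

```python
import json

import requests
from kafka import KafkaProducer


def fetch_random_user():
    """Fetch a single random user record from the randomuser.me API."""
    response = requests.get("https://randomuser.me/api/")
    response.raise_for_status()
    return response.json()["results"][0]


def stream_user_to_kafka():
    """Publish the fetched user record to a Kafka topic (broker and topic names are assumptions)."""
    producer = KafkaProducer(
        bootstrap_servers=["broker:29092"],  # assumed broker address inside the Docker network
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    user = fetch_random_user()
    producer.send("users_created", user)  # hypothetical topic name
    producer.flush()


if __name__ == "__main__":
    stream_user_to_kafka()
```

In the full project, logic like this would typically live inside an Airflow DAG task so the fetch-and-publish step can be scheduled and monitored by the orchestrator.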
What you will learn from this project:
- How to establish a data pipeline using Apache Airflow
- Techniques for real-time data streaming with Apache Kafka
- Distributed synchronization using Apache Zookeeper
- Data processing with Apache Spark (see the streaming sketch after this list)
- Data storage methods with Cassandra and PostgreSQL
- Containerization of the entire data engineering workflow with Docker
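
To make the Spark and Cassandra topics above more concrete, here is a minimal Spark Structured Streaming sketch that reads a Kafka topic and writes the parsed records to Cassandra. Package versions, hostnames (`broker`, `cassandra`), the topic name (`users_created`), and the keyspace/table names (`spark_streams.created_users`) are all assumptions for illustration, not the project's definitive configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Spark session with Kafka and Cassandra connector packages (versions are assumptions).
spark = (
    SparkSession.builder
    .appName("UserStreamProcessor")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .config("spark.cassandra.connection.host", "cassandra")  # assumed container hostname
    .getOrCreate()
)

# Minimal schema for the incoming user records (fields are illustrative).
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

# Read the Kafka topic as a stream and parse the JSON payload in the message value.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")  # assumed broker address
    .option("subscribe", "users_created")                # hypothetical topic name
    .load()
)
users_df = kafka_df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Write each micro-batch to Cassandra (keyspace and table names are assumptions).
query = (
    users_df.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/spark_checkpoint")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```

The `checkpointLocation` option is required for any streaming sink so Spark can recover its progress after a restart; in a containerized setup it would normally point to a mounted volume rather than `/tmp`.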
Technologies used:
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
For detailed setup instructions, refer to the video tutorial linked below.
Watch the original YouTube Video Tutorial.