This project provides a detailed guide to creating a complete data engineering pipeline. It walks you through each phase from data ingestion to processing and storage, using a robust technology stack including Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. The entire setup is containerized with Docker for simplified deployment and scalability.
The architecture of the project includes:
- Data Source: Uses the randomuser.me API to generate random user data for the pipeline (see the ingestion sketch after this list).
- Apache Airflow: Orchestrates the pipeline and stores fetched data in a PostgreSQL database.
- Apache Kafka and Zookeeper: Facilitate data streaming from PostgreSQL to the processing engine.
- Control Center and Schema Registry: Monitor Kafka streams and manage their schemas.
- Apache Spark: Handles data processing with a master-worker setup.
- Cassandra: Stores the processed data.
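
As a rough illustration of the ingestion path described above, the sketch below shows how a Python task might fetch one record from the randomuser.me API and publish it to Kafka. It is a minimal example rather than the project's actual code: the broker address (`broker:29092`), the topic name (`users_created`), and the use of the kafka-python client are assumptions made for illustration.

```python
import json

import requests
from kafka import KafkaProducer


def fetch_random_user():
    """Fetch a single random user record from the randomuser.me API."""
    response = requests.get("https://randomuser.me/api/")
    response.raise_for_status()
    return response.json()["results"][0]


def stream_user_to_kafka():
    """Publish the fetched user record to a Kafka topic (broker and topic names are assumptions)."""
    producer = KafkaProducer(
        bootstrap_servers=["broker:29092"],  # assumed broker address inside the Docker network
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    user = fetch_random_user()
    producer.send("users_created", user)  # hypothetical topic name
    producer.flush()


if __name__ == "__main__":
    stream_user_to_kafka()
```

In the full project, logic like this would typically live inside an Airflow DAG task so the fetch-and-publish step can be scheduled and monitored by the orchestrator.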
What you will learn from this project:
- How to establish a data pipeline using Apache Airflow
- Techniques for real-time data streaming with Apache Kafka
- Distributed synchronization using Apache Zookeeper
- Data processing with Apache Spark (see the streaming sketch after this list)
- Data storage methods with Cassandra and PostgreSQL
- Containerization of the entire data engineering workflow with Docker
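
To make the Spark and Cassandra topics above more concrete, here is a minimal Spark Structured Streaming sketch that reads a Kafka topic and writes the parsed records to Cassandra. Package versions, hostnames (`broker`, `cassandra`), the topic name (`users_created`), and the keyspace/table names (`spark_streams.created_users`) are all assumptions for illustration, not the project's definitive configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Spark session with Kafka and Cassandra connector packages (versions are assumptions).
spark = (
    SparkSession.builder
    .appName("UserStreamProcessor")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .config("spark.cassandra.connection.host", "cassandra")  # assumed container hostname
    .getOrCreate()
)

# Minimal schema for the incoming user records (fields are illustrative).
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

# Read the Kafka topic as a stream and parse the JSON payload in the message value.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")  # assumed broker address
    .option("subscribe", "users_created")                # hypothetical topic name
    .load()
)
users_df = kafka_df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Write each micro-batch to Cassandra (keyspace and table names are assumptions).
query = (
    users_df.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/spark_checkpoint")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```

The `checkpointLocation` option is required for any streaming sink so Spark can recover its progress after a restart; in a containerized setup it would normally point to a mounted volume rather than `/tmp`.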
Technologies used:
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
For detailed setup instructions, refer to the video tutorial linked below.
Watch the original YouTube Video Tutorial.