This project aims to build a data warehousing solution using Apache Hive for data storage and SQL querying, combined with PostgreSQL for relational data storage and analytics. The project utilizes Docker for containerization, allowing for a seamless setup and deployment.
- Store transaction data in Apache Hive.
- Design a dimensional model for the data warehouse.
- Apply ETL (Extract, Transform, Load) processes to load data into PostgreSQL tables.
- Connect PostgreSQL to Apache Superset for data visualization and analytics.
- Apache Hive: Data warehouse infrastructure built on top of Hadoop for data summarization and analysis.
- PostgreSQL: Relational database management system for structured data storage and queries.
- Apache Superset: Open-source data visualization tool for creating interactive dashboards.
- Docker: Containerization platform for managing application environments.
/data-warehousing-project
│
├── docker-compose.yml
├── etl
│ ├── energy_data.csv
│ └── etl.py
├── superset_visualization.png
└── README.md
Here is an example of a visualization created in Apache Superset:

![Superset visualization](superset_visualization.png)
Before you begin, ensure you have the following installed:
- Docker and Docker Compose
- Python 3
Clone this repository to your local machine:
git clone https://github.com/1abdelhalim/DWH-with-Apache-Hive-and-PostgreSQL
cd DWH-with-Apache-Hive-and-PostgreSQL
Run the following command to start all services defined in the docker-compose.yml file:
docker-compose up -d
- Access the PostgreSQL container:
docker exec -it <postgres-container-name> psql -U postgres
- Create a database and user:
CREATE DATABASE energy_db;
CREATE USER energy_user WITH PASSWORD 'energy_password';
GRANT ALL PRIVILEGES ON DATABASE energy_db TO energy_user;
- Exit PostgreSQL:
\q
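The database and user created above are what the ETL script will connect to. As a quick sanity check before running the pipeline, the SQLAlchemy connection URL can be assembled and inspected (a minimal sketch; `build_postgres_url` is a hypothetical helper, not part of the project):

```python
# Sketch: assemble the SQLAlchemy-style URL the ETL script will use.
# The credentials below are the ones created in the step above.

def build_postgres_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Return a SQLAlchemy-style PostgreSQL connection URL."""
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

url = build_postgres_url("energy_user", "energy_password", "localhost", 5432, "energy_db")
print(url)  # postgresql://energy_user:energy_password@localhost:5432/energy_db
```

If this URL does not match the one in etl.py, the load step will fail with an authentication or "database does not exist" error.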
- Access the Hive container:
docker exec -it <hive-container-name> bash
- Start the Hive CLI:
hive
- Create a Hive database and tables:
CREATE DATABASE energy_db;
USE energy_db;

CREATE TABLE Dim_Country (
  country_id INT,
  country_name STRING
);

CREATE TABLE Dim_Category (
  category STRING
);

CREATE TABLE Dim_Year (
  year_id INT,
  year INT
);

CREATE TABLE Fact_Transactions (
  transaction_id INT,
  quantity DOUBLE,
  unit STRING
);
- Exit Hive:
exit;
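The dimension tables above are keyed by surrogate IDs. To illustrate how an ETL step might derive those keys from raw transaction rows, here is a minimal pure-Python sketch (the sample records are hypothetical, not taken from energy_data.csv):

```python
# Sketch: derive dimension rows (surrogate key per distinct value) from raw records.
# Sample data is hypothetical; the real pipeline reads etl/energy_data.csv.
raw_rows = [
    {"country": "Austria", "year": 2019, "quantity": 120.5},
    {"country": "Belgium", "year": 2019, "quantity": 98.0},
    {"country": "Austria", "year": 2020, "quantity": 131.2},
]

def build_dimension(rows, column):
    """Assign a 1-based surrogate key to each distinct value of `column`."""
    values = sorted({r[column] for r in rows})
    return {value: key for key, value in enumerate(values, start=1)}

country_dim = build_dimension(raw_rows, "country")  # {'Austria': 1, 'Belgium': 2}
year_dim = build_dimension(raw_rows, "year")        # {2019: 1, 2020: 2}

# Fact rows reference the dimensions through their surrogate keys.
fact_rows = [
    {"country_id": country_dim[r["country"]],
     "year_id": year_dim[r["year"]],
     "quantity": r["quantity"]}
    for r in raw_rows
]
```

Loading each resulting dict as a table keeps the fact table narrow and the descriptive text confined to the dimensions, which is the point of the star schema.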
Ensure that etl.py uses the correct PostgreSQL user and database credentials:
postgres_engine = create_engine('postgresql://energy_user:energy_password@localhost:5432/energy_db')
Run the ETL script to extract data from the CSV file and load it into PostgreSQL and Hive:
python etl/etl.py
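The load step of such a script typically hands a DataFrame to SQLAlchemy. Below is a minimal, self-contained sketch of that pattern; it uses an in-memory SQLite engine so it runs without a live server, whereas the real script would pass the postgresql:// URL shown above (rows and column names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative rows; the real pipeline reads them from etl/energy_data.csv.
df = pd.DataFrame({
    "country_name": ["Austria", "Belgium"],
    "year": [2019, 2019],
    "quantity": [120.5, 98.0],
})

# In-memory SQLite stands in for PostgreSQL here so the sketch is runnable
# anywhere; the ETL script would call create_engine with the postgresql:// URL.
engine = create_engine("sqlite://")
df.to_sql("fact_transactions", engine, index=False, if_exists="replace")

# Read the table back to confirm the load succeeded.
loaded = pd.read_sql("SELECT COUNT(*) AS n FROM fact_transactions", engine)
print(loaded["n"].iloc[0])  # 2
```

`if_exists="replace"` makes the load idempotent, which is convenient while iterating on the transform logic.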
- Open your web browser and navigate to http://localhost:8088.
- Create an admin account if prompted and log in.
- Connect to PostgreSQL and Hive databases under Data > Databases.
- Go to Charts to create new visualizations using your data.
- Use the Dashboards feature to compile your charts into an interactive dashboard.