Data Warehousing Project with Apache Hive and PostgreSQL

Project Overview

This project builds a data warehousing solution that uses Apache Hive for data storage and SQL querying, combined with PostgreSQL for relational storage and analytics. Docker is used for containerization, allowing for straightforward setup and deployment.

Objectives

  • Store transaction data in Apache Hive.
  • Design a dimensional model for the data warehouse.
  • Apply ETL (Extract, Transform, Load) processes to load data into PostgreSQL tables.
  • Connect PostgreSQL to Apache Superset for data visualization and analytics.

Technologies Used

  • Apache Hive: Data warehouse infrastructure built on top of Hadoop for data summarization and analysis.
  • PostgreSQL: Relational database management system for structured data storage and queries.
  • Apache Superset: Open-source data visualization tool for creating interactive dashboards.
  • Docker: Containerization platform for managing application environments.

Project Structure

/DWH-with-Apache-Hive-and-PostgreSQL
│
├── docker-compose.yml
├── etl
│   ├── energy_data.csv
│   └── etl.py
├── superset_visualization.png
└── README.md

Visualizations

Here is an example of a visualization created in Apache Superset:

(see superset_visualization.png in the repository root)

Prerequisites

Before you begin, ensure you have the following installed:

  • Docker and Docker Compose
  • Python 3, along with the packages the ETL script imports (e.g., pandas and SQLAlchemy)

Getting Started

Step 1: Clone the Repository

Clone this repository to your local machine:

git clone https://github.com/1abdelhalim/DWH-with-Apache-Hive-and-PostgreSQL
cd DWH-with-Apache-Hive-and-PostgreSQL

Step 2: Start Docker Containers

Run the following command to start all services defined in the docker-compose.yml file:

docker-compose up -d
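
You can confirm that all containers are running (and find the container names used in the docker exec commands below) with:

docker-compose ps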

Step 3: Create PostgreSQL Database and User

  1. Access the PostgreSQL container:

    docker exec -it <postgres-container-name> psql -U postgres
  2. Create a database and user:

    CREATE DATABASE energy_db;
    CREATE USER energy_user WITH PASSWORD 'energy_password';
    GRANT ALL PRIVILEGES ON DATABASE energy_db TO energy_user;
  3. Exit PostgreSQL:

    \q
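
Optionally, verify the new credentials from Python before running the ETL. This is a minimal sketch, assuming SQLAlchemy and the psycopg2 driver are installed on the host and that the container maps port 5432:

from sqlalchemy import create_engine, text

# uses the user and database created above, on the default port mapping
engine = create_engine('postgresql://energy_user:energy_password@localhost:5432/energy_db')
with engine.connect() as conn:
    # prints the server version if the connection works
    print(conn.execute(text('SELECT version()')).scalar())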

Step 4: Set Up Hive

  1. Access the Hive container:

    docker exec -it <hive-container-name> bash
  2. Start the Hive CLI:

    hive
  3. Create a Hive database and tables:

    CREATE DATABASE energy_db;
    USE energy_db;
    
    CREATE TABLE Dim_Country (
        country_id INT,
        country_name STRING
    );
    
    CREATE TABLE Dim_Category (
        category STRING
    );
    
    CREATE TABLE Dim_Year (
        year_id INT,
        year INT
    );
    
    CREATE TABLE Fact_Transactions (
        transaction_id INT,
        country_id INT,    -- references Dim_Country
        category STRING,   -- references Dim_Category
        year_id INT,       -- references Dim_Year
        quantity DOUBLE,
        unit STRING
    );
  4. Exit Hive:

    exit;
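
If you prefer to script against Hive rather than use the CLI, the check below is a minimal sketch using the PyHive package (an assumption; it is not part of this repository), with HiveServer2 assumed reachable on localhost:10000:

from pyhive import hive

# connects to the database created above
conn = hive.Connection(host='localhost', port=10000, database='energy_db')
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
print(cursor.fetchall())  # should list the four tables created above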

Step 5: Modify the ETL Script

Ensure that etl/etl.py uses the PostgreSQL user, password, and database created in Step 3:

postgres_engine = create_engine('postgresql://energy_user:energy_password@localhost:5432/energy_db')
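
For reference, the sketch below shows one way the script's extract-transform-load flow might look. The CSV column names and target table names are assumptions based on the dimensional model above, not necessarily the actual layout of energy_data.csv:

import pandas as pd
from sqlalchemy import create_engine

postgres_engine = create_engine('postgresql://energy_user:energy_password@localhost:5432/energy_db')

# Extract: read the raw transaction data
df = pd.read_csv('etl/energy_data.csv')

# Transform: split the flat file into dimension and fact frames
# (column names are assumed to match the Hive model above)
dim_country = df[['country_id', 'country_name']].drop_duplicates()
dim_category = df[['category']].drop_duplicates()
dim_year = df[['year_id', 'year']].drop_duplicates()
fact = df[['transaction_id', 'country_id', 'category', 'year_id', 'quantity', 'unit']]

# Load: write each frame to its PostgreSQL table
for name, frame in {'dim_country': dim_country,
                    'dim_category': dim_category,
                    'dim_year': dim_year,
                    'fact_transactions': fact}.items():
    frame.to_sql(name, postgres_engine, if_exists='replace', index=False)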

Step 6: Run the ETL Script

Run the ETL script to extract data from the CSV file and load it into PostgreSQL and Hive:

python etl/etl.py
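
After the script finishes, a quick row count confirms the load (assuming the fact_transactions table name from the sketch above):

docker exec -it <postgres-container-name> psql -U energy_user -d energy_db -c "SELECT COUNT(*) FROM fact_transactions;"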

Step 7: Access Apache Superset

  1. Open your web browser and navigate to http://localhost:8088.
  2. Create an admin account if prompted and log in.
  3. Connect to PostgreSQL and Hive databases under Data > Databases.
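
Superset identifies each database by a SQLAlchemy URI. Assuming the docker-compose service names are postgres and hive-server (an assumption; check your docker-compose.yml), the URIs would look like:

postgresql://energy_user:energy_password@postgres:5432/energy_db
hive://hive@hive-server:10000/energy_db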

Step 8: Create Visualizations

  1. Go to Charts to create new visualizations using your data.
  2. Use the Dashboards feature to compile your charts into an interactive dashboard.
