This project aims to build a data warehousing solution using Apache Hive for data storage and SQL querying, combined with PostgreSQL for relational data storage and analytics. The project utilizes Docker for containerization, allowing for a seamless setup and deployment.
- Store transaction data in Apache Hive.
- Design a dimensional model for the data warehouse.
- Apply ETL (Extract, Transform, Load) processes to load data into PostgreSQL tables.
- Connect PostgreSQL to Apache Superset for data visualization and analytics.
- Apache Hive: Data warehouse infrastructure built on top of Hadoop for data summarization and analysis.
- PostgreSQL: Relational database management system for structured data storage and queries.
- Apache Superset: Open-source data visualization tool for creating interactive dashboards.
- Docker: Containerization platform for managing application environments.
/data-warehousing-project
│
├── docker-compose.yml
├── etl
│ ├── energy_data.csv
│ └── etl.py
├── superset_visualization.png
└── README.md
Here is an example of a visualization created in Apache Superset:

![Superset visualization](superset_visualization.png)
Before you begin, ensure you have the following installed:
- Docker and Docker Compose
- Python 3
Clone this repository to your local machine:
git clone https://github.com/1abdelhalim/DWH-with-Apache-Hive-and-PostgreSQL
cd DWH-with-Apache-Hive-and-PostgreSQL
Run the following command to start all services defined in the docker-compose.yml file:
docker-compose up -d
- Access the PostgreSQL container:
docker exec -it <postgres-container-name> psql -U postgres
- Create a database and user:
CREATE DATABASE energy_db;
CREATE USER energy_user WITH PASSWORD 'energy_password';
GRANT ALL PRIVILEGES ON DATABASE energy_db TO energy_user;
- Exit PostgreSQL:
\q
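The database and user created above are what the ETL script will connect to. As a quick sanity check before running the pipeline, the SQLAlchemy connection URL can be assembled and inspected (a minimal sketch; `build_postgres_url` is a hypothetical helper, not part of the project):

```python
# Sketch: assemble the SQLAlchemy-style URL the ETL script will use.
# The credentials below are the ones created in the step above.

def build_postgres_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Return a SQLAlchemy-style PostgreSQL connection URL."""
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

url = build_postgres_url("energy_user", "energy_password", "localhost", 5432, "energy_db")
print(url)  # postgresql://energy_user:energy_password@localhost:5432/energy_db
```

If this URL does not match the one in etl.py, the load step will fail with an authentication or "database does not exist" error.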
- Access the Hive container:
docker exec -it <hive-container-name> bash
- Start the Hive CLI:
hive
- Create a Hive database and tables:
CREATE DATABASE energy_db;
USE energy_db;

CREATE TABLE Dim_Country (
  country_id INT,
  country_name STRING
);

CREATE TABLE Dim_Category (
  category STRING
);

CREATE TABLE Dim_Year (
  year_id INT,
  year INT
);

CREATE TABLE Fact_Transactions (
  transaction_id INT,
  quantity DOUBLE,
  unit STRING
);
- Exit Hive:
exit;
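The dimension tables above are keyed by surrogate IDs. To illustrate how an ETL step might derive those keys from raw transaction rows, here is a minimal pure-Python sketch (the sample records are hypothetical, not taken from energy_data.csv):

```python
# Sketch: derive dimension rows (surrogate key per distinct value) from raw records.
# Sample data is hypothetical; the real pipeline reads etl/energy_data.csv.
raw_rows = [
    {"country": "Austria", "year": 2019, "quantity": 120.5},
    {"country": "Belgium", "year": 2019, "quantity": 98.0},
    {"country": "Austria", "year": 2020, "quantity": 131.2},
]

def build_dimension(rows, column):
    """Assign a 1-based surrogate key to each distinct value of `column`."""
    values = sorted({r[column] for r in rows})
    return {value: key for key, value in enumerate(values, start=1)}

country_dim = build_dimension(raw_rows, "country")  # {'Austria': 1, 'Belgium': 2}
year_dim = build_dimension(raw_rows, "year")        # {2019: 1, 2020: 2}

# Fact rows reference the dimensions through their surrogate keys.
fact_rows = [
    {"country_id": country_dim[r["country"]],
     "year_id": year_dim[r["year"]],
     "quantity": r["quantity"]}
    for r in raw_rows
]
```

Loading each resulting dict as a table keeps the fact table narrow and the descriptive text confined to the dimensions, which is the point of the star schema.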
Ensure that etl.py uses the correct PostgreSQL user and database credentials:
postgres_engine = create_engine('postgresql://energy_user:energy_password@localhost:5432/energy_db')
Run the ETL script to extract data from the CSV file and load it into PostgreSQL and Hive:
python etl/etl.py
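The load step of such a script typically hands a DataFrame to SQLAlchemy. Below is a minimal, self-contained sketch of that pattern; it uses an in-memory SQLite engine so it runs without a live server, whereas the real script would pass the postgresql:// URL shown above (rows and column names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative rows; the real pipeline reads them from etl/energy_data.csv.
df = pd.DataFrame({
    "country_name": ["Austria", "Belgium"],
    "year": [2019, 2019],
    "quantity": [120.5, 98.0],
})

# In-memory SQLite stands in for PostgreSQL here so the sketch is runnable
# anywhere; the ETL script would call create_engine with the postgresql:// URL.
engine = create_engine("sqlite://")
df.to_sql("fact_transactions", engine, index=False, if_exists="replace")

# Read the table back to confirm the load succeeded.
loaded = pd.read_sql("SELECT COUNT(*) AS n FROM fact_transactions", engine)
print(loaded["n"].iloc[0])  # 2
```

`if_exists="replace"` makes the load idempotent, which is convenient while iterating on the transform logic.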
- Open your web browser and navigate to http://localhost:8088.
- Create an admin account if prompted and log in.
- Connect to PostgreSQL and Hive databases under Data > Databases.
- Go to Charts to create new visualizations using your data.
- Use the Dashboards feature to compile your charts into an interactive dashboard.