This project consumes the REST Countries API and parses its data into dataframes. The API exposes several kinds of information about each country, and it was chosen for being open source, which makes it a good target for practicing data engineering techniques.
Data for all countries is received as a list of dictionaries in a JSON response.
The Country class is responsible for parsing and assembling tables for each country contained in the response.
CountriesHook makes a GET request to the all-countries endpoint and saves the JSON response to a file. It then parses and concatenates the information from every country and saves all tables to the data folder. This project uses pandas for data manipulation, so all pandas file formats are supported.
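The exact parsing lives in the Country class; as a rough sketch, flattening one country dictionary into a table row might look like the code below. The field names come from the REST Countries v3.1 response, but the row layout is an illustration, not the project's actual schema.

```python
# Trimmed sample of one entry from the REST Countries v3.1 /all response
# (the real response is a list of such dictionaries).
sample_country = {
    "name": {"common": "Brazil", "official": "Federative Republic of Brazil"},
    "cca2": "BR",
    "capital": ["Brasília"],
    "region": "Americas",
    "population": 212559417,
    "languages": {"por": "Portuguese"},
}

def country_to_row(country: dict) -> dict:
    """Flatten nested country data into a single flat row; a list of such
    rows is what ultimately becomes a pandas DataFrame."""
    return {
        "code": country["cca2"],
        "name": country["name"]["common"],
        "capital": country.get("capital", [None])[0],
        "region": country["region"],
        "population": country["population"],
    }

row = country_to_row(sample_country)
```

Nested fields like `name` and `capital` are the reason a parsing step is needed at all: the raw response cannot be loaded into a flat table as-is.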
To start using this repository, create a project folder, open a terminal, and clone the countries_dw repository:
git clone https://github.com/canutera/countries_dw.git
This project uses Poetry to manage its dependencies. Make sure you have the poetry Python package installed; if you don't, run the following command:
pip install poetry
Then run:
# add commented line to create venv inside the project folder
# poetry config virtualenvs.in-project true
poetry install
to install the project dependencies inside the current folder. After installing, run:
poetry shell
Your brand new virtual environment should now be set up and ready to use.
The example below shows a simple usage of the countries_dw project, which can be found in the main.py file:
from src.countries_hook import CountriesHook
file_format = 'csv'
hook = (CountriesHook()
        .parse_countries()
        .parse_tables()
        .save_tables(file_format, index=False)
        )
If you run this code in a Jupyter notebook, make sure the notebook is connected to the virtual environment created for this project. After running the code above, all tables should appear in the data folder. Just make sure a valid format was passed; you can check the pandas.DataFrame documentation for the valid formats.
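Since pandas exposes one writer method per format (to_csv, to_json, to_parquet, ...), a save step like save_tables can dispatch on the format name. The sketch below is a hypothetical illustration of that idea, not the project's actual code:

```python
import pandas as pd

def save_frame(df: pd.DataFrame, file_format: str, path_stem: str, **kwargs) -> str:
    """Hypothetical sketch of format dispatch: look up the matching pandas
    writer (to_csv, to_json, ...) and fail early on an unknown format."""
    writer = getattr(df, f"to_{file_format}", None)
    if writer is None:
        raise ValueError(f"unsupported format: {file_format!r}")
    out_path = f"{path_stem}.{file_format}"
    writer(out_path, **kwargs)  # e.g. df.to_csv("example.csv", index=False)
    return out_path

df = pd.DataFrame({"code": ["BR"], "name": ["Brazil"]})
saved = save_frame(df, "csv", "example", index=False)
```

Passing a format with no matching `to_<format>` method raises immediately, which is what "make sure a valid format was passed" amounts to in practice.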
I chose to set up the PostgreSQL database on WSL because I intend to orchestrate this whole project using Apache Airflow but do not yet fully understand how to do it with Docker. And because Airflow only runs in a Linux environment, I ended up learning to set it up in WSL, since I do not have access to a Docker license at work. (This might be a good option if you are in the same situation.)
A Docker implementation will be added to this repository later.
To install a WSL distro on your machine, open a PowerShell terminal and type:
wsl --install -d Ubuntu
You can choose any distro; in this case I chose Ubuntu. This may take a few minutes, and after installing, you will be asked to create a user with a password.
After setting up the user, run the commands below:
sudo apt update && sudo apt upgrade
# Install pip
sudo apt install python3-pip
# Install the PostgreSQL client library (needed to build psycopg2)
sudo apt install libpq-dev
# The steps below add the PostgreSQL 15 apt repository and its public key
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget -qO- https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo tee /etc/apt/trusted.gpg.d/pgdg.asc &>/dev/null
sudo apt update && sudo apt upgrade
sudo apt install postgresql postgresql-client -y
pip3 install psycopg2
sudo apt update && sudo apt upgrade
# Start the PostgreSQL service
sudo service postgresql start
# Go into the PostgreSQL interactive console to
# issue setup commands
sudo -u postgres psql
If all commands ran correctly, you should now be at the Postgres prompt in Ubuntu. Run the following commands:
-- Create the user
CREATE USER gabriel_canuto PASSWORD 'countries_dw';
-- Create the countries database and set up all permissions
CREATE DATABASE countries;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO gabriel_canuto;
ALTER USER gabriel_canuto SET search_path = public;
GRANT ALL PRIVILEGES ON DATABASE countries TO gabriel_canuto;
ALTER USER gabriel_canuto CREATEDB;
-- Connect to the countries database
\c countries;
GRANT ALL ON SCHEMA public TO gabriel_canuto;
You might want to customize the user and password.
Your Postgres database is now set up in WSL.
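As a quick check from Python, the user, password, and database created above can be assembled into a libpq connection string. The helper below is a hypothetical illustration, not part of the project:

```python
# Hypothetical helper: assemble a libpq connection string for the database
# created above. With the PostgreSQL service running
# (sudo service postgresql start), the DSN can be passed to psycopg2.connect().
def build_dsn(user: str, password: str, dbname: str,
              host: str = "127.0.0.1", port: int = 5432) -> str:
    return (f"host={host} port={port} dbname={dbname} "
            f"user={user} password={password}")

dsn = build_dsn("gabriel_canuto", "countries_dw", "countries")
# import psycopg2
# conn = psycopg2.connect(dsn)  # requires a running PostgreSQL service
```

The same host, port, user, and password values are what the dbt profile in the next section expects.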
If you used Poetry to manage dependencies, dbt-core and dbt-postgres (our adapter) should already be installed. The model is also already created and ready to run in the dbt folder. But first, we need to set up the dbt profile for this project on your machine.
The profiles.yml file can be found at:
C:\Users\<your_username>\.dbt\profiles.yml
Copy and paste the code below to set up this profile for dbt:
countries_dw:
  outputs:
    postgres:
      type: postgres
      threads: 2
      host: 127.0.0.1
      port: 5432
      user: gabriel_canuto
      pass: countries_dw
      dbname: countries
      schema: public
  target: postgres
If you customized your user and password in Postgres, change these values accordingly before saving the file.
To use the created model, make sure your terminal is inside the src/dbt folder. If it is not, type:
cd src/dbt
Inside the dbt folder, run:
dbt debug
to check whether the profile is set up correctly. If something goes wrong, you might want to:
- make sure you are inside the src/dbt folder
- make sure the database is running and accepting connections
- check your profiles.yml file again
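When rechecking profiles.yml, one common failure is a missing key in the output block. A small sketch of catching that from Python (the profile here is the parsed form of the file above, e.g. the result of yaml.safe_load; the helper is an illustration, not a dbt API):

```python
# Fields the dbt-postgres adapter expects in an output block.
REQUIRED = {"type", "host", "port", "user", "pass", "dbname", "schema"}

def missing_keys(profiles: dict, profile: str = "countries_dw",
                 target: str = "postgres") -> list:
    """Return the required keys absent from the chosen output, so an
    incomplete profiles.yml is caught before running `dbt debug`."""
    output = profiles[profile]["outputs"][target]
    return sorted(REQUIRED - set(output))

# The profile from this README, as it would look after yaml.safe_load:
parsed = {
    "countries_dw": {
        "outputs": {
            "postgres": {
                "type": "postgres", "threads": 2,
                "host": "127.0.0.1", "port": 5432,
                "user": "gabriel_canuto", "pass": "countries_dw",
                "dbname": "countries", "schema": "public",
            }
        },
        "target": "postgres",
    }
}
gaps = missing_keys(parsed)  # empty list when the profile is complete
```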
If all checks passed, just run:
dbt seed
dbt run
To deploy this project to your database.
To check out the project documentation, please run:
dbt docs generate
dbt docs serve
A web server will start on localhost:8080 and open in your browser.
Information on how this project was built
Data from the countries API was parsed into CSV files and saved in the dbt seeds folder.
Note: according to the dbt seed docs, this is a poor use of the feature; seeds are meant for small, static data worth keeping under version control. But for the sake of simplicity, and because our data is small (less than 1 MB), this feature was used to load data into the warehouse.
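Since the justification for seeding is that the data is small, a quick sanity check of the seeds folder size backs that up. A sketch (the throwaway tmp_seeds folder below stands in for src/dbt/seeds):

```python
import os

def folder_size_bytes(path: str) -> int:
    """Total size in bytes of all files under path (e.g. src/dbt/seeds)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Demo with a throwaway folder; point this at src/dbt/seeds in the repo.
os.makedirs("tmp_seeds", exist_ok=True)
with open(os.path.join("tmp_seeds", "languages.csv"), "w") as f:
    f.write("code,language\npor,Portuguese\n")

size = folder_size_bytes("tmp_seeds")  # well under the 1 MB rule of thumb
```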
Seed information is hidden from the docs and declared as a source in the raw schema.
The following code was run in the project environment to save the CSVs in the dbt seeds folder.
from src.hook.countries_hook import CountriesHook
file_format = 'csv'
hook = (CountriesHook()
        .parse_countries()
        .parse_tables()
        .save_tables(file_format, destination=r'C:\Git\countries_dw\src\dbt\seeds', index=False)
        )
Then, in the seeds folder, a properties.yml file was configured to load the data and hide the seeds from the documentation, using the format below:
version: 2

seeds:
  - name: table
    config:
      enabled: true
      docs:
        show: false
      database: countries
      schema: raw
      column_types:
        col1: int
        col2: varchar(255)
The seeded files were declared as a source in the model schema, along with models for the staging schema, including constraints.
Difficulties found along the way
I found it difficult to understand how seeds, sources, and models work in dbt. I learned the hard way why the naming convention for models is <schema>_<model_name>: I had name conflicts because I wanted to give both a seed and a model the name "languages", and dbt uses the same function (ref) to refer to both seeds and models.
I also learned about the -- depends_on: {{ ref('model') }} hint to resolve dependencies in models. Because of concurrency, a model would sometimes start to run before its dependencies were created, causing an error; the depends_on hint solves this.
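For reference, the hint is just a Jinja-rendered comment at the top of a model file. A sketch with hypothetical names (stg_languages, raw_languages, and the macro are illustrations, not this project's actual models):

```sql
-- Hypothetical model file, e.g. models/staging/stg_languages.sql.
-- dbt cannot infer the dependency below because the relation name is
-- built inside a macro, so it is declared explicitly:
-- depends_on: {{ ref('raw_languages') }}

select *
from {{ some_macro_that_builds_the_relation('raw_languages') }}
```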
- Set up WSL
- Set up a PostgreSQL database to store tables
- Create the data model and documentation
- Finish model creation and docs
- Create an Airflow instance to orchestrate tasks
- Deploy the project using Docker
- Start data visualization for countries_dw