This repository contains the course project for the Data Engineering Zoomcamp (Cohort 2023) organized by the DataTalks.Club community. The project covers the main data engineering skills taught in the course:
- Workflow Orchestration: Data Lake, Prefect, ETL with GCP & Prefect
- Data Warehouse: BigQuery
- Analytics Engineering: dbt (data build tool)
- Data Analysis and Visualization: Looker Studio
TL;DR: This project analyzes the Global Terrorism Dataset. Follow the steps under How to reproduce this project? to set it up.
The data has been downloaded from the Global Terrorism Database (GTD).
The Global Terrorism Database™ (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2021 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 200,000 cases.
Characteristics of the GTD
- 1970 to 2021 data
- Contains information on over 200,000 terrorist attacks
- Currently the most comprehensive unclassified database on terrorist attacks in the world
- Includes information on more than 88,000 bombings, 19,000 assassinations, and 11,000 kidnappings since 1970
- Includes information on at least 45 variables for each case, with more recent incidents including information on more than 120 variables
- More than 4,000,000 news articles and 25,000 news sources were reviewed to collect incident data from 1998 to 2017 alone
The dataset has about 120 columns, but for this project I selected only the columns relevant to my analysis (27 in total). The following columns are used (a pandas selection sketch follows the table):
# | Attribute | Description |
---|---|---|
1 | event_id | A unique identifier of the terrorist attack record. |
2 | event_date | Reconstruction of the date from the attributes: year, month, day. |
3 | country | The country or location where the incident occurred. |
4 | region | The region in which the incident occurred. |
5 | provstate | The name (at the time of the event) of the first-order subnational administrative region in which the event occurred. |
6 | city | The name of the city, village, or town in which the incident occurred. |
7 | summary | A brief narrative summary of the incident, noting the “when, where, who, what, how, and why.” |
8 | reason1 | Inclusion criterion: the violent act must be aimed at attaining a political, economic, religious, or social goal. |
9 | reason2 | Inclusion criterion: there must be evidence of an intention to coerce, intimidate, or convey some other message to a larger audience (or audiences) than the immediate victims. |
10 | reason3 | Inclusion criterion: the action must be outside the context of legitimate warfare activities, insofar as it targets non-combatants. |
11 | doubt_terrorism_proper | Flags cases where there is some uncertainty as to whether the incident meets all of the criteria for inclusion. |
12 | attack_type | The general method of attack, often reflecting the broad class of tactics used. |
13 | target_type | The general type of target/victim. |
14 | weapon_type | The general type of weapon used in the incident. |
15 | perpetrator_group_name | The name of the group that carried out the attack. |
16 | nkill | The total number of confirmed fatalities for the incident. |
17 | nwound | The number of confirmed non-fatal injuries to both perpetrators and victims. |
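As a rough sketch of this column selection and the event_date reconstruction, assuming the raw column names from the public GTD codebook (`eventid`, `iyear`, `country_txt`, and so on) and an illustrative renaming map (not this repo's exact code), the pandas side might look like this:

```python
import pandas as pd

# Illustrative mapping from raw GTD codebook names to the project's column names.
# The raw names are assumptions based on the public codebook, not this repo's exact code.
COLUMN_MAP = {
    "eventid": "event_id",
    "country_txt": "country",
    "region_txt": "region",
    "provstate": "provstate",
    "city": "city",
    "summary": "summary",
    "crit1": "reason1",
    "crit2": "reason2",
    "crit3": "reason3",
    "doubtterr": "doubt_terrorism_proper",
    "attacktype1_txt": "attack_type",
    "targtype1_txt": "target_type",
    "weaptype1_txt": "weapon_type",
    "gname": "perpetrator_group_name",
    "nkill": "nkill",
    "nwound": "nwound",
}

df = pd.read_csv(
    "globalterrorismdb.csv",  # placeholder file name
    usecols=list(COLUMN_MAP) + ["iyear", "imonth", "iday"],
    low_memory=False,
)
df = df.rename(columns=COLUMN_MAP)

# Reconstruct event_date from year/month/day; the GTD uses 0 for unknown month/day,
# so clip those to 1 and let any remaining invalid dates fall back to NaT.
date_parts = pd.DataFrame({
    "year": df["iyear"],
    "month": df["imonth"].clip(lower=1),
    "day": df["iday"].clip(lower=1),
})
df["event_date"] = pd.to_datetime(date_parts, errors="coerce")
df = df.drop(columns=["iyear", "imonth", "iday"])
```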
- Cloud: Google Cloud
- Infrastructure: Terraform
- Orchestration: Prefect
- Data lake: Google Cloud Storage
- Data transformation: dbt
- Data warehouse: BigQuery
- Data visualization: Google Looker Studio
- Which countries have had the most attacks?
- Which cities have had the most attacks?
- What is the most common type of attack?
- What are the main reasons for these attacks?
- What are the main targets?
- Who are the main perpetrators?
- The total number of deaths since 1970
- The number of deaths per country
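For example, the first question above can be answered with a simple aggregation. A sketch using the google-cloud-bigquery client (the `terrorism.attacks` table name is a placeholder, not the project's actual identifier):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset.table name -- replace with the table created by the pipeline.
query = """
    SELECT country, COUNT(*) AS attack_count
    FROM `terrorism.attacks`
    GROUP BY country
    ORDER BY attack_count DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(f"{row.country}: {row.attack_count}")
```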
Click here to see the dashboard.
Step 1: Clone this repo and install necessary requirements
- Clone the repo onto your local machine:

```bash
git clone https://github.com/lironesamoun/data-engineering-capstone-project.git
```

- Create a virtual environment and install all required dependencies:

```bash
make install
```
- Create a Google Cloud Platform (GCP) free account with your Google e-mail
- Create a new GCP project with the name de-capstone-project-23 (Note: save the assigned Project ID; project IDs are unique, so a different ID will be assigned)
- Create a Service Account:
  - Go to IAM & Admin > Service accounts > Create service account
  - Provide a service account name and grant the roles: Viewer + BigQuery Admin + Storage Admin + Storage Object Admin + Compute Storage Admin
  - Download the Service Account JSON key file
- Download the SDK for local setup
- Set the environment variable pointing to your downloaded GCP key in the .env file:

```bash
TF_VAR_GCP_CREDS="<path/to/your/service-account-authkeys>.json"
```

- Refresh the token/session and verify authentication:

```bash
gcloud auth application-default login
```
- Enable the following APIs:
- Install Terraform
- Execute the following command to provision the GCP infrastructure:

```bash
make init-infrastructure
```

You can verify in the GCP console that the infrastructure was created correctly. At the end of the process, you should see the bucket and the BigQuery table created. The resource names can be changed via the variables in the .env file.
- Set up the Prefect server so that you can access the UI. Run the following commands in a terminal:

```bash
make prefect-init
make prefect-start
```
- Access the UI in your browser: http://127.0.0.1:4200/
- To connect to the GCS bucket, you need to initialize the Prefect blocks. Fill in the GitHub environment variables in the .env file:

```bash
GITHUB_REPOSITORY_URL="your repo"
GITHUB_REPO_ACCESS_TOKEN=""
```

Then run:

```bash
make prefect-init-blocks
```
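For reference, a block registration like the one `make prefect-init-blocks` performs could look roughly like the sketch below (the block names and bucket name are assumptions, not necessarily what the Makefile target actually creates):

```python
import os

from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Credentials block backed by the service-account key path from .env
gcp_credentials = GcpCredentials(
    service_account_file=os.environ["TF_VAR_GCP_CREDS"]
)
gcp_credentials.save("de-capstone-gcp-creds", overwrite=True)  # hypothetical block name

# GCS bucket block that the flows use as the data lake
gcs_bucket = GcsBucket(
    bucket="de-capstone-project-23-bucket",  # hypothetical bucket name
    gcp_credentials=gcp_credentials,
)
gcs_bucket.save("de-capstone-gcs-bucket", overwrite=True)  # hypothetical block name
```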
- To execute the flow, run the following command in a different terminal from the one running the Prefect server:

```bash
python3 src/flows/parameterized_flow_http_pipeline.py
```
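The flow moves the data from the HTTP source into the GCS data lake and then into BigQuery. The sketch below shows the general shape of such a pipeline with Prefect 2 and prefect-gcp; the URL, block names, and table name are placeholders, and the real flow in `src/flows/parameterized_flow_http_pipeline.py` differs in its details:

```python
import pandas as pd
from prefect import flow, task
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket


@task(retries=3)
def extract(url: str) -> pd.DataFrame:
    # Download the raw dataset over HTTP
    return pd.read_csv(url)


@task
def load_to_gcs(df: pd.DataFrame, path: str) -> None:
    # Write a parquet file into the data-lake bucket (hypothetical block name)
    df.to_parquet("/tmp/gtd.parquet")
    GcsBucket.load("de-capstone-gcs-bucket").upload_from_path("/tmp/gtd.parquet", path)


@task
def load_to_bq(df: pd.DataFrame, table: str) -> None:
    # Append the data to a BigQuery table (hypothetical block name)
    creds = GcpCredentials.load("de-capstone-gcp-creds")
    df.to_gbq(
        destination_table=table,
        credentials=creds.get_credentials_from_service_account(),
        if_exists="append",
    )


@flow
def end_to_end_pipeline_from_http_to_bq(url: str, table: str) -> None:
    df = extract(url)
    load_to_gcs(df, "raw/gtd.parquet")
    load_to_bq(df, table)


if __name__ == "__main__":
    end_to_end_pipeline_from_http_to_bq(
        url="https://example.com/gtd.csv",  # placeholder source URL
        table="terrorism.attacks",          # placeholder dataset.table
    )
```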
- Create a free dbt Cloud account
- Clone this repo
- In the dbt command line, run the following command:

```bash
dbt run --vars 'is_test_run: false'
```

You should see the following lineage in dbt:
Create a production environment in dbt Cloud and add the following commands to the job:

```bash
dbt seed
dbt build --vars 'is_test_run: false'
dbt test
```
Execute the flow.
After running all those steps, you should see the following table created in Google BigQuery:
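If you prefer to check from the command line rather than the console, a small sketch with the BigQuery client (the dataset name `terrorism` is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()

# List the tables in the dataset created by the pipeline (placeholder dataset name)
for table in client.list_tables("terrorism"):
    print(table.full_table_id)
```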
To tear down the GCP infrastructure and remove the local environment, run:

```bash
make destroy-infrastructure
make uninstall
```
- Add unit tests
- Improve the CI/CD pipeline
- Containerize the project
- Perform deeper data analysis
- Add pipeline for Machine Learning
Commands to start a Prefect agent and to build and apply the flow deployment:

```bash
prefect agent start -p 'default-agent-pool'

prefect deployment build -n "Online Parameterized ETL" -p default-agent-pool -q main-queue \
    src/flows/parameterized_flow_http_pipeline.py:end_to_end_pipeline_from_http_to_bq

prefect deployment apply end_to_end_pipeline_from_http_to_bq-deployment.yaml
```
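If you prefer to build the deployment from Python instead of the CLI, Prefect 2 (as used for the 2023 cohort) also supports this; a sketch, assuming the flow can be imported from the path shown above:

```python
from prefect.deployments import Deployment

# Hypothetical import path mirroring src/flows/parameterized_flow_http_pipeline.py
from src.flows.parameterized_flow_http_pipeline import end_to_end_pipeline_from_http_to_bq

deployment = Deployment.build_from_flow(
    flow=end_to_end_pipeline_from_http_to_bq,
    name="Online Parameterized ETL",
    work_queue_name="main-queue",
)
deployment.apply()
```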