SamuelBFavarin/door2door

door2door 🚙

Door2Door is a data pipeline that ingests GPS sensor data from vehicles, and creates various data models following a bronze/silver/gold data warehouse structure. The pipeline provides useful insights, such as the average distance traveled by vehicles during a specific operating period.
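As an illustration of the "average distance traveled" insight, here is a minimal Python sketch. It assumes each vehicle's GPS readings are available as time-ordered (latitude, longitude) pairs and that the haversine great-circle formula is an acceptable approximation. In the project itself this logic lives in the dbt models, so the function names and data shape below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trip_distance_km(points):
    """Sum the distances between consecutive points (assumed time-ordered)."""
    return sum(haversine_km(*p, *q) for p, q in zip(points, points[1:]))


def average_distance_km(trips):
    """Average distance per vehicle; `trips` maps a vehicle id to its points."""
    if not trips:
        return 0.0
    return sum(trip_distance_km(pts) for pts in trips.values()) / len(trips)
```

For example, `average_distance_km({"v1": [(52.52, 13.40), (52.53, 13.41)]})` returns the length of that single two-point track in kilometres.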

The project uses Python for the API ingestion step and dbt for data quality checks and transformations. The entire project is hosted on Google Cloud Platform (GCP) and leverages various GCP technologies, including:

  • Cloud Functions, which execute the Python ingestion process

  • Cloud Storage, which stores raw data and hosts dbt documentation

  • BigQuery, which serves as the data warehouse

  • Cloud Build, which runs the dbt Docker image to execute the transformations and quality checks

  • Cloud Scheduler, which serves as the orchestrator

Documentation 📃

dbt DW documentation:

Browse the data warehouse (DW) documentation generated by dbt. Click here

Lineage

Solution Diagram:

This is the diagram of the pipeline v1 (current version):

  • Source data in a Cloud Storage bucket;
  • Orchestration via Cloud Scheduler;
  • Ingestion script in Python hosted on Cloud Functions (source to DW);
  • Transformation process using dbt hosted on Cloud Build;
  • BigQuery as the data warehouse, with bronze, silver and gold datasets.

Door2door pipeline v1

This is the diagram of the pipeline v2 (future/idealised version):

  • Source data in a Cloud Storage bucket;
  • Orchestration via Airflow on Cloud Composer;
  • Ingestion script in Python hosted on Cloud Run (source to DW);
  • Transformation process using dbt hosted on Cloud Build;
  • BigQuery as the data warehouse, with bronze, silver and gold datasets.

Door2door pipeline v2

Pipeline Orchestration ⏰

  • The ingestion process runs at 00:00 UTC and ingests all data from the previous day.
  • The transformation process runs at 00:15 UTC and runs all models marked daily.
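The "previous day" window used by the ingestion step can be sketched as below. This is an illustration only, not the project's actual code: the function name is hypothetical, and it assumes the window is the full previous UTC calendar day, expressed as a half-open interval.

```python
from datetime import datetime, timedelta, timezone


def previous_day_window(now=None):
    """Return (start, end) of the previous UTC day as a half-open interval."""
    # Default to the current time; callers should pass a timezone-aware value.
    now = now or datetime.now(timezone.utc)
    today_start = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return today_start - timedelta(days=1), today_start
```

A run triggered at 00:00 UTC on 1 March would therefore cover everything from 00:00 UTC on the last day of February up to, but not including, 00:00 UTC on 1 March.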

How to run 💻

To run the Door2Door project locally, follow these steps:

  1. Clone the repo to your local machine;
  2. Request and download a GCP service account key;
  3. Save the service account credentials as /credentials/sa-gcp-key.json;
  4. Use the Makefile to run the ingestion and transformation steps. Run these commands from the repo folder:
    • To execute the ingestion process, run make run-ingestion. By default, this command ingests all data from the previous day;
    • To execute the transformation process, run make run-dbt.
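Before running the Makefile targets, it can help to confirm that the credentials file is in place. The sketch below is an optional pre-flight check and is not part of the repo. The function name is hypothetical, and the required fields are an assumption based on the standard fields of a GCP service account key file.

```python
import json
from pathlib import Path

# Fields assumed to be present in a GCP service account key file.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def check_service_account_key(path):
    """Return a list of problems found with the key file (empty if none)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"missing file: {key_path}"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["key file must contain a JSON object"]
    problems = [f"missing field: {k}" for k in sorted(REQUIRED_KEYS - set(data))]
    if data.get("type") != "service_account":
        problems.append('field "type" should be "service_account"')
    return problems
```

Run from the repo folder, `check_service_account_key("credentials/sa-gcp-key.json")` should return an empty list when the key is usable. The relative path assumes the credentials folder sits at the repo root.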

How to access the GCP Project ☁️

To access the links below, you need access to the GCP project. If you would like to take a look, please let me know at [email protected], including your e-mail address, and I can grant you access.