Door2Door is a data pipeline that ingests GPS sensor data from vehicles and builds data models following a bronze/silver/gold data warehouse structure. The pipeline provides useful insights, such as the average distance traveled by vehicles during a specific operating period.
The project uses Python for the API ingestion step and dbt for data quality checks and transformations. The entire project is hosted on Google Cloud Platform (GCP) and leverages several GCP services:
- Cloud Functions, which executes the Python ingestion process (see the sketch after this list)
- Cloud Storage, which stores raw data and hosts the dbt documentation
- BigQuery, which serves as the data warehouse
- Cloud Build, which runs the dbt Docker image and executes the transformations and quality checks
- Cloud Scheduler, which serves as the orchestrator
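For illustration, a minimal sketch of what the ingestion Cloud Function could look like. The API URL, bucket name, and blob layout here are assumptions for the example, not the repo's actual values:

```python
import json
from datetime import datetime, timedelta, timezone

import requests
from google.cloud import storage

# Hypothetical values -- the real API URL and bucket name live in the repo's config.
API_URL = "https://example.com/v1/vehicle-positions"
RAW_BUCKET = "door2door-raw-data"


def ingest(request):
    """HTTP-triggered Cloud Function: pull yesterday's GPS events and land them as raw JSON."""
    target_day = (datetime.now(timezone.utc) - timedelta(days=1)).date()

    response = requests.get(API_URL, params={"date": target_day.isoformat()}, timeout=60)
    response.raise_for_status()

    # Land the raw payload in the bucket, partitioned by day (bronze layer).
    blob_path = f"bronze/{target_day.isoformat()}/vehicle_positions.json"
    bucket = storage.Client().bucket(RAW_BUCKET)
    bucket.blob(blob_path).upload_from_string(
        json.dumps(response.json()), content_type="application/json"
    )
    return f"Wrote {blob_path}", 200
```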
Follow the DW documentation generated by dbt: Click here.
This is the diagram of the pipeline v1 (current version):
- Source data on a Cloud Storage bucket;
- Orchestration via Cloud Scheduler;
- Ingestion script in Python hosted on Cloud Functions (Source to DW); a sketch of the load step follows this list;
- Transformation process using dbt hosted on Cloud Build;
- BigQuery as the Data Warehouse, with bronze, silver, and gold datasets.
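The "Source to DW" step presumably finishes by loading the raw files into the bronze dataset. A sketch using the BigQuery Python client, with made-up project, bucket, and table names, and assuming the raw files are newline-delimited JSON:

```python
from google.cloud import bigquery

# Hypothetical project, bucket, and table names for illustration only.
client = bigquery.Client(project="door2door-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the raw files
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://door2door-raw-data/bronze/2023-01-01/*.json",
    "door2door-project.bronze.vehicle_positions",
    job_config=job_config,
)
load_job.result()  # block until the load finishes
```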
This is the diagram of the pipeline v2 (future/idealised version):
- Source data on a Cloud Storage bucket;
- Orchestration via Airflow on Cloud Composer (see the DAG sketch after this list);
- Ingestion script in Python hosted on Cloud Run (Source to DW);
- Transformation process using dbt hosted on Cloud Build;
- BigQuery as the Data Warehouse, with bronze, silver, and gold datasets.
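With Cloud Composer replacing Cloud Scheduler, the two jobs could be chained explicitly instead of being staggered by a fixed 15 minutes. A minimal Airflow DAG sketch; the DAG id and task commands are placeholders for the real Cloud Run and Cloud Build triggers:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG only; the bash commands stand in for the real triggers.
with DAG(
    dag_id="door2door_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 0 * * *",  # 00:00 UTC, matching the v1 schedule
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="run_ingestion",
        bash_command="echo 'trigger Cloud Run ingestion job here'",
    )
    transform = BashOperator(
        task_id="run_dbt",
        bash_command="echo 'trigger Cloud Build dbt job here'",
    )

    ingest >> transform  # dbt runs only after ingestion succeeds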
- The ingestion process runs at 00:00 UTC and ingests all data from the day before.
- The transformation process runs at 00:15 UTC, and all models tagged `daily` will run.
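Selecting models by tag uses dbt's standard node selection syntax. A sketch of what the scheduled transformation step might effectively execute inside the dbt container; the Python wrapper is an assumption, but `--select tag:daily` is standard dbt:

```python
import subprocess

# Run, then test, only the models tagged `daily` -- mirroring the
# "transformation and quality checks" step described above.
for command in (
    ["dbt", "run", "--select", "tag:daily"],
    ["dbt", "test", "--select", "tag:daily"],
):
    subprocess.run(command, check=True)
```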
To run the Door2Door project locally, follow these steps:
- Clone the repo to your local machine;
- Request and download a GCP service account key;
- Add the service account credentials at `/credentials/sa-gcp-key.json` (a quick way to verify the key works is sketched after this list);
- Use the `Makefile` to run the ingestion and transformation steps. You need to run these commands in the repo folder!
  - To execute the ingestion process, run `make run-ingestion`. By default, this command ingests all data from the previous day;
  - To execute the transformation process, run `make run-dbt`.
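Before invoking the Makefile targets, you can check that the service account key is picked up correctly. This snippet is not part of the repo, just a sanity check:

```python
import os

from google.cloud import bigquery

# Point the Google client libraries at the key (path relative to the repo root).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/sa-gcp-key.json"

# If the key is valid, this prints the project the credentials belong to.
client = bigquery.Client()
print(f"Authenticated against project: {client.project}")
```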
To access the links below, you must have access to GCP. If you want to check them, please let me know via [email protected], informing your e-mail address, and I can provide the access for you!