Below is a Docker map that provides a visual overview of the interrelationships between all services integrated within the platform.
Before installing these data services, make sure your machine meets the following minimum system requirements:
- CPU >= 4 Core
- RAM >= 8GB
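If you are not sure what your machine or VPS provides, a quick check on Linux looks like this:

```bash
nproc     # number of CPU cores
free -h   # total RAM in human-readable form
```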
Due to the limitations in computing resources, I decided to distribute the data services across machines based on their resource consumption and the interrelationships among all services, rather than consolidating everything on a single server 🔥
For this project, I chose to rent a VPS, which is an excellent option for running side projects; there are plenty of other hosting providers you could pick instead.
- Lightdash only accepts secure connections to Trino using username and password authentication; unauthenticated connections are not permitted
- Trino must be served over SSL/TLS to support username and password authentication
- There are several methods to add SSL/TLS to Trino. For this project, I used a certificate from Certbot (Let’s Encrypt). Alternatively, you can implement SSL/TLS by setting up a proxy through Cloudflare 🫡
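As a rough illustration of the Certbot route (the project exposes a `get-cert` Makefile target for this purpose), the standalone flow looks roughly like the sketch below; the domain and email are placeholders, and standalone mode needs port 80 free on the host:

```bash
# Placeholders: replace the domain and email with your own values.
sudo certbot certonly --standalone \
  -d trino.yourdomain.duckdns.org \
  --email you@example.com --agree-tos --non-interactive
# Certificates are written under /etc/letsencrypt/live/<domain>/ by default.
```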
In this project, we need a domain name for several important reasons, particularly to enable SSL/TLS certificates for secure user-password authentication in Trino.
One great option for a free domain name is DuckDNS 🦆, which allows you to point a domain to your server easily.
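For example, pointing a DuckDNS subdomain at your VPS is a single HTTP call; the subdomain, token, and IP below are placeholders from your DuckDNS account and server:

```bash
# Placeholders: yourdomain, the DuckDNS token, and the VPS public IP.
curl "https://www.duckdns.org/update?domains=yourdomain&token=<duckdns-token>&ip=<vps-public-ip>"
# DuckDNS answers "OK" when the record has been updated.
```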
🤑 Alternatively, you can purchase a domain from providers like Name.com, Namecheap, or GoDaddy.
Beyond securing Trino, having a domain offers additional benefits. It simplifies access to other services, like the Airflow Webserver, Lightdash, the DBT documentation, and the Spark UI, allowing you to use a recognizable domain instead of the less convenient `IP:port` format for accessing these services 🙂
Two solutions you can try to implement SSL/TLS:

| | Solution 1 | Solution 2 |
|---|---|---|
| Participants | Cloudflare & your own domain name | Certbot & a DuckDNS domain name |
| Description | Add the Cloudflare proxy to your DNS and enable the secure connection options to implement SSL/TLS | Obtain a certificate from a Certificate Authority (CA) and integrate it with your proxy server (e.g., Nginx) |
If you already got SSL/TLS from your domain name provider when you purchased the domain, feel free to skip this section.
Before running the installation commands, make sure that Docker and Docker Compose are installed on your machine. To start the data services in this project, jump to the `docker` directory and run a target from the `Makefile`, or use a `docker` command directly if you need more customization.
Usage: make [target]
Targets:
create-network Create docker network if not exists
compose-up-all Compose up with all services
compose-up-spark1 Compose up with 1 spark worker
compose-up-spark2 Compose up with 2 spark workers
compose-up-spark3 Compose up with 3 spark workers
compose-up-certbot Compose up with certbot
compose-up-trino Compose up with 1 trino workers
compose-down Compose down all services then remove network
get-cert Get certbot certificate
spark-submit Spark master submit
backup Backup OLTP, Minio and Hive metastore data
init-db Initialize adventureworks database
clean Clean temp data
clean-all Clean all
help Show this help message
Makefile (Source: Author)
For example, to start all services:
cd docker
cp .env.example .env
make compose-up-all
If you can't use the `Makefile`, you can run the following commands:
cd docker
cp .env.example .env
docker network inspect data-network > /dev/null 2>&1 || docker network create --driver bridge data-network
docker compose --profile spark --profile trino --env-file .env up -d
Remember to review and update the environment variables in the `.env` file. Additionally, you might need to adjust the `docker-compose.yaml` file itself, such as changing image versions, service profiles, port mappings, or volume mounts, based on your specific deployment environment and requirements. After making any changes, please re-run `make compose-up-all`.
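Once the stack is up, it is worth confirming that every container is healthy before moving on; `trino` below is just an example, so use any service name from `docker-compose.yaml`:

```bash
cd docker
docker compose --env-file .env ps             # list services and their status
docker compose --env-file .env logs -f trino  # follow the logs of a single service (example name)
```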
📂 Project structure
.
├── adw14/ /* AdventureWorks data folder */
├── airflow/ /* Airflow folder contains DAGs, configurations and docker */
├── assets/ /* Contains various project assets like pdf, images, etc. */
├── dbt/ /* DBT project folder contains project files and docker */
├── docker/ /* Docker configurations and related files of data product */
│ ├── backup/ /* Backup files for databases or components */
│ ├── certbot/ /* Certbot folder for managing SSL certificates */
│ ├── lightdash/ /* Lightdash folder for business intelligence tools */
│ ├── minio/ /* MinIO for object storage (S3-compatible) */
│ ├── nginx/ /* Nginx web server and reverse proxy configurations */
│ ├── spark/ /* Apache Spark cluster configurations */
│ ├── trino/ /* Trino cluster configurations */
│ ├── volumes/ /* Docker volumes for persistent data storage */
│ ├── .env.example /* Template of environment variables file */
│ ├── docker-compose.yaml /* Docker Compose file to define data product services */
│ ├── Makefile /* Makefile for automating tasks and commands */
├── .gitignore
├── Author.md
├── LICENSE
├── README.md
└── References.md
Adventureworks data (Source: Author)
AdventureWorks is used as the OLTP database for our project. As a comprehensive operational database published by Microsoft, AdventureWorks covers a wide range of business domains. For full visibility into its data schema, you can reference Dataedo. To align with the requirements of our data platform, we migrated this database to PostgreSQL, simulating a real-world database environment that acts as the primary data source for our data lake.
How to initialize AdventureWorks data
After launching the OLTP service (PostgreSQL) in the data platform, you can import data into the database by running the following command
cd docker
make init-db
Or you can run the bash script directly from the `adw14` folder:
cd adw14
chmod +x adw14data.sh
./adw14data.sh
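If you want a quick sanity check after the import, you can query the database from inside the PostgreSQL container; the container name, user, and database below are placeholders, so take the real values from your `.env`:

```bash
# Placeholders: container name, user, and database come from your .env.
docker exec -it <oltp-postgres-container> \
  psql -U postgres -d adventureworks -c "\dt sales.*"   # list tables in the sales schema
```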
Apache Airflow overview (Source: Author)
Apache Airflow serves as the data platform's orchestrator, managing and scheduling all data pipelines.
Airflow Docker map (Source: Author)
You can start Airflow by running the following commands:
cd airflow
make compose-init   # run this the first time you start Airflow, or when you want to rebuild the Airflow image
# or
# make compose-up
In this project, I created 2 connections in Airflow. Make sure you also create the required connections before running the DAGs.
Airflow connections (Source: Author)
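Connections can be created in the Airflow UI (Admin → Connections) or with the Airflow CLI inside the Airflow container. Below is a minimal sketch for a PostgreSQL connection to the OLTP database; the connection ID, host, and credentials are placeholders and must match whatever the DAGs reference:

```bash
# Placeholders: connection ID, host, and credentials must match what the DAGs expect.
airflow connections add 'postgres_oltp' \
  --conn-type 'postgres' \
  --conn-host '<oltp-host>' \
  --conn-port '5432' \
  --conn-login '<user>' \
  --conn-password '<password>' \
  --conn-schema 'adventureworks'
```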
Apache Spark overview (Source: Author)
Apache Spark acts as the data processing engine responsible for extracting and loading data from the OLTP source to the bronze layer in the data lake. Spark’s power lies in its versatility, handling not only data engineering tasks but also data science and machine learning workloads, whether on single-node machines or distributed clusters. However, for the purposes of this showcase, Spark is used in a minimal capacity to allow other tools in the data stack to take center stage 😜
Spark Docker map (Source: Author)
In this project, I initialized the Spark cluster with 01 master node and 01 worker node. You can easily scale the number of worker nodes by specifying the `--scale spark-worker=N` parameter, where `N` is the desired number of worker nodes.
cd docker
make compose-up-spark1 #To init 01 worker node
# make compose-up-spark2 #To init 02 worker nodes
# or you can run this command without Make to init 02 worker nodes
docker compose --profile spark --scale spark-worker=2 up -d
Check the `Makefile` in the `docker` folder for more management commands.
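The `spark-submit` target in the Makefile presumably wraps something along these lines; the container name, master URL, and job path below are assumptions for illustration only:

```bash
# Assumptions: container name, master URL, and job path are illustrative, not from the repo.
docker exec -it spark-master \
  spark-submit \
    --master spark://spark-master:7077 \
    /opt/spark/jobs/extract_oltp_to_bronze.py
```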
MinIO overview (Source: Author)
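Since MinIO provides the S3-compatible storage behind the data lake, you can browse what Spark has written to the bronze layer with the MinIO client. A small sketch, where the endpoint, credentials, and bucket name are placeholders:

```bash
# Placeholders: endpoint, credentials, and bucket name depend on your .env and docker-compose.yaml.
mc alias set local http://localhost:9000 <MINIO_ROOT_USER> <MINIO_ROOT_PASSWORD>
mc ls local/                                  # list buckets
mc ls --recursive local/<lakehouse-bucket>/   # inspect objects in a bucket
```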
😅 Oops, I don't know which image about the Hive metastore should be added to this article. We will dive into the Docker map to learn more about the relationship between the Hive metastore and the other services 🫡
Trino overview (Source: Author)
Trino functions as a query engine that can access and query data from multiple sources while performing transformations within its cluster. In our setup, we utilized Trino as a data warehouse, where DBT orchestrates the transformation jobs by triggering the execution directly within Trino.
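To poke at the warehouse interactively, you can open the Trino CLI that ships inside the coordinator container; the container name `trino` and the catalog name `delta` are assumptions, so match them to your `docker-compose.yaml` and catalog properties:

```bash
# Assumptions: container name "trino" and catalog "delta"; adjust to your setup.
docker exec -it trino trino --catalog delta --schema default
# Inside the Trino shell, for example:
#   SHOW SCHEMAS;
#   SELECT * FROM <schema>.<table> LIMIT 10;
```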
Delta catalog in Trino after transforming data (Source: Author)
Trino Docker map (Source: Author)
Fact sales lineage (Source: Author)
DBT Docker map (Source: Author)
We can use DBT with 02 options:
- Cloud: dbt Cloud by dbt Labs (getdbt.com)
- DBT-core: https://github.com/dbt-labs/dbt-core
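With the dbt-core route, which this project uses, a typical cycle from the `dbt` folder looks roughly like this, assuming the profile is already configured to point at Trino:

```bash
# Assumes profiles.yml already targets the Trino cluster.
cd dbt
dbt deps            # install package dependencies
dbt run             # build the models inside Trino
dbt test            # run data tests
dbt docs generate   # build the DBT documentation site mentioned earlier
```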
Data model: In this project, I created a simple data model with 3 dimensions. You can expand the model by adding more dimensions and metrics to the fact table, which would be a great way to enhance your data analysis skills.
✌🏻 With dim customers, you can have more insightful information from XML data in the Person schema. But I’m too lazy to do it now 😜 give this chance to you 🫵🏻
Sales mart data model (Source: Author)
Lightdash overview (Source: Author)
Lightdash is used as the BI tool in this project. I chose it because it is a perfect match for DBT and it is open source, so we can self-host it.
Daily delivery of the sales dashboard to stakeholders via email
Lightdash daily delivery email (Source: Author)
Stakeholders can download a PDF of the sales dashboard or access Lightdash directly to view the interactive version.
Lightdash Docker map (Source: Author)
We can use Lightdash with 02 options:
- Cloud: Lightdash Cloud
- Self-host: Self-Host Lightdash (Lightdash documentation), which is the option used in this project; see the CLI sketch below
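If you self-host, the Lightdash CLI is a handy way to push the dbt project into your instance. A sketch, where the instance URL is a placeholder and the CLI comes from npm:

```bash
# Placeholder URL: replace with your self-hosted Lightdash instance.
npm install -g @lightdash/cli
lightdash login https://lightdash.yourdomain.duckdns.org
cd dbt && lightdash deploy --create   # register the dbt project as a new Lightdash project
```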
As seen in the Docker setup, I've included several essential services such as Nginx, Certbot, and a metadata database for the data services. These components serve as the foundational building blocks of our data platform, ensuring the smooth operation of the project.
However, these services can be replaced by equivalent cloud solutions if needed:
Service | Cloud alternatives |
---|---|
Metadata database (Airflow, Lightdash, Hive metastore, OLTP) | Neon Serverless Postgres, Supabase |
MinIO | Amazon S3, Cloudflare R2, DigitalOcean Spaces Object Storage |
I came up with the idea for this side project after a meeting with my sister, Ms. Trang (MSc Business Informatics at the University of Mannheim, Germany). I would like to express my heartfelt gratitude to Ms. Trang for her inspiration and encouragement throughout my journey.
Thanks to @NhatNguyen's work and to Dify for the main references on the project's components and structure.
Thanks to @BaoTram for giving me the foundational knowledge about DBT that I applied to this project.
And a big thanks to all the authors of the articles, blogs, and videos I went through while building this project. You can find them in the References list here.
Star the project on GitHub and be instantly notified of new releases.
For those who'd like to contribute code, see our Contribution Guide. At the same time, please consider supporting me by sharing it on social media and at events and conferences.
Contributors
- GitHub Discussions. Best for: sharing feedback and asking questions.
- GitHub Issues. Best for: bugs you encounter using this project, and feature proposals.
- LinkedIn Post. Best for: sharing your applications and hanging out with the community.
This repository is available under the ThanhENC Open Source License, which is essentially MIT.