Data applications fetch information from the OLTP database through ETL pipelines. Subsequent transformations occur within the data warehouse, which follows the Medallion Lakehouse Architecture. This architecture facilitates data delivery to various consumers. The warehouse is structured into Bronze, Silver, and Gold stages for efficient data processing.
The Bronze stage stores raw data using Change Data Capture, preserving historical changes for Slowly Changing Dimension Type 1. This stage is useful for consumers such as ML applications and other predictive data applications.
Data undergoes transformation in the Silver Stage, involving type casting and data cleaning. Data historization is ensured through Slowly Changing Dimension Type 2. The structured nature of the data and the absence of Personally Identifiable Information (PII) in dimensions make in-house ELT the chosen method for data processing. Potential consumers of the Silver Stage include Data Marts serving various business objectives and supporting time-series-based analyses.
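As a rough illustration of the SCD Type 2 historization in the Silver Stage, the sketch below closes the current version of a changed row and inserts a new one. All schema, table, and column names (`stage.subscription_snapshot`, `silver.subscription`, `valid_from`/`valid_to`/`is_current`) are illustrative assumptions, not the project's actual definitions.

```python
# Sketch of an SCD Type 2 merge for the Silver stage; every object name here
# is an assumption made for illustration.
import psycopg2

SCD2_SQL = """
-- Close the current version of rows whose tracked attributes changed.
UPDATE silver.subscription AS tgt
SET    valid_to = now(), is_current = FALSE
FROM   stage.subscription_snapshot AS src
WHERE  tgt.id = src.id
  AND  tgt.is_current
  AND  (tgt.subscription_type, tgt.plan_duration)
       IS DISTINCT FROM (src.subscription_type, src.plan_duration);

-- Insert a new current version for new rows and for rows just closed above.
INSERT INTO silver.subscription
    (id, subscription_type, plan_duration, valid_from, valid_to, is_current)
SELECT src.id, src.subscription_type, src.plan_duration, now(), NULL, TRUE
FROM   stage.subscription_snapshot AS src
LEFT JOIN silver.subscription AS tgt
       ON tgt.id = src.id AND tgt.is_current
WHERE  tgt.id IS NULL;
"""


def run_scd2_merge(dsn: str) -> None:
    """Apply both statements in a single transaction."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SCD2_SQL)
```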
Transformed data within the data warehouse is modeled into a Kimball-style star schema in the Gold Stage. In this stage, the Account dimension and the Sales fact table track data changes as Slowly Changing Dimension Type 1. The "Never Process Fact" approach is applied to the transactional sales fact table to handle Late Arriving Dimensions. This fact table provides detailed information about sales events over time. The Calendar dimension serves as a Role-playing dimension, referenced multiple times by the fact table. The Account dimension can be utilized as a Conformed dimension when needed. Possible consumers of the Gold Stage include quarterly Business Intelligence applications and ad hoc queries.
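To make the role-playing use of the Calendar dimension concrete, a query along the following lines could join the sales fact to the same calendar table under two different aliases. The table, key, and measure names (`gold.fct_sales`, `gold.dim_calendar`, `join_date_key`, `last_payment_date_key`, `monthly_revenue`) are assumptions for the sake of the example.

```python
# Illustrative role-playing join: the same calendar dimension plays two roles
# (join date and last payment date). Object names are assumptions.
import psycopg2

ROLE_PLAYING_SQL = """
SELECT join_cal.year          AS join_year,
       pay_cal.quarter        AS payment_quarter,
       SUM(f.monthly_revenue) AS revenue
FROM   gold.fct_sales    AS f
JOIN   gold.dim_calendar AS join_cal ON join_cal.date_key = f.join_date_key
JOIN   gold.dim_calendar AS pay_cal  ON pay_cal.date_key  = f.last_payment_date_key
GROUP  BY join_cal.year, pay_cal.quarter;
"""


def quarterly_revenue(dsn: str):
    """Return revenue grouped by the two calendar roles."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(ROLE_PLAYING_SQL)
        return cur.fetchall()
```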
Prefect is used for monitoring and scheduling pipelines. The delay between pipelines is 15 minutes. To run the mentioned pipelines, you should log in to Prefect Cloud and run this script to create the connection:
$ cd .\scheduler\
$ prefect cloud login
After creating a connection between the local and cloud environments, you should edit the blocks that contain the credentials of the data warehouse and the transactional database. You can use a script or the UI to run the pipelines. They are scheduled to run each day, but they can also be triggered manually. You can use these scripts to deploy flows to Prefect Cloud. The mentioned steps should be repeated for each pipeline/flow. Scripts:
$ cd .\scheduler\
$ prefect deployment run '{flow_name/deployment_name}'
$ python {python file which contains flow}
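For reference, the flow file passed to `python` might look roughly like the sketch below. The task bodies are stubbed, and the block names (`oltp-database-dsn`, `data-warehouse-dsn`) are placeholders; the exact deployment mechanics depend on the Prefect version in use.

```python
# Minimal sketch of a Prefect flow file; task bodies and block names are placeholders.
from prefect import flow, task
from prefect.blocks.system import Secret


@task
def extract_daily_snapshot(oltp_dsn: str) -> list[dict]:
    """Pull the daily snapshot of inserted/updated rows from the OLTP database."""
    # The real task queries the transactional database; stubbed here.
    return []


@task
def load_to_bronze(rows: list[dict], dwh_dsn: str) -> None:
    """Append the snapshot to the Bronze stage of the warehouse."""
    # The real task inserts into the bronze schema; stubbed here.


@flow(name="to_bronze_flow")
def to_bronze_flow() -> None:
    # Credentials come from the Prefect blocks edited after `prefect cloud login`.
    oltp_dsn = Secret.load("oltp-database-dsn").get()
    dwh_dsn = Secret.load("data-warehouse-dsn").get()
    rows = extract_daily_snapshot(oltp_dsn)
    load_to_bronze(rows, dwh_dsn)


if __name__ == "__main__":
    # Running the file directly executes the flow once; scheduled runs are
    # handled through the Prefect Cloud deployment created with the CLI above.
    to_bronze_flow()
```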
The illustration below shows the pipelines, which are scheduled in the order presented.
Data extraction takes snapshots of data daily. The daily snapshot contains inserted and updated data.
The data that has been extracted is sent to the Bronze stage of the data warehouse using Change Data Capture.
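A hypothetical version of this extraction step is sketched below: rows inserted or updated since the previous run are selected from the source and appended to the Bronze stage together with a load date. The table names and the `created_at`/`updated_at` audit columns are assumptions.

```python
# Hypothetical daily snapshot extraction feeding the Bronze stage.
# Table and column names are assumptions made for illustration.
from datetime import date, timedelta

import psycopg2
from psycopg2.extras import execute_values

SNAPSHOT_SQL = """
SELECT id, subscription_type, monthly_revenue, plan_duration, updated_at
FROM   public.subscription
WHERE  created_at >= %(since)s OR updated_at >= %(since)s;
"""

APPEND_SQL = """
INSERT INTO bronze.subscription
    (id, subscription_type, monthly_revenue, plan_duration, updated_at, loaded_at)
VALUES %s;
"""


def snapshot_to_bronze(oltp_dsn: str, dwh_dsn: str) -> None:
    since = date.today() - timedelta(days=1)  # rows touched since yesterday
    with psycopg2.connect(oltp_dsn) as src, src.cursor() as cur:
        cur.execute(SNAPSHOT_SQL, {"since": since})
        rows = cur.fetchall()
    with psycopg2.connect(dwh_dsn) as dwh, dwh.cursor() as cur:
        # Append the snapshot, stamping each row with the load date.
        execute_values(cur, APPEND_SQL, [(*row, date.today()) for row in rows])
```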
In the Silver stage of the data warehouse, all data transformation processes take place. PL/pgSQL is employed for processing, enhancing the speed and performance of transformations, particularly in environments where storage and processing are distinct.
Transformations in this context involve type casting, which converts character-based date information into date types. This process is crucial as it enables historical data analyses based on quarters, holidays, or other necessary date details. To facilitate this, the calendar table supplies a two-year range of dates. Additionally, the plan duration attribute is reformatted from a categorical format to a numerical one.
During the Silver Stage, data is flagged, and only valid data is structured into a Kimball-style star schema. The current valid data is then modeled into one transactional fact table and four dimensions.
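In the project itself this logic lives in PL/pgSQL; the sketch below only shows equivalent SQL run from Python to illustrate the casting, plan-duration conversion, and validity flagging. The column names (`join_date_raw`, `plan_months`, `is_valid`) and the input date format are assumptions.

```python
# Sketch of the Silver-stage transformations expressed as SQL run from Python;
# column names and the source date format are illustrative assumptions.
import psycopg2

SILVER_TRANSFORM_SQL = """
UPDATE silver.subscription
SET    join_date   = to_date(join_date_raw, 'DD-MM-YY'),        -- cast text to date (assumed format)
       plan_months = split_part(plan_duration, ' ', 1)::int,     -- '1 month' -> 1
       is_valid    = (id IS NOT NULL AND join_date_raw IS NOT NULL);  -- flag valid rows
"""


def transform_silver(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SILVER_TRANSFORM_SQL)
```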
The following pipelines are used for this purpose:
- to_oltp_database_flow
- to_bronze_flow
- to_dim_account_flow
- to_dim_calendar_flow
- to_dim_device_flow
- to_dim_subscription
- to_fct_sales_flow
First of all, the necessary libraries should be installed. For this purpose, you should create a virtual environment and install the requirements:
$ python -m venv venv
$ venv\Scripts\activate
$ pip install -r requirements.txt
You can use Docker Compose to set up both the source and data warehouse environments, complete with all the necessary tables and functions.
$ docker pull bitnami/postgresql:latest
$ docker pull adminer:latest
$ cd .\setup\
$ docker-compose up -d
| Field Name | Description | Type |
| --- | --- | --- |
| id | Unique identifier for each record. | Unique Identifier |
| subscription_type | Type of subscription (e.g., 'Basic', 'Premium', 'Standard'). | Categorical |
| monthly_revenue | Monthly revenue associated with different subscription plans. | Numerical List |
| join_date_range | Range of dates during which users joined the subscription service. | Date Range |
| last_payment_date_range | Range of dates indicating the period of the last payment made by users. | Date Range |
| cancel_date_range | Range of dates indicating the period during which users canceled their subscription (nullable). | Nullable Date Range |
| country | Country of residence for the user (e.g., 'Brazil', 'Italy', 'UK', 'US', 'Germany', 'Mexico', 'France'). | Categorical |
| age_range | Minimum and maximum age allowed for users. | Tuple (Minimum, Maximum) |
| gender | Gender of the user (e.g., 'Male', 'Female'). | Categorical |
| device | Device used by the user for accessing the subscription service. | Categorical |
| plan_duration | Duration of the subscription plans (e.g., '1 month', '6 month'). | Categorical |
| active_profiles | Number of active profiles associated with a user account. | Numerical List |
| household_profile_ind | Binary indicator (1 or 2) representing whether the user has multiple profiles within the household. | Binary |
| movies_watched | Count of movies watched by the user. | Numerical List |
| series_watched | Count of series watched by the user. | Numerical List |
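For orientation, a configuration instance matching the fields above might look like the following; every concrete value (revenue amounts, device names, date bounds, and so on) is only an example, not the project's actual setup.

```python
# Illustrative configuration matching the field dictionary above;
# all concrete values are examples, not the project's actual settings.
generation_config = {
    "id": "uuid",                                             # unique identifier per record
    "subscription_type": ["Basic", "Premium", "Standard"],    # categorical
    "monthly_revenue": [10, 12, 15],                          # numerical list
    "join_date_range": ("2021-01-01", "2023-06-30"),          # date range
    "last_payment_date_range": ("2023-06-01", "2023-07-01"),  # date range
    "cancel_date_range": ("2023-01-01", "2023-07-01"),        # nullable date range
    "country": ["Brazil", "Italy", "UK", "US", "Germany", "Mexico", "France"],
    "age_range": (18, 65),                                    # (minimum, maximum)
    "gender": ["Male", "Female"],
    "device": ["Smartphone", "Tablet", "Smart TV", "Laptop"], # example device values
    "plan_duration": ["1 month", "6 month"],                  # categorical, numerical in Silver
    "active_profiles": [1, 2, 3, 4, 5],                       # numerical list
    "household_profile_ind": [1, 2],                          # binary indicator
    "movies_watched": [0, 5, 10, 20],                         # numerical list
    "series_watched": [0, 2, 4, 8],                           # numerical list
}
```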