In this project, I designed and developed an ELT (Extract, Load, Transform) pipeline for a retail store, automating the data flow from raw data collection to insightful visualizations.
-
Data Collection:
OLTP Data: Data is extracted from the store’s OLTP database (RDS) and loaded into Snowflake using Airbyte. This process is scheduled to run daily.
S3 Data: Data is also collected from an S3 bucket and loaded into Snowflake using an AWS Lambda function. This Lambda function is orchestrated to run daily at 6 AM using Amazon EventBridge.
-
Data Transformation: The data in Snowflake is transformed using dbt (Data Build Tool) to align with business requirements. The transformation models are scheduled to run automatically through a cron job.
-
Data Visualization: The transformed data is visualized using Metabase. The dashboards and reports in Metabase are configured to refresh daily, ensuring that they reflect the most up-to-date data.
A retail store needs a Data Engineer to develop an ELT (Extract, Load, Transform) pipeline that extracts raw data from the company’s database, transforms it to meet business needs, and offers a platform for Data Analysts to create visualizations that address key business questions.
1. Identification of the highest- and lowest-performing items of the week by analyzing sales amounts and quantities.
2. Display of items with low supply levels for each week.
3. Detection of items experiencing low stock levels, along with their corresponding week and warehouse numbers, marked as "True".
4. Analysis of revenue.
5. Analysis of customers.
6. Analysis of warehouses based on low-stock items and revenue.
-
OLTP Data Ingestion (Airbyte): Airbyte was hosted on an EC2 instance and configured to ingest every table except the inventory table from RDS into Snowflake. Replication frequency: daily at 6:00 UTC. Sync mode: Full refresh.
-
S3 Data Ingestion (AWS Lambda): The inventory table is ingested from the S3 bucket using a custom Python function running on AWS Lambda. Orchestration: Amazon EventBridge (daily trigger at 6 AM).
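For context, such a Lambda function would typically execute a Snowflake load statement against the S3 data (for example via the snowflake-connector-python client). The sketch below is illustrative only; the stage, file format, table, and credential placeholders are assumptions rather than the project's actual objects.

```sql
-- Illustrative sketch of the load the Lambda function might run.
-- Stage, file format, and table names are assumptions.
CREATE FILE FORMAT IF NOT EXISTS csv_format
  TYPE = 'CSV'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

CREATE STAGE IF NOT EXISTS inventory_stage
  URL = 's3://<bucket-name>/inventory/'                            -- placeholder S3 path
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>') -- placeholder credentials
  FILE_FORMAT = csv_format;

-- Load the latest inventory files into the raw table
COPY INTO raw.inventory
  FROM @inventory_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_format')
  ON_ERROR = 'ABORT_STATEMENT';
```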
-
Data Transformation (dbt):
a. All tables are loaded into the staging layer.
b. Several customer-related tables are integrated into a single Customer table.
c. A snapshot of the Customer table is created in the intermediate layer for SCD Type 2 tracking (see the snapshot sketch after this list).
d. A daily sales fact table is created in the intermediate layer.
e. The customer dimension and weekly_sales_inventory models are built in the marts layer on top of the intermediate models.
f. The new fact table weekly_sales_inventory provides the following additional metrics (a SQL sketch of these calculations appears after this list):
sum_qty_wk: sum of sales_quantity for the current week.
sum_amt_wk: sum of sales_amount for the current week.
sum_profit_wk: sum of net_profit for the current week.
avg_qty_dy: average daily sales_quantity for the current week (sum_qty_wk / 7).
inv_on_hand_qty_wk: item inventory on hand at the end of each week across all warehouses (i.e., the inventory on hand at the end of the weekend).
wks_sply: weeks of supply, an estimate of how many weeks the on-hand inventory can cover sales (inv_on_hand_qty_wk / sum_qty_wk).
low_stock_flg_wk: weekly low-stock flag, set to True if, on any single day during the week, avg_qty_dy > 0 and avg_qty_dy > inv_on_hand_qty_wk.
g. Testing: generic tests such as not_null, unique, relationships, and accepted_values were applied to the relevant columns of the marts models (see the test sketch after this list).
h. Documentation: detailed column and table descriptions were added so that dbt could generate exhaustive documentation.
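To illustrate step c, a dbt snapshot capturing SCD Type 2 history for the Customer table could look like the sketch below. The unique key, timestamp column, and staging model name (stg_customer) are assumptions, not the project's actual schema.

```sql
-- snapshots/customer_snapshot.sql (illustrative sketch; names are assumed)
{% snapshot customer_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',      -- assumed business key
      strategy='timestamp',
      updated_at='updated_at'        -- assumed last-modified column
    )
}}

select * from {{ ref('stg_customer') }}   -- hypothetical staging model

{% endsnapshot %}
```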
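The weekly metrics in step f map naturally onto a weekly aggregation in SQL. The following is a minimal sketch of such a dbt model, assuming intermediate models named int_daily_sales and int_inventory and the column names shown; the actual project models likely differ (for example, by also carrying warehouse numbers).

```sql
-- models/marts/weekly_sales_inventory.sql (illustrative sketch; model/column names are assumed)
with daily_sales as (
    select * from {{ ref('int_daily_sales') }}
),

inventory as (
    select * from {{ ref('int_inventory') }}
),

weekly_sales as (
    select
        item_id,
        date_trunc('week', sold_date)   as week_start,
        sum(sales_quantity)             as sum_qty_wk,
        sum(sales_amount)               as sum_amt_wk,
        sum(net_profit)                 as sum_profit_wk,
        sum(sales_quantity) / 7         as avg_qty_dy
    from daily_sales
    group by 1, 2
),

weekly_inventory as (
    -- inventory on hand at the end of each week, summed across all warehouses
    select
        item_id,
        date_trunc('week', inventory_date)  as week_start,
        sum(inv_on_hand_qty)                as inv_on_hand_qty_wk
    from inventory
    where dayname(inventory_date) = 'Sun'   -- assumption: Sunday marks the end of the week
    group by 1, 2
)

select
    s.item_id,
    s.week_start,
    s.sum_qty_wk,
    s.sum_amt_wk,
    s.sum_profit_wk,
    s.avg_qty_dy,
    i.inv_on_hand_qty_wk,
    i.inv_on_hand_qty_wk / nullif(s.sum_qty_wk, 0)   as wks_sply,
    coalesce(s.avg_qty_dy > 0
             and s.avg_qty_dy > i.inv_on_hand_qty_wk, false) as low_stock_flg_wk
from weekly_sales s
left join weekly_inventory i
    on  s.item_id = i.item_id
    and s.week_start = i.week_start
```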
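The generic tests mentioned in step g are declared in the models' YAML files; dbt compiles each one into a SQL assertion that fails if it returns any rows. Roughly, they compile to queries like the ones below (the schema, model, and column names are assumed for illustration).

```sql
-- Roughly what the not_null test compiles to (names assumed)
select *
from analytics.customer_dim
where customer_id is null;

-- Roughly what the unique test compiles to (names assumed)
select customer_id, count(*) as n_records
from analytics.customer_dim
group by customer_id
having count(*) > 1;
```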
Data Analysis (Metabase): the mart models are fed into Metabase for analytics.
Table relationships and column semantic types are configured in Metabase to aid analysis.
Dashboards: these are built with SQL queries and the Metabase GUI, and are organized into tabs that answer the business questions, with relevant filters for drilling down.
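As an example, a Metabase question answering business question 1 (highest-performing items of the week) could be written against the weekly_sales_inventory mart along the lines below; the schema name, the week_start column, and the hard-coded week are assumptions, and in Metabase the week would normally come from a dashboard filter.

```sql
-- Top 10 items by weekly sales amount (illustrative; names and week are placeholders)
select
    item_id,
    sum_amt_wk,
    sum_qty_wk
from analytics.weekly_sales_inventory
where week_start = '2023-01-02'    -- placeholder week; usually a Metabase filter widget
order by sum_amt_wk desc
limit 10;
```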
Possible improvements:
a. Performance was measured by revenue; profit could also be considered to capture the effect of costs on the business.
b. Allow filtering by multiple criteria within each tab in Metabase.
c. Use Airbyte to ingest the data from the S3 bucket as well.
d. Perform more data quality tests.