Skip to content

Latest commit

 

History

History
78 lines (59 loc) · 5.67 KB

datadictionary.md

File metadata and controls

78 lines (59 loc) · 5.67 KB

1. Data Dictionary for the Output of the Query

Field Name Type Description
customer_id STRING Unique identifier for the customer. It is a primary key used to join different tables in the data queries.
grouped_category STRING A reformulated category name based on the sub_category of the item purchased. This field is derived from a case statement that maps sub_categories like 'DAIRY', 'MEAT', etc., to more descriptive names such as 'DAIRY PRODUCTS', 'FRESH MEAT', etc.
revenue_in_category DECIMAL The total revenue generated from sales within each grouped category for each customer over the specified fiscal year. This is calculated as the sum of gross sales.
purchases_in_category INTEGER The count of distinct purchase transactions (receipt numbers) within each grouped category for each customer over the specified fiscal year.
total_revenue DECIMAL (Joined from total_customer_revenue) The total revenue generated by each customer across all categories over the specified fiscal year. This value is the sum of gross sales for each customer.
total_purchases INTEGER (Joined from total_customer_purchases) The total number of distinct purchase transactions (receipt numbers) made by each customer over the specified fiscal year.

Data Dictionary for the Script

The script processes data from the "database_name"."sales_data_table" using SQL WITH clauses defined above and outputs the results to an AWS S3 bucket in Parquet format with Snappy compression. It specifically performs the following operations:

1. total_customer_revenue:

  • Purpose: Calculates the total revenue for each customer by summing up gross sales where the customer_id is not null, and transactions are within the fiscal year 2023-04-01 to 2024-03-31.

2. total_customer_purchases:

  • Purpose: Counts the number of distinct purchases (receipt numbers) made by each customer, ensuring transactions fall within the same fiscal year and customer_id is not null.

3. revenue_purchases_by_category:

  • Purpose: Groups sales data by reformulated categories to calculate total revenue and number of purchases within each category for each customer, using similar date and customer_id constraints.

The final SELECT operation:

  • Joins the results from these WITH clause tables to consolidate and report on the revenue and purchases by category alongside the total revenue and purchases for each customer. Output is then unloaded to an S3 bucket, formatted for efficiency and analytical processing.

2. Data Dictionary for the Spark Script

This data dictionary outlines the flow of data processing performed by the script using Apache Spark within an AWS Glue job.

Functions and Transformations

  1. Initialization:

    • Spark Context: Used to manage Spark job processes.
    • Glue Context: Initializes the context for AWS Glue jobs.
    • Spark Session: Provides a point of interaction with underlying Spark functionality.
    • Glue Job: Configuration and execution of the AWS Glue job.
  2. Data Loading:

    • Dynamic Frame Creation: Data is loaded into a dynamic frame from AWS Glue Catalog.
    • DataFrame Conversion: Converts Glue DynamicFrame to Spark DataFrame.
  3. Column Data Types:

    • customer_id: Identified as a string, key identifier for customer.
    • grouped_category: String type, represents categories aggregated from subcategories.
    • category_revenue: Float type, holds revenue data per category.
    • category_purchases: Float type, stores purchase count data per category.
  4. Metric Calculation (calculate_metrics function):

    • Total Revenue Per Customer: Aggregates revenue by customer.
    • Revenue Percentage: Calculates percentage of total revenue by category per customer.
    • Total Purchases Per Customer: Aggregates purchase counts by customer.
    • Purchase Percentage: Calculates percentage of total purchases by category per customer.
    • Composite Metric: Final metric combining revenue and purchase percentages, scaled by 100.
  5. Shopping Mission Selection (select_shopping_mission function):

    • Window Specification: Defines the partitioning and ordering of data for selection.
    • Row Number Application: Identifies the highest-ranking shopping mission per customer.
    • Data Filtering: Selects the top result per customer to determine the primary shopping mission.
    • Column Renaming: For clarity in final output.
  6. Data Output:

    • Data Storage: Results are stored in S3 in Parquet format.
    • Partitioning: Data is partitioned by 'Period' for organized storage.
    • Compression: Uses Snappy compression to optimize storage.

Final DataFrame Output Columns:

Output Column Name Description
customer Renamed 'customer_id', identifier for customer.
shopping_mission Renamed 'grouped_category', primary category of interest for the customer.
mission_revenue Renamed 'category_revenue', revenue from the primary shopping mission.
mission_purchases Renamed 'category_purchases', purchase count from the primary shopping mission.
mission_metric Renamed 'metric', calculated metric determining the significance of the shopping mission.

This script is structured to perform comprehensive data transformation and analysis, leveraging Spark's distributed computing capabilities within the AWS Glue environment to efficiently process and analyze large datasets.