Introduction

The purpose of this project is to receive invoice data from Message Hub and persist it to a DB2 Warehouse using IBM WDP Streaming Pipelines

Dependencies

This project has a dependency on:
- The deployment of the project kafka-producer-for-simulated-data
- On the dataset OnlineRetail.csv that is output from the dataset-generator project.

You have an IBM Cloud account
You have followed the instructions in the project dataset-generator to create a dataset, OnlineRetail.json.gz.
You have deployed the project kafka-producer-for-simulated-data to your IBM Cloud account
You have some knowledge of DB2 Warehouse on Cloud
You have some knowledge of IBM WDP Streaming Pipelines

None of these steps have been developed.

Extract dimension tables from OnlineRetail.csv:
- County: CountryID int | CountryName string
- Product: StockID int | StockCode string | Dscription string
- Time: TBC
- Customer: CustomerID int | CustomerName string (the input data only has customer id, so manually create customer (company) names or use a tool like faker: https://pypi.python.org/pypi/fake-factory/0.2)
- Store: StoreID int | StoreName string (the input date doesn't have storename - we could just use geographic place names, e.g. Bristol, London, ...)
Write python code to extract the above dimension tables and create the DB2 DDL and insert statements.
Write fact table DDL, e.g.
- TransactionID int | CountryID int | ProductID int | TimeID int | CustomerID int | StoreID int | Quantity int | UnitPrice decimal
Streaming Pipelines
- Create a pipeline: Message Hub -> Python Code -> DB2
- Python code:
  - In the init(state) method, lookup and hold in memory all of the above dimension tables
  - Map each record flowing through to retrieve the ID fields for each of the dimension tables

This project uses type 1 slowly changing dimensions. A future improvement will be to use more sophisticated slowly changing dimensions.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
LICENSE		LICENSE
README.md		README.md