From 8bdd40f2059d56a1710d6fce50462c4d715fe69f Mon Sep 17 00:00:00 2001
From: Alex Higgs
Date: Fri, 1 Nov 2019 14:09:31 +0000
Subject: [PATCH] Added performance note to worked example

---
 docs/staging.md       |  2 +-
 docs/workedexample.md | 24 ++++++++++++++++++++++--
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/docs/staging.md b/docs/staging.md
index 1dc18b445..8e18e9c6a 100644
--- a/docs/staging.md
+++ b/docs/staging.md
@@ -63,7 +63,7 @@ in our model.
 
 !!! note
     On line 3 below we are using a dbt source.
-    If you have not yet set up sources in your dbt configuration please refer to [setting up sources](gettingstarted.md#setting-up-sources).
+    If you have not yet set up sources in your dbt configuration please refer to [setting up sources](walkthrough.md#setting-up-sources).
 
 ```stg_customer_hashed.sql```
 
diff --git a/docs/workedexample.md b/docs/workedexample.md
index 4a59149e9..1ee127306 100644
--- a/docs/workedexample.md
+++ b/docs/workedexample.md
@@ -17,7 +17,6 @@ We will:
 - process the raw staging layer.
 - create a Data Vault with hubs, links and satellites using dbtvault and pre-written models.
 
-
 ## Pre-requisites
 
 These pre-requisites are separate from those found on the [getting started](walkthrough.md) page and will
@@ -37,4 +36,25 @@ be the only necessary requirements you will need to get started with the example
 
 !!! note
     We have provided a complete ```requirements.txt``` to install with ```pip install -r requirements.txt```
-    as a quick way of getting your Python environment set up. This file includes dbt and comes with the download in the next section.
\ No newline at end of file
+    as a quick way of getting your Python environment set up. This file includes dbt and comes with the download in the
+    next section.
+
+## Performance note
+
+Please be aware that table structures are simulated from the TPC-H dataset. The TPC-H dataset is a static view of data.
+
+Only a subset of the data contains dates which allows us to simulate daily feeds. The ```v_stg_orders``` orders view is
+filtered by date; unfortunately the ```v_stg_inventory``` view cannot be filtered by date, so it ends up being a feed of
+the entire contents of the view each cycle.
+
+This means that inventory related hubs, links and satellites are populated once during the initial load cycle with
+everything and later cycles insert 0 new records in their left outer joins.
+
+As the dataset increases in size, e.g. if you run with a larger TPC-H dataset (100, 1000 etc.) then be aware you are
+processing the entire inventory dataset each cycle, which results in unrepresentative load cycle times.
+
+Unfortunately it's the nature of the dataset; it will not be that way for other datasets. We will look at additional
+datasets in the future!
+
+If you are feeling adventurous you may disable the inventory feed (```raw_inventory``` and child models) to see a more
+accurate representation of performance.
\ No newline at end of file