Commit

Added performance note to worked example
Alex Higgs committed Nov 1, 2019
1 parent a2f95e3 commit 8bdd40f
Showing 2 changed files with 23 additions and 3 deletions.
2 changes: 1 addition & 1 deletion docs/staging.md
@@ -63,7 +63,7 @@ in our model.
!!! note
On line 3 below we are using a dbt source.

If you have not yet set up sources in your dbt configuration please refer to [setting up sources](gettingstarted.md#setting-up-sources).
If you have not yet set up sources in your dbt configuration please refer to [setting up sources](walkthrough.md#setting-up-sources).


```stg_customer_hashed.sql```
24 changes: 22 additions & 2 deletions docs/workedexample.md
@@ -17,7 +17,6 @@ We will:
- process the raw staging layer.
- create a Data Vault with hubs, links and satellites using dbtvault and pre-written models.


## Pre-requisites

These pre-requisites are separate from those found on the [getting started](walkthrough.md) page and will
@@ -37,4 +36,25 @@ be the only necessary requirements you will need to get started with the example

!!! note
We have provided a complete ```requirements.txt``` to install with ```pip install -r requirements.txt```
as a quick way of getting your Python environment set up. This file includes dbt and comes with the download in the next section.
as a quick way of getting your Python environment set up. This file includes dbt and comes with the download in the
next section.

## Performance note

Please be aware that table structures are simulated from the TPC-H dataset. The TPC-H dataset is a static view of data.

Only a subset of the data contains dates, and only that subset lets us simulate daily feeds. The ```v_stg_orders``` view
is filtered by date. Unfortunately, the ```v_stg_inventory``` view cannot be filtered by date, so it feeds the entire
contents of the view each cycle.

This means that inventory-related hubs, links and satellites are populated with everything once, during the initial
load cycle, and later cycles insert 0 new records in their left outer joins.

As the dataset increases in size, e.g. if you run with a larger TPC-H dataset (100, 1000 etc.), be aware that you are
processing the entire inventory dataset each cycle, which results in unrepresentative load cycle times.

Unfortunately this is the nature of the dataset; it will not be that way for other datasets. We will look at additional
datasets in the future!

If you are feeling adventurous you may disable the inventory feed (```raw_inventory``` and child models) to see a more
accurate representation of performance.
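
The behaviour described in the performance note can be reproduced in miniature. The following is a rough sketch only, using Python's built-in `sqlite3` module rather than dbt or dbtvault. The table and column names (`stg_inventory`, `hub_inventory`, `inventory_pk`) are hypothetical, and the SQL is a generic "insert keys not already present" pattern, not the SQL that dbtvault generates. It assumes the staging data is identical on every cycle, as with the unfiltered inventory view.

```python
import sqlite3


def load_hub(conn):
    """Insert staged keys that are missing from the hub; return rows inserted."""
    # Hypothetical load pattern: a left outer join against the target,
    # keeping only staged rows that found no match in the hub.
    cur = conn.execute(
        """INSERT INTO hub_inventory (inventory_pk)
        SELECT DISTINCT stg.inventory_pk
        FROM stg_inventory AS stg
        LEFT OUTER JOIN hub_inventory AS hub
            ON stg.inventory_pk = hub.inventory_pk
        WHERE hub.inventory_pk IS NULL"""
    )
    conn.commit()
    return cur.rowcount


def simulate_cycles(n_rows=1000, n_cycles=3):
    """Run several load cycles over an unchanging staging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_inventory (inventory_pk INTEGER)")
    conn.execute("CREATE TABLE hub_inventory (inventory_pk INTEGER PRIMARY KEY)")
    # The staging table is never filtered by date, so every cycle
    # reads all n_rows again.
    conn.executemany(
        "INSERT INTO stg_inventory VALUES (?)", ((i,) for i in range(n_rows))
    )
    inserted_per_cycle = [load_hub(conn) for _ in range(n_cycles)]
    conn.close()
    return inserted_per_cycle


if __name__ == "__main__":
    print(simulate_cycles())  # rows inserted in each cycle
```

In this sketch the first cycle inserts every row and each later cycle inserts none, yet every cycle still scans and joins the whole staging table. That is why the cost per cycle grows with the dataset size even though no new records arrive.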
