Altered PIT statement for performance improvements #224
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Rewriting the PIT Macro to increase performance in Databricks
Background
At our company we recently started adding PIT tables to our BDV layer, and have found that some of our larger, core concepts were having quite fatal performance issues.
Our Customer concept formed a PIT for 7 satellites, and was taking about 3.5 hours to write the table in Databricks (~200M rows), and another concept which tried to form a PIT for 18 satellites ran for 16 hours before we had to kill it (never completed - ~84M rows).
Changes
The way the PIT macro is currently set up joins the hub + date spline to each satellite on several rows, and then aggregates over these rows to find whether a key + date exist in this range. For large tables this causes memory issues and in our particular databricks setup causes data to spill to disk. The proposed change creates a CTE for each satellite which adds a "next" LDTS field (using a LEAD function) so that the join from hub + date spline to the satellites produces the exact row (if any) which matches the criteria of key = key and as_of_date between sat.ldts and sat.next_ldts.
Results
In practice we have found that this change lead to a dramatic improvement in performance. The PIT for our customer concept which previous took ~3.5hrs is now completing in just over 5minutes (~40x speedup), and the PIT for our other concept which never finished (but ran for at least 16hrs) now finishes in ~35min.
N.B. I haven't gone to the effort of rewriting this for other platforms, but hopefully a similar structure is just as useful for them too. Please reach out if you want to discuss!