Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Altered PIT statement for performance improvements #224

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

adamgrbac
Copy link

Rewriting the PIT Macro to increase performance in Databricks

Background

At our company we recently started adding PIT tables to our BDV layer, and have found that some of our larger, core concepts were having quite fatal performance issues.

Our Customer concept formed a PIT for 7 satellites, and was taking about 3.5 hours to write the table in Databricks (~200M rows), and another concept which tried to form a PIT for 18 satellites ran for 16 hours before we had to kill it (never completed - ~84M rows).

Changes

The way the PIT macro is currently set up joins the hub + date spline to each satellite on several rows, and then aggregates over these rows to find whether a key + date exist in this range. For large tables this causes memory issues and in our particular databricks setup causes data to spill to disk. The proposed change creates a CTE for each satellite which adds a "next" LDTS field (using a LEAD function) so that the join from hub + date spline to the satellites produces the exact row (if any) which matches the criteria of key = key and as_of_date between sat.ldts and sat.next_ldts.

Results

In practice we have found that this change lead to a dramatic improvement in performance. The PIT for our customer concept which previous took ~3.5hrs is now completing in just over 5minutes (~40x speedup), and the PIT for our other concept which never finished (but ran for at least 16hrs) now finishes in ~35min.

N.B. I haven't gone to the effort of rewriting this for other platforms, but hopefully a similar structure is just as useful for them too. Please reach out if you want to discuss!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant