
optimize the performance of list_relations_without_caching #342

Conversation

@TalkWIthKeyboard commented May 3, 2022

resolves #228

Description

Runs two concise queries, `show views in <database>` and `show tables in <database>`, rather than one verbose one, `show tables extended in <database> like '*'`.
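The split described above can be sketched in Python (a hypothetical helper, not the PR's actual Jinja/SQL implementation). In Spark, `show tables` lists both tables and views, so the `show views` result is enough to classify each relation:

```python
# Hypothetical sketch: combine the results of `show views in <db>` and
# `show tables in <db>` to classify each relation, information that
# `show tables extended ... like '*'` previously returned in one slow query.

def classify_relations(tables, views):
    """Return {name: 'view' | 'table'} given the two query results.

    In Spark, `show tables in <db>` lists both tables and views, so any
    name absent from the `show views` result must be a plain table.
    """
    view_names = set(views)
    return {name: ("view" if name in view_names else "table") for name in tables}

relations = classify_relations(
    tables=["orders", "customers", "daily_revenue"],
    views=["daily_revenue"],
)
```

This avoids fetching extended metadata (schema, statistics, etc.) for every relation when only names and types are needed for the cache.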

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-spark next" section.

@cla-bot bot commented May 3, 2022

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @TalkWIthKeyboard

@jtcohen6 (Contributor) left a comment

Thanks for taking the initial stab here @TalkWIthKeyboard!

The blocker to running this in CI right now is that the integration-spark-thrift step is still running containerized Spark v2.

dbt/include/spark/macros/adapters.sql (resolved)
dbt/adapters/spark/impl.py (resolved)
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from a9f5573 to 7b92fab on May 4, 2022
cla-bot added the cla:yes label on May 4, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch 17 times, most recently from 0b30719 to 3acde77, on May 7, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from ed557b9 to bf5131a on May 15, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from bf5131a to 345239b on May 15, 2022
@TalkWIthKeyboard requested a review from @jtcohen6 on May 15, 2022
jtcohen6 added a commit to dbt-labs/dbt-core that referenced this pull request May 15, 2022
@jtcohen6 (Contributor) left a comment

@TalkWIthKeyboard This is very, very cool!!

I need to spend some more time diving in. Just for now, I very quickly played around with extending this idea to dbt-core + other adapters: dbt-labs/dbt-core@6862529

Big idea: What if adapter.get_columns_in_relation could always either hit or update the cache? This could be massively useful for materializations that require running get_columns_in_relation over and over, e.g. dbt-labs/dbt-core#2392.

We'd need to handle cache invalidation in a couple of places:

  • When a model actually runs, it returns a set of relations to update in the cache, which properly wipes stale columns (from before the model build). So I think this is actually okay?
  • Macros such as alter_column_type and alter_relation_add_remove_columns would need to invalidate or update the cached columns
  • After a model runs, we could always describe table to cache its columns, plus table/column statistics and other profiling info if we so chose. This would allow us to make "catalog info" available in near-real-time, rather than waiting for docs generate

That's all out of scope for this particular PR + initiative, but it's very exciting to think about this as a step on the way there.
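The "always either hit or update the cache" idea above can be roughly illustrated as follows. This is a sketch under simplified assumptions; the class and names here are hypothetical, not dbt's actual internals:

```python
# Illustrative sketch: a column cache that get_columns_in_relation either
# hits or populates, with explicit invalidation for schema-changing macros
# such as alter_column_type / alter_relation_add_remove_columns.

class ColumnCache:
    def __init__(self, fetch):
        self._fetch = fetch      # callable: relation name -> list of columns
        self._columns = {}       # relation name -> cached column list
        self.misses = 0          # how many times we had to run a real query

    def get_columns_in_relation(self, relation):
        if relation not in self._columns:
            self.misses += 1
            self._columns[relation] = self._fetch(relation)
        return self._columns[relation]

    def invalidate(self, relation):
        # Must be called by anything that changes a relation's schema,
        # otherwise subsequent reads return stale columns.
        self._columns.pop(relation, None)

cache = ColumnCache(fetch=lambda rel: ["id", "name"])
cache.get_columns_in_relation("my_table")   # miss: runs the metadata query
cache.get_columns_in_relation("my_table")   # hit: no query issued
cache.invalidate("my_table")                # schema changed by some DDL macro
cache.get_columns_in_relation("my_table")   # miss again: re-fetches
```

The key point is the second bullet above: every macro that alters a relation's schema needs a corresponding `invalidate` (or update) call, or the cache silently drifts from reality.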

dbt/adapters/spark/relation.py (outdated, resolved)
dbt/adapters/spark/impl.py (outdated, resolved)
@TalkWIthKeyboard (Author) replied:

> That's all out of scope for this particular PR + initiative, but it's very exciting to think about this as a step on the way there.

This sounds like a lot of work, but it would be a huge improvement to cache performance. I would like to continue working on it once this PR is done.

@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch 3 times, most recently from 12e4f1b to 931e3d2, on May 21, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from 931e3d2 to 24e223c on May 21, 2022
{% macro spark__list_relations_without_caching(relation) %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
show table extended in {{ relation }} like '*'
{% macro spark__list_tables_without_caching(relation) %}
A Contributor commented:

Could you use the dispatch pattern?

{% macro list_tables_without_caching(relation) %}
  {{ return(adapter.dispatch('list_tables_without_caching', 'dbt')(relation)) }}
{%- endmacro -%}

{% macro spark__list_tables_without_caching(relation) %}
...

The Author replied:

Sure, thanks for your comment. I will update it soon.


{% do return(load_result('list_relations_without_caching').table) %}
{% macro spark__list_views_without_caching(relation) %}
A Contributor commented:

ditto.

@TalkWIthKeyboard requested a review from @jtcohen6 on May 24, 2022
@jtcohen6 (Contributor) commented:

This is very very cool work!

Just to clarify + set expectations, my current understanding is that we're blocked on merging this PR until we can test with Spark3 for all connection methods in CI (#349). That's a battle we're continuing to fight. If we can figure that out, we can unblock this change for inclusion in a forthcoming version of dbt-spark.

@jtcohen6 added the ready_for_review label (Externally contributed PR has functional approval, ready for code review from Core engineering) on Jun 15, 2022
@nssalian (Contributor) commented Jun 20, 2022

@TalkWIthKeyboard , thanks for doing this work. #349 looks to be ready. Feel free to rebase (after it merges) to help support your work here.

@jtcohen6 (Contributor) commented:

^ just merged #349, hopefully this is able to pass its tests!

@jtcohen6 (Contributor) left a comment

@TalkWIthKeyboard There's still a lot of interest in this PR!

Are you still able/interested in getting this merged, now that tests on Spark v3 are running in the main branch? Or should one of us try to take it over, and see it across the finish line?

(Review comment on diff hunk @@ -216,60 +271,72 @@ in get_columns_in_relation)

A Contributor commented:

Given the issues we've seen around column caching (#431), I think we need to fully disable pulling from the cache in get_columns_in_relation. That is, until we're ready to roll out column-level caching as a real feature. That would require us to add cache invalidation logic to macros such as alter_column_type and alter_relation_add_remove_columns.
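The failure mode motivating this comment can be shown with a minimal sketch, assuming a simplified cache (all names here are illustrative, not dbt's real API): without invalidation, a cached read keeps returning the pre-ALTER schema.

```python
# Minimal sketch of stale column caching: a DDL change (simulating
# alter_relation_add_remove_columns) happens after the first fetch, and
# the cached read never sees it.

cache = {}

def get_columns_in_relation(relation, fetch, use_cache=True):
    if use_cache and relation in cache:
        return cache[relation]       # may be stale after schema-changing DDL
    cols = fetch(relation)
    cache[relation] = cols
    return cols

live_schema = {"t": ["id"]}
fetch = lambda rel: list(live_schema[rel])

get_columns_in_relation("t", fetch)         # caches ["id"]
live_schema["t"].append("email")            # schema changes behind the cache
stale = get_columns_in_relation("t", fetch)                  # still ["id"]
fresh = get_columns_in_relation("t", fetch, use_cache=False) # ["id", "email"]
```

Disabling the cached read (the `use_cache=False` path here) trades the repeated-query cost for correctness until invalidation hooks exist in the schema-changing macros.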

jtcohen6 pushed a commit that referenced this pull request Aug 19, 2022
@jtcohen6 mentioned this pull request Aug 19, 2022
@github-actions bot commented:
This PR has been marked as Stale because it has been open for 180 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR, or it will be closed in 7 days.

github-actions bot added the Stale label on Feb 15, 2023
github-actions bot closed this on Feb 23, 2023
Labels: cla:yes, ready_for_review (Externally contributed PR has functional approval, ready for code review from Core engineering), Stale
Development

Successfully merging this pull request may close these issues.

[CT-202] Workaround for some limitations due to list_relations_without_caching method
7 participants