
optimize the performance of list_relations_without_caching #342

Conversation

@TalkWIthKeyboard commented May 3, 2022

resolves #228

Description

Runs two concise queries, `show views in <database>` and `show tables in <database>`, rather than one verbose one, `show tables extended in <database> like '*'`.
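The split described above can be sketched in Python (a hypothetical helper, not the PR's actual Jinja/SQL implementation). In Spark, `show tables` lists both tables and views, so the `show views` result is enough to classify each relation:

```python
# Hypothetical sketch: combine the results of `show views in <db>` and
# `show tables in <db>` to classify each relation, information that
# `show tables extended ... like '*'` previously returned in one slow query.

def classify_relations(tables, views):
    """Return {name: 'view' | 'table'} given the two query results.

    In Spark, `show tables in <db>` lists both tables and views, so any
    name absent from the `show views` result must be a plain table.
    """
    view_names = set(views)
    return {name: ("view" if name in view_names else "table") for name in tables}

relations = classify_relations(
    tables=["orders", "customers", "daily_revenue"],
    views=["daily_revenue"],
)
```

This avoids fetching extended metadata (schema, statistics, etc.) for every relation when only names and types are needed for the cache.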

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-spark next" section.

@cla-bot bot commented May 3, 2022

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @TalkWIthKeyboard

@jtcohen6 (Contributor) left a comment

Thanks for taking the initial stab here @TalkWIthKeyboard!

The blocker to running this in CI right now is that the integration-spark-thrift step is still running containerized Spark v2.

dbt/include/spark/macros/adapters.sql (resolved)
dbt/adapters/spark/impl.py (resolved)
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from a9f5573 to 7b92fab on May 4, 2022
cla-bot added the cla:yes label on May 4, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch 17 times, most recently from 0b30719 to 3acde77, on May 7, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from ed557b9 to bf5131a on May 15, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from bf5131a to 345239b on May 15, 2022
@TalkWIthKeyboard requested a review from @jtcohen6 on May 15, 2022
jtcohen6 added a commit to dbt-labs/dbt-core that referenced this pull request May 15, 2022
@jtcohen6 (Contributor) left a comment

@TalkWIthKeyboard This is very, very cool!!

I need to spend some more time diving in. Just for now, I very quickly played around with extending this idea to dbt-core + other adapters: dbt-labs/dbt-core@6862529

Big idea: What if adapter.get_columns_in_relation could always either hit or update the cache? This could be massively useful for materializations that require running get_columns_in_relation over and over, e.g. dbt-labs/dbt-core#2392.

We'd need to handle cache invalidation in a couple of places:

  • When a model actually runs, it returns a set of relations to update in the cache, which properly wipes stale columns (from before the model build). So I think this is actually okay?
  • Macros such as alter_column_type and alter_relation_add_remove_columns would need to invalidate or update the cached columns
  • After a model runs, we could always describe table to cache its columns, plus table/column statistics and other profiling info if we so chose. This would allow us to make "catalog info" available in near-real-time, rather than waiting for docs generate

That's all out of scope for this particular PR + initiative, but it's very exciting to think about this as a step on the way there.
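The "always either hit or update the cache" idea above can be roughly illustrated as follows. This is a sketch under simplified assumptions; the class and names here are hypothetical, not dbt's actual internals:

```python
# Illustrative sketch: a column cache that get_columns_in_relation either
# hits or populates, with explicit invalidation for schema-changing macros
# such as alter_column_type / alter_relation_add_remove_columns.

class ColumnCache:
    def __init__(self, fetch):
        self._fetch = fetch      # callable: relation name -> list of columns
        self._columns = {}       # relation name -> cached column list
        self.misses = 0          # how many times we had to run a real query

    def get_columns_in_relation(self, relation):
        if relation not in self._columns:
            self.misses += 1
            self._columns[relation] = self._fetch(relation)
        return self._columns[relation]

    def invalidate(self, relation):
        # Must be called by anything that changes a relation's schema,
        # otherwise subsequent reads return stale columns.
        self._columns.pop(relation, None)

cache = ColumnCache(fetch=lambda rel: ["id", "name"])
cache.get_columns_in_relation("my_table")   # miss: runs the metadata query
cache.get_columns_in_relation("my_table")   # hit: no query issued
cache.invalidate("my_table")                # schema changed by some DDL macro
cache.get_columns_in_relation("my_table")   # miss again: re-fetches
```

The key point is the second bullet above: every macro that alters a relation's schema needs a corresponding `invalidate` (or update) call, or the cache silently drifts from reality.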

dbt/adapters/spark/relation.py (outdated, resolved)
dbt/adapters/spark/impl.py (outdated, resolved)
@TalkWIthKeyboard (Author) replied:

> That's all out of scope for this particular PR + initiative, but it's very exciting to think about this as a step on the way there.

This sounds like a lot of work, but it would be a huge improvement to cache performance. I would like to continue working on it once this PR is done.

@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch 3 times, most recently from 12e4f1b to 931e3d2, on May 21, 2022
@TalkWIthKeyboard force-pushed the talkwithkeyboard/faster-caching branch from 931e3d2 to 24e223c on May 21, 2022
{% macro spark__list_relations_without_caching(relation) %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
show table extended in {{ relation }} like '*'
{% macro spark__list_tables_without_caching(relation) %}
A Contributor commented:

Could you use the dispatch pattern?

{% macro list_tables_without_caching(relation) %}
  {{ return(adapter.dispatch('list_tables_without_caching', 'dbt')(relation)) }}
{%- endmacro -%}

{% macro spark__list_tables_without_caching(relation) %}
...

The Author replied:

Sure, thanks for your comment. I will update it soon.


{% do return(load_result('list_relations_without_caching').table) %}
{% macro spark__list_views_without_caching(relation) %}
A Contributor commented:

ditto.

@TalkWIthKeyboard requested a review from @jtcohen6 on May 24, 2022
@jtcohen6 (Contributor) commented:

This is very very cool work!

Just to clarify + set expectations, my current understanding is that we're blocked on merging this PR until we can test with Spark3 for all connection methods in CI (#349). That's a battle we're continuing to fight. If we can figure that out, we can unblock this change for inclusion in a forthcoming version of dbt-spark.

@jtcohen6 added the ready_for_review label (Externally contributed PR has functional approval, ready for code review from Core engineering) on Jun 15, 2022
@nssalian (Contributor) commented Jun 20, 2022

@TalkWIthKeyboard , thanks for doing this work. #349 looks to be ready. Feel free to rebase (after it merges) to help support your work here.

@jtcohen6 (Contributor) commented:

^ just merged #349, hopefully this is able to pass its tests!

@jtcohen6 (Contributor) left a comment

@TalkWIthKeyboard There's still a lot of interest in this PR!

Are you still able/interested in getting this merged, now that tests on Spark v3 are running in the main branch? Or should one of us try to take it over, and see it across the finish line?

(Review comment on diff hunk @@ -216,60 +271,72 @@ in get_columns_in_relation)

A Contributor commented:

Given the issues we've seen around column caching (#431), I think we need to fully disable pulling from the cache in get_columns_in_relation. That is, until we're ready to roll out column-level caching as a real feature. That would require us to add cache invalidation logic to macros such as alter_column_type and alter_relation_add_remove_columns.
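The failure mode motivating this comment can be shown with a minimal sketch, assuming a simplified cache (all names here are illustrative, not dbt's real API): without invalidation, a cached read keeps returning the pre-ALTER schema.

```python
# Minimal sketch of stale column caching: a DDL change (simulating
# alter_relation_add_remove_columns) happens after the first fetch, and
# the cached read never sees it.

cache = {}

def get_columns_in_relation(relation, fetch, use_cache=True):
    if use_cache and relation in cache:
        return cache[relation]       # may be stale after schema-changing DDL
    cols = fetch(relation)
    cache[relation] = cols
    return cols

live_schema = {"t": ["id"]}
fetch = lambda rel: list(live_schema[rel])

get_columns_in_relation("t", fetch)         # caches ["id"]
live_schema["t"].append("email")            # schema changes behind the cache
stale = get_columns_in_relation("t", fetch)                  # still ["id"]
fresh = get_columns_in_relation("t", fetch, use_cache=False) # ["id", "email"]
```

Disabling the cached read (the `use_cache=False` path here) trades the repeated-query cost for correctness until invalidation hooks exist in the schema-changing macros.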

jtcohen6 pushed a commit that referenced this pull request Aug 19, 2022
@jtcohen6 mentioned this pull request Aug 19, 2022
@github-actions bot commented:
This PR has been marked as Stale because it has been open for 180 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR, or it will be closed in 7 days.

github-actions bot added the Stale label on Feb 15, 2023
github-actions bot closed this on Feb 23, 2023
Labels: cla:yes, ready_for_review (Externally contributed PR has functional approval, ready for code review from Core engineering), Stale
Development

Successfully merging this pull request may close these issues.

[CT-202] Workaround for some limitations due to list_relations_without_caching method
7 participants