optimize the performance of list_relations_without_caching #342
Conversation
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA. In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin. CLA has not been signed by users: @TalkWIthKeyboard
Thanks for taking the initial stab here @TalkWIthKeyboard!
The blocker to running this in CI right now is the fact that the integration-spark-thrift step is still running containerized Spark v2.
@TalkWIthKeyboard This is very, very cool!!
I need to spend some more time diving in. Just for now, I very quickly played around with extending this idea to dbt-core + other adapters: dbt-labs/dbt-core@6862529
Big idea: What if adapter.get_columns_in_relation could always either hit or update the cache? This could be massively useful for materializations that require running get_columns_in_relation over and over, e.g. dbt-labs/dbt-core#2392. (A rough sketch of that idea follows the list below.)
We'd need to handle cache invalidation in a couple of places:
- When a model actually runs, it returns a set of relations to update in the cache, which properly wipes stale columns (from before the model build). So I think this is actually okay?
- Macros such as alter_column_type and alter_relation_add_remove_columns would need to invalidate or update the cached columns
- After a model runs, we could always describe table to cache its columns, plus table/column statistics and other profiling info if we so chose. This would allow us to make "catalog info" available in near-real-time, rather than waiting for docs generate
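To make the "big idea" above concrete, here is a rough Python sketch of an adapter-level column cache that get_columns_in_relation could hit or update, with an invalidation hook that the post-run path or DDL macros could call. Everything in it is illustrative: the class, _column_cache, _key, and invalidate_columns are invented names, not dbt-core's actual cache API.

```python
from typing import Dict, List, Tuple


class ColumnCachingAdapterSketch:
    """Illustrative only: invented names, not dbt-core's actual cache API."""

    def __init__(self) -> None:
        # Cached column metadata, keyed by (database, schema, identifier)
        self._column_cache: Dict[Tuple[str, str, str], List[dict]] = {}

    def _key(self, relation) -> Tuple[str, str, str]:
        return (relation.database, relation.schema, relation.identifier)

    def get_columns_in_relation(self, relation) -> List[dict]:
        cached = self._column_cache.get(self._key(relation))
        if cached is not None:
            return cached  # cache hit: skip the metadata query entirely
        columns = self._describe_columns(relation)  # e.g. a `describe table` call
        self._column_cache[self._key(relation)] = columns  # update the cache on a miss
        return columns

    def invalidate_columns(self, relation) -> None:
        # Called after a model rebuilds its relation, or by macros such as
        # alter_column_type / alter_relation_add_remove_columns.
        self._column_cache.pop(self._key(relation), None)

    def _describe_columns(self, relation) -> List[dict]:
        raise NotImplementedError  # run the real metadata query here
```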
That's all out of scope for this particular PR + initiative, but it's very exciting to think about this as a step on the way there.
This sounds like a lot of work, but it would be a huge improvement to the performance of the cache. I would like to continue working on this when this current PR is done.
{% macro spark__list_relations_without_caching(relation) %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
show table extended in {{ relation }} like '*'
{% macro spark__list_tables_without_caching(relation) %}
Could you use the dispatch pattern?
{% macro list_tables_without_caching(relation) %}
{{ return(adapter.dispatch('list_tables_without_caching', 'dbt')(relation)) }}
{%- endmacro -%}
{% macro spark__list_tables_without_caching(relation) %}
...
Sure, thanks for your comment, I will update it soon.
{% do return(load_result('list_relations_without_caching').table) %}
{% macro spark__list_views_without_caching(relation) %} |
ditto.
This is very, very cool work! Just to clarify + set expectations, my current understanding is that we're blocked on merging this PR until we can test with Spark3 for all connection methods in CI (#349). That's a battle we're continuing to fight. If we can figure that out, we can unblock this change for inclusion in a forthcoming version of dbt-spark.
@TalkWIthKeyboard, thanks for doing this work. #349 looks to be ready. Feel free to rebase (after it merges) to help support your work here.
^ just merged #349, hopefully this is able to pass its tests!
@TalkWIthKeyboard There's still a lot of interest in this PR!
Are you still able/interested in getting this merged, now that tests on Spark v3 are running in the main branch? Or should one of us try to take it over and see it across the finish line?
@@ -216,60 +271,72 @@ def get_columns_in_relation(self, relation: Relation) -> List[SparkColumn]:
            ),
            None,
        )
Given the issues we've seen around column caching (#431), I think we need to fully disable pulling from the cache in get_columns_in_relation. That is, until we're ready to roll out column-level caching as a real feature. That would require us to add cache invalidation logic to macros such as alter_column_type and alter_relation_add_remove_columns.
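As a rough illustration of that interim state (hypothetical helper names, not the adapter's actual code), the read side of the cache would simply be bypassed, while a write-through could optionally remain:

```python
# Sketch only: _describe_columns, _column_cache, and _key are invented names.
def get_columns_in_relation(self, relation):
    columns = self._describe_columns(relation)         # always hits the warehouse
    self._column_cache[self._key(relation)] = columns  # optional write-through
    return columns
```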
This PR has been marked as Stale because it has been open for 180 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR, or it will be closed in 7 days.
resolves #228
Description
Running two concise queries, show views in <database> and show tables in <database>, rather than the one verbose one, show tables extended in <database> like '*'.
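For context, a minimal Python sketch of the approach (the PR itself implements it through Jinja macros such as spark__list_tables_without_caching and spark__list_views_without_caching; the run_sql callable and exact column positions below are assumptions, and show views requires Spark 3):

```python
# Minimal sketch, not the PR's actual implementation.
# Assumes run_sql(sql) returns rows as tuples in Spark's documented column order.
def list_relations_sketch(run_sql, database):
    tables = run_sql(f"show tables in {database}")  # lists views as well as tables
    views = run_sql(f"show views in {database}")    # rows: (namespace, viewName, isTemporary)
    view_names = {row[1] for row in views}

    relations = []
    for row in tables:                              # rows: (database, tableName, isTemporary)
        name = row[1]
        rel_type = "view" if name in view_names else "table"
        relations.append((database, name, rel_type))
    return relations
```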
Checklist
- I have updated the CHANGELOG.md and added information about my change to the "dbt-spark next" section.