-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make asset backfill eligibility logic subset-aware instead of iterating through partition keys individually #25997
base: master
Are you sure you want to change the base?
Conversation
de05d19
to
3af0522
Compare
5e8878c
to
7d1ff8b
Compare
7ad79e0
to
04f3df1
Compare
0e17d76
to
7a03d4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broadly, I think this is looking really good, but this is a big change to a complicated bit of machinery, so I'm being a bit pickier about naming / code structure just to make sure this is easier to interact with in the future
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Outdated
Show resolved
Hide resolved
serializable_entity_subset = entity_subset.convert_to_serializable_subset() | ||
|
||
serializable_entity_subsets = [ | ||
SerializableEntitySubset(key=asset_key, value=serializable_entity_subset.value) | ||
for asset_key in execution_set_keys | ||
] | ||
|
||
entity_subsets = [ | ||
check.not_none( | ||
self._asset_graph_view.get_subset_from_serializable_subset( | ||
serializable_entity_subset | ||
) | ||
) | ||
for serializable_entity_subset in serializable_entity_subsets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
painge, but not really a better option here. could be nice to extract this into a utility (DA has to do something similar), but that doesn't need to be in this PR
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/asset_graph_view/asset_graph_view.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
116be3e
to
fa638ec
Compare
4c10f54
to
7c76dc2
Compare
a2ec297
to
a20c4cb
Compare
Summary: Pulling out the new BFS utility added in #25997 to its own function with its own tests to keep the PR size manageable. Test Plan: New tests NOCHANGELOG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved the new BFS method into #26751 to make this PR just about backfills and using that method
b93ead9
to
41e786b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very exciting!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have one test that would be good to add, but otherwise looks good!
repository = code_location.get_repository("test_repo") | ||
asset_graph = repository.asset_graph | ||
_execute_asset_backfill_iteration_no_side_effects(graphql_context, backfill_id, asset_graph) | ||
_execute_asset_backfill_iteration_no_side_effects(graphql_context, backfill_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this cleanup is really nice
python_modules/dagster/dagster_tests/core_tests/execution_tests/test_asset_backfill.py
Show resolved
Hide resolved
python_modules/dagster/dagster/_core/execution/asset_backfill.py
Outdated
Show resolved
Hide resolved
Summary: Pulling out the new BFS utility added in #25997 to its own function with its own tests to keep the PR size manageable. Test Plan: New tests NOCHANGELOG
Summary: Pulling out the new BFS utility added in #25997 to its own function with its own tests to keep the PR size manageable. Test Plan: New tests NOCHANGELOG ## Summary & Motivation ## How I Tested These Changes ## Changelog > Insert changelog entry or delete this section.
refactor > Insert changelog entry or delete this section. moar refactor > Insert changelog entry or delete this section. subsets instead of values > Insert changelog entry or delete this section. more progress on subsets instead of values > Insert changelog entry or delete this section. more progress > Insert changelog entry or delete this section. yet more progress > Insert changelog entry or delete this section. more progress > Insert changelog entry or delete this section. more > Insert changelog entry or delete this section. renames > Insert changelog entry or delete this section. final names? > Insert changelog entry or delete this section. owen feedback > Insert changelog entry or delete this section.
I saw this comment here about the order in which runs are launched by the backfill daemon #26207 (reply in thread) - wondering if I need to worry at all about that behavior changing / if we were relying on that sort key (which is no longer present) anywhere? |
Looks like we do some additional ordering of the run requests here for single-partition runs, but dont have any similar sorting for ranged-partition runs I'm not sure the sort key was doing anything for the resulting request order. The asset partitions to request that are returned from the BFS search are returned as a set so we'd have no guarantees about order anyway. I think if we wanted to make guarantees about the order run requests are launched, we'd need to do it at the time of constructing the run requests. could still be worth writing a test though |
Summary & Motivation
This PR changes the core asset backfill eligibility logic to make it work with subsets instead of individual asset partition keys. This makes things more complicated (instead of returning a boolean + string in the bfs logic, we have to return the subset that passed and the subsets + reasons that they didn't pass, but much faster for large partition sets. A backfill of 125k partition keys that used to take 20 minutes to compute now takes about 10 seconds.
How I Tested These Changes
asset back, dry runs on existing large backfills
Changelog
Performance improvements for backfills of large partition sets.