Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make asset backfill eligibility logic subset-aware instead of iterating through partition keys individually #25997

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gibsondan
Copy link
Member

@gibsondan gibsondan commented Nov 19, 2024

Summary & Motivation

This PR changes the core asset backfill eligibility logic to make it work with subsets instead of individual asset partition keys. This makes things more complicated (instead of returning a boolean + string in the bfs logic, we have to return the subset that passed and the subsets + reasons that they didn't pass, but much faster for large partition sets. A backfill of 125k partition keys that used to take 20 minutes to compute now takes about 10 seconds.

How I Tested These Changes

asset back, dry runs on existing large backfills

Changelog

Performance improvements for backfills of large partition sets.

@gibsondan gibsondan force-pushed the bfsreal branch 2 times, most recently from de05d19 to 3af0522 Compare November 19, 2024 03:14
@gibsondan gibsondan force-pushed the nonexistant branch 2 times, most recently from 7ad79e0 to 04f3df1 Compare November 19, 2024 22:52
@gibsondan gibsondan force-pushed the nonexistant branch 2 times, most recently from 0e17d76 to 7a03d4f Compare November 22, 2024 23:07
@gibsondan gibsondan changed the base branch from nonexistant to upstreamreal November 24, 2024 16:28
@gibsondan gibsondan changed the title Make asset backfills subset-aware Make asset backfill eligibility logic subset-aware instead of iterating through partition keys individually Nov 25, 2024
@gibsondan gibsondan marked this pull request as ready for review November 25, 2024 04:40
Copy link
Contributor

@OwenKephart OwenKephart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly, I think this is looking really good, but this is a big change to a complicated bit of machinery, so I'm being a bit pickier about naming / code structure just to make sure this is easier to interact with in the future

Comment on lines 198 to 175
serializable_entity_subset = entity_subset.convert_to_serializable_subset()

serializable_entity_subsets = [
SerializableEntitySubset(key=asset_key, value=serializable_entity_subset.value)
for asset_key in execution_set_keys
]

entity_subsets = [
check.not_none(
self._asset_graph_view.get_subset_from_serializable_subset(
serializable_entity_subset
)
)
for serializable_entity_subset in serializable_entity_subsets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

painge, but not really a better option here. could be nice to extract this into a utility (DA has to do something similar), but that doesn't need to be in this PR

@gibsondan gibsondan force-pushed the upstreamreal branch 2 times, most recently from 4c10f54 to 7c76dc2 Compare December 20, 2024 21:54
@gibsondan gibsondan force-pushed the bfsreal branch 2 times, most recently from a2ec297 to a20c4cb Compare December 20, 2024 22:04
Base automatically changed from upstreamreal to master December 20, 2024 23:23
gibsondan added a commit that referenced this pull request Dec 27, 2024
Summary:
Pulling out the new BFS utility added in #25997 to its own function with its own tests to keep the PR size manageable.

Test Plan: New tests

NOCHANGELOG
@gibsondan gibsondan changed the base branch from master to bfsreal0 December 27, 2024 22:45
Copy link
Member Author

@gibsondan gibsondan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the new BFS method into #26751 to make this PR just about backfills and using that method

@gibsondan gibsondan force-pushed the bfsreal branch 2 times, most recently from b93ead9 to 41e786b Compare December 30, 2024 16:40
Copy link
Contributor

@OwenKephart OwenKephart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very exciting!!

Copy link
Contributor

@jamiedemaria jamiedemaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have one test that would be good to add, but otherwise looks good!

repository = code_location.get_repository("test_repo")
asset_graph = repository.asset_graph
_execute_asset_backfill_iteration_no_side_effects(graphql_context, backfill_id, asset_graph)
_execute_asset_backfill_iteration_no_side_effects(graphql_context, backfill_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this cleanup is really nice

@gibsondan gibsondan dismissed jamiedemaria’s stale review December 30, 2024 22:29

incorporated feedback

gibsondan added a commit that referenced this pull request Dec 31, 2024
Summary:
Pulling out the new BFS utility added in #25997 to its own function with its own tests to keep the PR size manageable.

Test Plan: New tests

NOCHANGELOG
gibsondan added a commit that referenced this pull request Dec 31, 2024
Summary:
Pulling out the new BFS utility added in
#25997 to its own function
with its own tests to keep the PR size manageable.

Test Plan: New tests

NOCHANGELOG

## Summary & Motivation

## How I Tested These Changes

## Changelog

> Insert changelog entry or delete this section.
Base automatically changed from bfsreal0 to master December 31, 2024 15:56
refactor

> Insert changelog entry or delete this section.

moar refactor

> Insert changelog entry or delete this section.

subsets instead of values

> Insert changelog entry or delete this section.

more progress on subsets instead of values

> Insert changelog entry or delete this section.

more progress

> Insert changelog entry or delete this section.

yet more progress

> Insert changelog entry or delete this section.

more progress

> Insert changelog entry or delete this section.

more

> Insert changelog entry or delete this section.

renames

> Insert changelog entry or delete this section.

final names?

> Insert changelog entry or delete this section.

owen feedback

> Insert changelog entry or delete this section.
@gibsondan
Copy link
Member Author

I saw this comment here about the order in which runs are launched by the backfill daemon #26207 (reply in thread) - wondering if I need to worry at all about that behavior changing / if we were relying on that sort key (which is no longer present) anywhere?

@jamiedemaria
Copy link
Contributor

Looks like we do some additional ordering of the run requests here for single-partition runs, but dont have any similar sorting for ranged-partition runs

I'm not sure the sort key was doing anything for the resulting request order. The asset partitions to request that are returned from the BFS search are returned as a set so we'd have no guarantees about order anyway. I think if we wanted to make guarantees about the order run requests are launched, we'd need to do it at the time of constructing the run requests.

could still be worth writing a test though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants