[rfc] Implement Python API for launching asset backfills #26731

deepyaman · 2024-12-27T04:18:18Z

Summary & Motivation

Background

@smackesey has interacted with several users on the following scenario:

There is some computation (run or backfill) X that should only be launched under certain conditions
The conditions under which X should be launched are non-trivial to compute
It is desired to manually trigger conditional launch of X-- that is, on clicking a button, the conditions under which X should be launched are computed and X is launched if those conditions are satisfied.

Problem Statement

We do not have a good solution to the above scenario.

Launch job Y that conditionally launches X
- If X is a job, execute_job will work.
- If X is a backfill, then there is no good public API to launch it. User must use the private GQL API.
Manually trigger sensor that conditionally launches X
- The way you manually trigger a sensor is by pressing the "Test Sensor" button, which is non-ideal and kinda buried.
- You then have to pass each generated RunRequest through to the launchpad and manually launch.
  - This works OK if X is a job.
  - This is unworkable if X is a backfill, because you have to manually launch each RunRequest

So if X is a backfill, basically your only option is to use a job that then uses the private GQL API.

Objective

The purpose of this enhancement is to improve user experience. This will allow users to not have to resort to workarounds to perform a manual trigger a conditional launch of X.

The manifestation of this is to expose a public Python API for creating and launching asset backfills.

How I Tested These Changes

Added unit tests mirroring a couple of existing backfill tests, but using the newly-added Python API.

TODO

I wanted to get feedback/validation that I'm on the right track before proceeding much further, but I understand that the following things probably still need doing:

Test validation of arguments/argument combinations
Add additional execution tests?
Improve docstring (with examples, etc.)
Add public-facing docs?

Changelog

Exposed a Python API for creating and launching asset backfills.

deepyaman · 2024-12-27T22:35:51Z

python_modules/dagster-graphql/dagster_graphql/implementation/execution/backfill.py

    ) and partitions_by_assets:
        raise DagsterInvariantViolationError(
-            "partitions_by_assets cannot be used together with asset_selection, selector, or"
-            " partitionNames"
+            "partitionsByAssets cannot be used together with assetSelection, selector, or"


At the very least, I feel like the capitalization should be consistent; the camel-casing is aligned with the GraphQL API names.

deepyaman · 2024-12-27T22:36:57Z

python_modules/dagster-graphql/dagster_graphql/implementation/execution/backfill.py

-            "partitions_by_assets cannot be used together with asset_selection, selector, or"
-            " partitionNames"
+            "partitionsByAssets cannot be used together with assetSelection, selector, or"
+            " partitionNames or if allPartitions is True"


I think it makes sense to validate allPartitions, too, if checking partitionNames; neither can be used with partitionsByAssets.

I would be happy to split this and the above change into a separate (stacked?) PR, since they're not directly related to creating the Python API; just thought I'd do some housekeeping as I was learning about it for the Python API implementation.

deepyaman · 2024-12-28T00:02:23Z

python_modules/dagster-graphql/dagster_graphql/implementation/execution/backfill.py

+            asset_graph=asset_graph,
            backfill_timestamp=backfill_timestamp,
+            tags=tags,


Nit: Minimal argument order alignment across calls.

deepyaman · 2024-12-28T00:02:59Z

python_modules/dagster/dagster/_core/instance/__init__.py

@@ -2260,7 +2262,7 @@ def get_event_tags_for_asset(
    @public
    @traced
    def wipe_assets(self, asset_keys: Sequence[AssetKey]) -> None:
-        """Wipes asset event history from the event log for the given asset keys.
+        """Wipe asset event history from the event log for the given asset keys.


Nit: All of the other public methods in this file use imperative mood.

deepyaman · 2024-12-28T00:05:04Z

python_modules/dagster/dagster/_core/instance/__init__.py

+    @public
+    def launch_backfill(
+        self,
+        asset_graph: "BaseAssetGraph",


I decided to directly ask for the asset graph as an argument, since there are already tutorials that ask the user to pass asset graph. I didn't really see any reason to ask for a broader Definitions object.

deepyaman · 2024-12-28T00:06:08Z

python_modules/dagster/dagster/_core/instance/__init__.py

+        Returns:
+            str: The ID of the backfill.
+        """
+        # TODO(deepyaman): Abstract logic shared with `dagster-graphql`.


Suggested change

# TODO(deepyaman): Abstract logic shared with `dagster-graphql`.

Actually, as I got to implementing more of this, maybe they're sufficiently different that this isn't worth it.

deepyaman · 2024-12-28T00:20:46Z

python_modules/dagster/dagster/_core/instance/__init__.py

+        partitions_by_assets: Optional[
+            Sequence[Tuple[AssetKey, Optional[Sequence[Tuple[str, str]]]]]
+        ] = None,


This avoids exposing any additional objects, but I suppose an explanation/examples will be necessary for anybody unfamiliar with Dagster internals already.

deepyaman · 2024-12-28T01:44:37Z

python_modules/dagster-graphql/dagster_graphql/implementation/execution/backfill.py

        )

    tags = {t["key"]: t["value"] for t in backfill_params.get("tags", [])}

    tags = {**tags, **graphene_info.context.get_viewer_tags()}

+    title = check_valid_title(backfill_params.get("title"))


It seems reasonable to use the existing utility. Also, I think there's no reason it should only be happening for job backfills.

This is another one I'd be happy to move into a separate PR, if so desired.

doing it for only job backfills is 100% just a goof on my part. we should be doing it for all types of backfills - thanks for catching!

switching to the check version of the fn is also fine w me. I think originally we were considering re-raising as a GrapheneError so that it had a more immediate failure for the user. But we've kind of abandoned adding titles/descriptions to backfills for now anyway, so keeping it as the generic python error is fine

@jamiedemaria Great! I raised a new PR just addressing this (tagged you, if you can take a quick look), since it's agreed upon, while I go about changing this broader PR.

deepyaman · 2024-12-28T02:07:46Z

python_modules/dagster/dagster/_core/instance/__init__.py

+            )
+
+        # Check that the partition keys can be found in the asset graph.
+        # TODO(deepyaman): Abstract logic shared with `dagster-graphql`.


I think this one can definitely be deduplicated; probably can be done in a subsequent PR?

OwenKephart

After sleeping on this, I feel like putting this logic on the DagsterInstance itself isn't quite right.

At the core, there are two things we want to be able to do here:

Express some intent to backfill some arbitrary set of partitions
Send a message to the instance to act on that intent

We already have an existing method for (2) (instance.add_backfill), and so the question is what is the best way of exposing (1) to the user.

We already have the PartitionBackfill object, and so putting this layer of indirection for constructing the partition backfill feels off, especially because none of the code in the added method really interacts with the core instance database outside of a single call to self.add_backfill().

Right now, this single method supports a wide array of different parameter combinations, with only some subset of them being valid, which is pretty confusing for a user, even if we add a ton of examples of valid behavior.

I think the better way of factoring this behavior would be to just expose PartitionBackfill more directly, and make the existing static constructors public (with a potentially a bit of light refactoring of the input types, e.g. backfill_id could be an optional parameter -- if not provided, we'd create one for the user).

To start, we could just expose .from_asset_partitions() as public, as I think that's the most common case, and its existing implementation is less subject to change than the from_partitions_by_assets (which has extra custom types)

jamiedemaria · 2024-12-31T16:31:04Z

++1 I think exposing the constructors on PartitionBackfill is a lot cleaner and gives us more flexibility in the future

jamiedemaria

Have there been discussions in the build team about the implications of allowing users to launch backfills (and other jobs) from runs? I know in the past we've shied away from allowing/promoting this behavior and instead pointed users to sensors. Would love to get up to date on what you are all thinking in terms of best practices here

jamiedemaria · 2024-12-31T16:16:23Z

python_modules/dagster-graphql/dagster_graphql/implementation/execution/backfill.py

        )

    tags = {t["key"]: t["value"] for t in backfill_params.get("tags", [])}

    tags = {**tags, **graphene_info.context.get_viewer_tags()}

+    title = check_valid_title(backfill_params.get("title"))


doing it for only job backfills is 100% just a goof on my part. we should be doing it for all types of backfills - thanks for catching!

switching to the check version of the fn is also fine w me. I think originally we were considering re-raising as a GrapheneError so that it had a more immediate failure for the user. But we've kind of abandoned adding titles/descriptions to backfills for now anyway, so keeping it as the generic python error is fine

## Summary & Motivation Fix a mistaken inconsistency; see #26731 (comment). ## How I Tested These Changes BK ## Changelog Added validation of `title` for asset backfills (not just for job backfills).

deepyaman force-pushed the deepyaman/feat/python-backfill-api branch 2 times, most recently from cddb72a to f7133d7 Compare December 27, 2024 14:14

deepyaman commented Dec 27, 2024

View reviewed changes

deepyaman commented Dec 28, 2024

View reviewed changes

deepyaman changed the title ~~Implement Python API for launching asset backfills~~ [rfc] Implement Python API for launching asset backfills Dec 28, 2024

deepyaman commented Dec 28, 2024

View reviewed changes

deepyaman marked this pull request as ready for review December 28, 2024 02:07

deepyaman requested review from OwenKephart, smackesey and jamiedemaria December 28, 2024 02:08

deepyaman force-pushed the deepyaman/feat/python-backfill-api branch from ab28602 to 4423290 Compare December 31, 2024 04:19

OwenKephart requested changes Dec 31, 2024

View reviewed changes

jamiedemaria requested changes Dec 31, 2024

View reviewed changes

deepyaman mentioned this pull request Dec 31, 2024

Also validate backfill title for asset backfills #26767

Merged

deepyaman added 7 commits December 31, 2024 17:16

Implement Python API for launching asset backfills

91dccec

Create PartitionsByAssetSelector list from input

fe36399

Return backfill ID, test from partitions by assets

10cb3f6

Test backfill from asset partitions, validate more

879f946

Check title for asset backfills, and in Python API

43d8372

Check valid asset partition backfill in Python API

a795657

Accept set of keys instead of partition key ranges

2186ef0

deepyaman force-pushed the deepyaman/feat/python-backfill-api branch from 4423290 to 2186ef0 Compare January 1, 2025 00:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[rfc] Implement Python API for launching asset backfills #26731

[rfc] Implement Python API for launching asset backfills #26731

deepyaman commented Dec 27, 2024 •

edited

Loading

deepyaman Dec 27, 2024

deepyaman Dec 27, 2024

deepyaman Dec 28, 2024

deepyaman Dec 28, 2024

deepyaman Dec 28, 2024

deepyaman Dec 28, 2024

deepyaman Dec 28, 2024

deepyaman Dec 28, 2024 •

edited

Loading

jamiedemaria Dec 31, 2024

deepyaman Dec 31, 2024

deepyaman Dec 28, 2024

OwenKephart left a comment

jamiedemaria commented Dec 31, 2024

jamiedemaria left a comment

jamiedemaria Dec 31, 2024

[rfc] Implement Python API for launching asset backfills #26731

Are you sure you want to change the base?

[rfc] Implement Python API for launching asset backfills #26731

Conversation

deepyaman commented Dec 27, 2024 • edited Loading

Summary & Motivation

Background

Problem Statement

Objective

How I Tested These Changes

TODO

Changelog

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

deepyaman Dec 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OwenKephart left a comment

Choose a reason for hiding this comment

jamiedemaria commented Dec 31, 2024

jamiedemaria left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

deepyaman commented Dec 27, 2024 •

edited

Loading

deepyaman Dec 28, 2024 •

edited

Loading