Fixing collection ingest DAG #261

Closed · wants to merge 19 commits
Conversation

amarouane-ABDELHAK (Contributor)

Summary

Fixing collection ingest

Changes

  • Ingest Collection DAG

ividito (Contributor) left a comment:

Is there additional context to this PR? I didn't know we had issues with collection ingests.

Comment on lines 56 to 61
run_discover_build_and_push = TriggerMultiDagRunOperator(
    task_id="trigger_discover_items_dag",
    dag=dag,
    trigger_dag_id="veda_discover",
    python_callable=trigger_discover_and_build_task,
)
Contributor:

We should not use this operator, as it creates a disconnect between the original DAGRun event and the subsequent DAG runs (i.e. a failure in an instance of veda-discover will not feed back to the original veda-dataset DAG when using this operator). We could instead describe the discover pipeline in a TaskGroup and call expand() on it, which will map task instances to the correct event.
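For reference, a rough sketch of the TaskGroup + expand() pattern being suggested, assuming Airflow 2.5+ (where task groups can be mapped); the group, task, and DAG names (discover_group, discover_items, build_stac_items, get_payloads, veda_dataset_example) are placeholders, not the actual VEDA tasks:

    from pendulum import datetime

    from airflow.decorators import dag, task, task_group


    @task
    def discover_items(payload: dict) -> list:
        # Placeholder: discover the assets described by one payload.
        return [payload]


    @task
    def build_stac_items(discovered: list) -> list:
        # Placeholder: build STAC items for the discovered assets.
        return discovered


    @task_group
    def discover_group(payload: dict):
        # Mapped group instances stay attached to the triggering DAG run,
        # so a failure here is visible from the parent DAG's grid view.
        build_stac_items(discover_items(payload))


    @dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
    def veda_dataset_example():
        @task
        def get_payloads(dag_run=None) -> list:
            # Placeholder: split dag_run.conf into one payload per discovery.
            return [dag_run.conf]

        # One mapped instance of the whole discover pipeline per payload.
        discover_group.expand(payload=get_payloads())


    veda_dataset_example()

Unlike TriggerMultiDagRunOperator, the mapped instances show up (and can be retried) inside the same DAG run that carried the original conf.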

Contributor (Author):

I want to re-use the discover_build_ingest DAG instead of redefining the same tasks, but having a TaskGroup used in both DAGs is not a bad idea.

@@ -41,7 +41,9 @@ COPY --chown=airflow:airflow scripts "${AIRFLOW_HOME}/scripts"

RUN cp ${AIRFLOW_HOME}/configuration/airflow.cfg* ${AIRFLOW_HOME}/.

RUN pip install pypgstac==0.7.4
# Commented out because it downgrades pydantic to v1
Contributor:

This is an artifact; we should not need pypgstac in Airflow at all, since that should be entirely handled by the ingest API.

Comment on lines +48 to +80
@task()
def generate_collection_task(ti):
    config = ti.dag_run.conf
    airflow_vars_json = Variable.get("aws_dags_variables", deserialize_json=True)
    role_arn = airflow_vars_json.get("ASSUME_ROLE_READ_ARN")

    # TODO it would be ideal if this also works with complete collections where provided - this would make the collection ingest more re-usable
    generator = GenerateCollection()
    collection = generator.generate_stac(
        dataset_config=config, role_arn=role_arn
    )
    return collection

@task()
def ingest_collection_task(collection):
    """
    Ingest a collection into the STAC catalog

    Args:
        collection:
    """
    airflow_vars_json = Variable.get("aws_dags_variables", deserialize_json=True)
    cognito_app_secret = airflow_vars_json.get("COGNITO_APP_SECRET")
    stac_ingestor_api_url = airflow_vars_json.get("STAC_INGESTOR_API_URL")

    return submission_handler(
        event=collection,
        endpoint="/collections",
        cognito_app_secret=cognito_app_secret,
        stac_ingestor_api_url=stac_ingestor_api_url
    )
Contributor:

Let's define these task() functions outside of the TaskGroup, so they can be re-used on their own as well.
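One possible shape of that suggestion, sketched with the task names from the diff above; the @task bodies are elided and the collection_task_group wrapper name is hypothetical:

    from airflow.decorators import task, task_group


    @task()
    def generate_collection_task(ti=None):
        # Module-level, so other DAGs can import and re-use it on its own.
        ...


    @task()
    def ingest_collection_task(collection):
        # Module-level, so e.g. stactools DAGs can keep using it with their
        # own collection generators.
        ...


    @task_group(group_id="collection_task_group")
    def collection_task_group():
        # The group only wires the shared tasks together for the dataset DAG.
        ingest_collection_task(generate_collection_task())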

Contributor:

^^ this is a breaking change for stactools, which come with their own collection generators, but re-use our collection submission task

Comment on lines 58 to 63
run_discover_build_and_push = TriggerDagRunOperator.partial(
    task_id="trigger_discover_items_dag",
    trigger_dag_id="veda_discover",
    wait_for_completion=True,
).expand(conf=items) >> end

collection_grp.set_upstream(start)
submit_stac.set_downstream(end)
Contributor:

It looks like this will allow the parent DAG to properly track the state of mapped+triggered DAGs, but still recreates a disconnect (the link created does not associate DAG runs with a parent DAG, it only associates the DAG itself with its parent).

Comment on lines +52 to +56
def get_files_task(payload, ti=None):
    """
    Get files from S3 produced by discovery or dataset tasks.
    Handles both single payload and multiple payload scenarios.
    """
Contributor:

🚀, this is much nicer than what we were doing before

dags/veda_data_pipeline/veda_discover_pipeline.py (outdated, resolved)
Comment on lines +97 to +101
run_discover_build_and_push = TriggerDagRunOperator.partial(
    task_id="trigger_discover_items_dag",
    trigger_dag_id="veda_discover",
    wait_for_completion=True
).expand(conf=mutated_payloads) >> end
Contributor:

I get that this promotes reusability at the level of the discover DAG, but I worry that this creates too much of a breakdown in reuse and observability at the task level.

  • Rather than passing task-specific data structures as parameters, we end up relying on ti.dag_run.conf. This means that, unless we always build and ingest items through the discover DAG, we cannot rely on the data format being the same for different DAGs with different ingestion strategies. This will eventually bring us back to the same solution we have now, where new DAGs skip the discover DAG and simply reuse its tasks, with additional wrapper steps to manipulate the input. We should try to lean into this: the changes you made to get_files() are a great example, where we promote modularity and reuse at the task level so that it doesn't matter what the incoming dag_run.conf is (see the sketch after this list).
  • Similarly, this condenses task status into a single node in the dataset DAG. Meanwhile, in the triggered discover DAG, there are several expanded steps with unique failure conditions. This means that a failure in one step in one triggered DAG will require a retry of the complete DAG, rather than a single task.
  • TriggerDagRunOperator adds a link between DAGs, but not between executions. I was hoping this would work better with expand(), but unfortunately we're out of luck. This (and a few similar issues) is being tracked as a bug in Airflow, but until it's fixed, I think we should steer clear so as to maintain execution-level observability, especially in DAG runs with a large number of mapped discovery tasks.
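A minimal sketch of the task-level reuse described in the first bullet; the names and the payload structure ("objects") are placeholders:

    from airflow.decorators import task


    @task
    def get_files_task(payload: dict, ti=None):
        # Takes an explicit payload instead of reading ti.dag_run.conf, so any
        # DAG (dataset, discover, stactools) can map it over inputs it prepared
        # itself, regardless of what its own dag_run.conf looks like.
        return payload.get("objects", [])


    # Hypothetical usage from a calling DAG:
    # files = get_files_task.expand(payload=prepare_payloads_task())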

Contributor (Author):

Yes, I agree that traceability will be necessary for debugging any issues. Some have reported that this problem is fixed in Airflow 2.10. However, we still need to promote task reusability to avoid defining the same logic twice. I will close this PR and open a new one with some refactoring to make debugging and development easier.

Contributor (Author):

Closing this in favor of #268
