From afa635d8c0922be74c24ada44d0465f73637d715 Mon Sep 17 00:00:00 2001
From: Pankaj Koti
Date: Thu, 3 Oct 2024 22:13:16 +0530
Subject: [PATCH] Refactor docs for async mode execution (#1241)

Following up on the documentation added in PRs #1224 and #1230, this PR
refactors the documentation for Async Execution mode, particularly the
limitations section. It also addresses a couple of un-rendered items in
the scheduling.rst file, caused by missing blank lines after the
code-block directive.
---
 docs/configuration/scheduling.rst        |  3 +++
 docs/getting_started/execution-modes.rst | 18 +++++++++---------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/docs/configuration/scheduling.rst b/docs/configuration/scheduling.rst
index 60e466d34..b5d2c1821 100644
--- a/docs/configuration/scheduling.rst
+++ b/docs/configuration/scheduling.rst
@@ -20,6 +20,7 @@ To schedule a dbt project on a time-based schedule, you can use Airflow's schedu
         schedule="@daily",
     )
 
+.. _data-aware-scheduling:
 
 Data-Aware Scheduling
 ---------------------
@@ -77,6 +78,7 @@ If using cosmos with an Airflow 2.9 or below, users will experience the followin
 Example of scheduler logs:
 
 .. code-block::
+
     scheduler | [2023-09-08T10:18:34.252+0100] {scheduler_job_runner.py:1742} INFO - Orphaning unreferenced dataset 'postgres://0.0.0.0:5432/postgres.public.stg_customers'
     scheduler | [2023-09-08T10:18:34.252+0100] {scheduler_job_runner.py:1742} INFO - Orphaning unreferenced dataset 'postgres://0.0.0.0:5432/postgres.public.stg_payments'
     scheduler | [2023-09-08T10:18:34.252+0100] {scheduler_job_runner.py:1742} INFO - Orphaning unreferenced dataset 'postgres://0.0.0.0:5432/postgres.public.stg_orders'
@@ -105,5 +107,6 @@ For users to overcome this limitation in local tests, until the Airflow communit
 they can set this configuration to ``False``. It can also be set in the ``airflow.cfg`` file:
 
 .. code-block::
+
     [cosmos]
     enable_dataset_alias = False
diff --git a/docs/getting_started/execution-modes.rst b/docs/getting_started/execution-modes.rst
index 10f6cce67..9f40d7348 100644
--- a/docs/getting_started/execution-modes.rst
+++ b/docs/getting_started/execution-modes.rst
@@ -267,17 +267,17 @@ machine it took approximately 25 seconds for the task to compile & upload the co
 however, it is still a win as it is one-time overhead and the subsequent tasks run asynchronously utilising the Airflow's
 deferrable operators and supplying to them those compiled SQLs.
 
-Note that currently, the ``airflow_async`` execution mode has the following limitations and is released as Experimental:
+Note that currently, the ``airflow_async`` execution mode has the following limitations and is released as **Experimental**:
 
-1. This feature only works when using Airflow 2.8 and above
-2. Only supports the ``dbt resource type`` models to be run asynchronously using Airflow deferrable operators. All other resources are executed synchronously using dbt commands as they are in the ``local`` execution mode.
-3. Only supports BigQuery as the target database. If a profile target other than BigQuery is specified, Cosmos will error out saying that the target database is not supported with this execution mode.
-4. Only works for ``full_refresh`` models. There is pending work to support other modes.
-5. Only Support for the Bigquery profile type
-6. Users need to provide ProfileMapping parameter in ProfileConfig
-7. It does not support dataset
+1. **Airflow 2.8 or higher required**: This mode relies on Airflow's `Object Storage `__ feature, introduced in Airflow 2.8, to store and retrieve compiled SQLs.
+2. **Limited to dbt models**: Only dbt resource type models are run asynchronously using Airflow deferrable operators. Other resource types are executed synchronously, similar to the local execution mode.
+3. **BigQuery support only**: This mode only supports BigQuery as the target database. If a different target is specified, Cosmos will throw an error indicating the target database is unsupported in this mode.
+4. **ProfileMapping parameter required**: You need to specify the ``ProfileMapping`` parameter in the ``ProfileConfig`` for your DAG. Refer to the example DAG below for details on setting this parameter.
+5. **Supports only full_refresh models**: Currently, only ``full_refresh`` models are supported. To enable this, pass ``full_refresh=True`` in the ``operator_args`` of the ``DbtDag`` or ``DbtTaskGroup``. Refer to the example DAG below for details on setting this parameter.
+6. **location parameter required**: You must specify the location of the BigQuery dataset in the ``operator_args`` of the ``DbtDag`` or ``DbtTaskGroup``. The example DAG below provides guidance on this.
+7. **No dataset emission**: The async run operators do not currently emit datasets, meaning that :ref:`data-aware-scheduling` is not supported at this time. Future releases will address this limitation.
 
-You can leverage async operator support by installing an additional dependency
+To start leveraging async execution mode that is currently supported for the BigQuery profile type targets you need to install Cosmos with the below additional dependencies:
 
 .. code:: bash