Experimental BQ support to run dbt models with `ExecutionMode.AIRFLOW_ASYNC` (#1230)

Enable BQ users to run dbt models (`full_refresh`) asynchronously. This releases the Airflow worker node from waiting while the transformation (I/O) happens in the data warehouse, increasing overall Airflow task throughput (more information: https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/deferring.html).

As part of this change, we introduce the capability of running the actual SQL transformations without using the dbt command. This also avoids creating subprocesses in the worker node (`ExecutionMode.LOCAL` with `InvocationMode.SUBPROCESS` and `ExecutionMode.VIRTUALENV`) or the overhead of creating a Kubernetes Pod to execute the actual dbt command (`ExecutionMode.KUBERNETES`), which can avoid issues related to memory and CPU usage.

This PR takes advantage of an async operator already implemented upstream in Airflow by extending it in the Cosmos async operator. It also utilizes the pre-compiled SQL generated as part of PR #1224, downloading the generated SQL from a remote location (S3/GCS), which allows us to decouple from dbt during task execution.

## Details

- Expose `get_profile_type` on `ProfileConfig`: this aids in database selection.
- ~~Add `async_op_args`: a high-level parameter to forward arguments to the upstream operator (Airflow operator). (This may change in this PR itself.)~~ The async operator params are processed as kwargs in the `operator_args` parameter.
- Implement `DbtRunAirflowAsyncOperator`: this initializes the Airflow operator, retrieves the SQL query at task runtime from a remote location, modifies the query as needed, and triggers the upstream `execute` method.

## Limitations

- This feature only works with Airflow 2.8 and above.
- The async execution only works for BigQuery.
- The async execution only supports running dbt models (other dbt resources, such as seeds, sources, snapshots, and tests, are run using `ExecutionMode.LOCAL`).
- This only works if the user sets `full_refresh=True` in `operator_args` (which means tables are dropped before being populated, as implemented in `dbt-core`).
- Users need to use a `ProfileMapping` in `ProfileConfig`, since Cosmos relies on having the connection (credentials) to be able to run the transformation in BQ without `dbt-core`.
- Users must provide the BQ `location` in `operator_args` (this is a limitation of the `BigQueryInsertJobOperator` that is used to implement the native Airflow asynchronous support). See the DAG sketch below for how these requirements fit together.
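Putting the limitations above together, a minimal sketch of a DAG using the async mode might look like the following. This is an illustrative sketch, not code from this PR: the project path, connection id, dataset, location, and DAG id are placeholders.

```python
import os
from datetime import datetime
from pathlib import Path

from cosmos import DbtDag, ExecutionConfig, ExecutionMode, ProfileConfig, ProjectConfig
from cosmos.profiles import GoogleCloudServiceAccountFileProfileMapping

DBT_ROOT_PATH = Path(os.getenv("DBT_ROOT_PATH", "/usr/local/airflow/dbt"))

profile_config = ProfileConfig(
    profile_name="default",
    target_name="dev",
    # A ProfileMapping is required: Cosmos uses the mapped Airflow connection to run the
    # transformation directly in BigQuery, without invoking dbt-core at task runtime.
    profile_mapping=GoogleCloudServiceAccountFileProfileMapping(
        conn_id="gcp_conn",  # placeholder connection id
        profile_args={"dataset": "my_dataset"},  # placeholder dataset
    ),
)

my_async_dag = DbtDag(
    project_config=ProjectConfig(DBT_ROOT_PATH / "my_dbt_project"),
    profile_config=profile_config,
    execution_config=ExecutionConfig(execution_mode=ExecutionMode.AIRFLOW_ASYNC),
    operator_args={
        "full_refresh": True,  # currently required: tables are dropped and recreated
        "location": "US",  # required by BigQueryInsertJobOperator in deferrable mode
    },
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    dag_id="my_async_dag",
)
```

At task runtime, `DbtRunAirflowAsyncOperator` downloads the SQL that was pre-compiled for each model from the configured remote target path (see the Configuration section below) and submits it to BigQuery via the deferrable `BigQueryInsertJobOperator`, so no dbt command runs on the Airflow worker.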
## Testing

We have added a new dbt project to the repository to facilitate asynchronous task execution. The goal is to accelerate development without disrupting or requiring fixes for the existing tests. We have also added a DAG for end-to-end testing: https://github.com/astronomer/astronomer-cosmos/blob/bd6657a29b111510fc34b2baf0bcc0d65ec0e5b9/dev/dags/simple_dag_async.py

## Configuration

Users need to configure the parameters below to execute deferrable tasks in Cosmos (a sketch of setting the two remote-target options appears at the end of this description):

- [ExecutionMode: AIRFLOW_ASYNC](https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html)
- [remote_target_path](https://astronomer.github.io/astronomer-cosmos/configuration/cosmos-conf.html#remote-target-path)
- [remote_target_path_conn_id](https://astronomer.github.io/astronomer-cosmos/configuration/cosmos-conf.html#remote-target-path-conn-id)

Example DAG: https://github.com/astronomer/astronomer-cosmos/blob/bd6657a29b111510fc34b2baf0bcc0d65ec0e5b9/dev/dags/simple_dag_async.py

## Installation

You can leverage async operator support by installing an additional dependency:

```
astronomer-cosmos[dbt-bigquery, google]
```

## Documentation

The PR also documents the limitations and usage of Airflow async execution in Cosmos.

## Related Issue(s)

Related to: #1120
Closes: #1134

## Breaking Change?

No.

## Notes

This is an experimental feature, and as such, it may undergo breaking changes. We encourage users to share their experiences and feedback to improve it further. We'd love support and feedback so we can define the next steps.

## Checklist

- [x] I have made corresponding changes to the documentation (if required)
- [x] I have added tests that prove my fix is effective or that my feature works

## Credits

This was a result of teamwork and effort:

Co-authored-by: Pankaj Koti <[email protected]>
Co-authored-by: Tatiana Al-Chueyr <[email protected]>

## Future Work

- Design an interface to facilitate the easy addition of new asynchronous database operators (#1238)
- Improve the test coverage (#1239)
- Address the limitations (we need to log these issues)

---------

Co-authored-by: Pankaj Koti <[email protected]>
Co-authored-by: Tatiana Al-Chueyr <[email protected]>
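As a footnote to the Configuration section above, here is a sketch of one way to provide the two remote-target settings, assuming they map to the `[cosmos]` section of the Airflow configuration as described in the linked cosmos-conf docs. The bucket and connection id are placeholders, and the equivalent entries can also be set directly in `airflow.cfg`.

```python
import os

# Remote path where Cosmos uploads the pre-compiled dbt SQL, and the Airflow
# connection used to read it back at task runtime (placeholder values).
os.environ["AIRFLOW__COSMOS__REMOTE_TARGET_PATH"] = "gs://my-bucket/target"
os.environ["AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID"] = "gcp_conn"
```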
1 parent 93eb17e · commit 111d430
Showing 37 changed files with 1,226 additions and 118 deletions.
@@ -1,67 +1,190 @@
 from __future__ import annotations

+import inspect
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Sequence
+
+from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
+from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
+from airflow.utils.context import Context
+
+from cosmos import settings
+from cosmos.config import ProfileConfig
+from cosmos.exceptions import CosmosValueError
+from cosmos.operators.base import AbstractDbtBaseOperator
 from cosmos.operators.local import (
     DbtBuildLocalOperator,
     DbtCompileLocalOperator,
     DbtDocsAzureStorageLocalOperator,
     DbtDocsGCSLocalOperator,
     DbtDocsLocalOperator,
     DbtDocsS3LocalOperator,
+    DbtLocalBaseOperator,
     DbtLSLocalOperator,
     DbtRunLocalOperator,
     DbtRunOperationLocalOperator,
     DbtSeedLocalOperator,
     DbtSnapshotLocalOperator,
     DbtSourceLocalOperator,
     DbtTestLocalOperator,
 )
+from cosmos.settings import remote_target_path, remote_target_path_conn_id

+_SUPPORTED_DATABASES = ["bigquery"]

-class DbtBuildAirflowAsyncOperator(DbtBuildLocalOperator):
-    pass
+from abc import ABCMeta


-class DbtLSAirflowAsyncOperator(DbtLSLocalOperator):
-    pass
+from airflow.models.baseoperator import BaseOperator


-class DbtSeedAirflowAsyncOperator(DbtSeedLocalOperator):
-    pass


-class DbtSnapshotAirflowAsyncOperator(DbtSnapshotLocalOperator):
-    pass


-class DbtSourceAirflowAsyncOperator(DbtSourceLocalOperator):
-    pass
+class DbtBaseAirflowAsyncOperator(BaseOperator, metaclass=ABCMeta):
+    def __init__(self, **kwargs) -> None:  # type: ignore
+        self.location = kwargs.pop("location")
+        self.configuration = kwargs.pop("configuration", {})
+        super().__init__(**kwargs)


-class DbtRunAirflowAsyncOperator(DbtRunLocalOperator):
+class DbtBuildAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtBuildLocalOperator):  # type: ignore
     pass


-class DbtTestAirflowAsyncOperator(DbtTestLocalOperator):
+class DbtLSAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtLSLocalOperator):  # type: ignore
     pass


-class DbtRunOperationAirflowAsyncOperator(DbtRunOperationLocalOperator):
+class DbtSeedAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtSeedLocalOperator):  # type: ignore
     pass


-class DbtDocsAirflowAsyncOperator(DbtDocsLocalOperator):
+class DbtSnapshotAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtSnapshotLocalOperator):  # type: ignore
     pass


-class DbtDocsS3AirflowAsyncOperator(DbtDocsS3LocalOperator):
+class DbtSourceAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtSourceLocalOperator):  # type: ignore
     pass


-class DbtDocsAzureStorageAirflowAsyncOperator(DbtDocsAzureStorageLocalOperator):
+class DbtRunAirflowAsyncOperator(BigQueryInsertJobOperator):  # type: ignore

+    template_fields: Sequence[str] = (
+        "full_refresh",
+        "project_dir",
+        "gcp_project",
+        "dataset",
+        "location",
+    )

+    def __init__(  # type: ignore
+        self,
+        project_dir: str,
+        profile_config: ProfileConfig,
+        location: str,  # This is a mandatory parameter when using BigQueryInsertJobOperator with deferrable=True
+        full_refresh: bool = False,
+        extra_context: dict[str, object] | None = None,
+        configuration: dict[str, object] | None = None,
+        **kwargs,
+    ) -> None:
+        # dbt task param
+        self.project_dir = project_dir
+        self.extra_context = extra_context or {}
+        self.full_refresh = full_refresh
+        self.profile_config = profile_config
+        if not self.profile_config or not self.profile_config.profile_mapping:
+            raise CosmosValueError(f"Cosmos async support is only available when using ProfileMapping")

+        self.profile_type: str = profile_config.get_profile_type()  # type: ignore
+        if self.profile_type not in _SUPPORTED_DATABASES:
+            raise CosmosValueError(f"Async run are only supported: {_SUPPORTED_DATABASES}")

+        # airflow task param
+        self.location = location
+        self.configuration = configuration or {}
+        self.gcp_conn_id = self.profile_config.profile_mapping.conn_id  # type: ignore
+        profile = self.profile_config.profile_mapping.profile
+        self.gcp_project = profile["project"]
+        self.dataset = profile["dataset"]

+        # Cosmos attempts to pass many kwargs that BigQueryInsertJobOperator simply does not accept.
+        # We need to pop them.
+        clean_kwargs = {}
+        non_async_args = set(inspect.signature(AbstractDbtBaseOperator.__init__).parameters.keys())
+        non_async_args |= set(inspect.signature(DbtLocalBaseOperator.__init__).parameters.keys())
+        non_async_args -= {"task_id"}

+        for arg_key, arg_value in kwargs.items():
+            if arg_key not in non_async_args:
+                clean_kwargs[arg_key] = arg_value

+        # The following are the minimum required parameters to run BigQueryInsertJobOperator using the deferrable mode
+        super().__init__(
+            gcp_conn_id=self.gcp_conn_id,
+            configuration=self.configuration,
+            location=self.location,
+            deferrable=True,
+            **clean_kwargs,
+        )

+    def get_remote_sql(self) -> str:
+        if not settings.AIRFLOW_IO_AVAILABLE:
+            raise CosmosValueError(f"Cosmos async support is only available starting in Airflow 2.8 or later.")
+        from airflow.io.path import ObjectStoragePath

+        file_path = self.extra_context["dbt_node_config"]["file_path"]  # type: ignore
+        dbt_dag_task_group_identifier = self.extra_context["dbt_dag_task_group_identifier"]

+        remote_target_path_str = str(remote_target_path).rstrip("/")

+        if TYPE_CHECKING:
+            assert self.project_dir is not None

+        project_dir_parent = str(Path(self.project_dir).parent)
+        relative_file_path = str(file_path).replace(project_dir_parent, "").lstrip("/")
+        remote_model_path = f"{remote_target_path_str}/{dbt_dag_task_group_identifier}/compiled/{relative_file_path}"

+        object_storage_path = ObjectStoragePath(remote_model_path, conn_id=remote_target_path_conn_id)
+        with object_storage_path.open() as fp:  # type: ignore
+            return fp.read()  # type: ignore

+    def drop_table_sql(self) -> None:
+        model_name = self.extra_context["dbt_node_config"]["resource_name"]  # type: ignore
+        sql = f"DROP TABLE IF EXISTS {self.gcp_project}.{self.dataset}.{model_name};"

+        hook = BigQueryHook(
+            gcp_conn_id=self.gcp_conn_id,
+            impersonation_chain=self.impersonation_chain,
+        )
+        self.configuration = {
+            "query": {
+                "query": sql,
+                "useLegacySql": False,
+            }
+        }
+        hook.insert_job(configuration=self.configuration, location=self.location, project_id=self.gcp_project)

+    def execute(self, context: Context) -> Any | None:
+        if not self.full_refresh:
+            raise CosmosValueError("The async execution only supported for full_refresh")
+        else:
+            # It may be surprising to some, but the dbt-core --full-refresh argument fully drops the table before populating it
+            # https://github.com/dbt-labs/dbt-core/blob/5e9f1b515f37dfe6cdae1ab1aa7d190b92490e24/core/dbt/context/base.py#L662-L666
+            # https://docs.getdbt.com/reference/resource-configs/full_refresh#recommendation
+            # We're emulating this behaviour here
+            self.drop_table_sql()
+            sql = self.get_remote_sql()
+            model_name = self.extra_context["dbt_node_config"]["resource_name"]  # type: ignore
+            # prefix explicit create command to create table
+            sql = f"CREATE TABLE {self.gcp_project}.{self.dataset}.{model_name} AS {sql}"
+            self.configuration = {
+                "query": {
+                    "query": sql,
+                    "useLegacySql": False,
+                }
+            }
+            return super().execute(context)


+class DbtTestAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtTestLocalOperator):  # type: ignore
+    pass


-class DbtDocsGCSAirflowAsyncOperator(DbtDocsGCSLocalOperator):
+class DbtRunOperationAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtRunOperationLocalOperator):  # type: ignore
     pass


-class DbtCompileAirflowAsyncOperator(DbtCompileLocalOperator):
+class DbtCompileAirflowAsyncOperator(DbtBaseAirflowAsyncOperator, DbtCompileLocalOperator):  # type: ignore
     pass