ci: separate airflow build and test (datahub-project#8688)
Co-authored-by: Harshal Sheth <[email protected]>
mayurinehate and hsheth2 authored Aug 30, 2023
1 parent 1282e5b commit e867dbc
Showing 52 changed files with 2,037 additions and 1,874 deletions.
85 changes: 85 additions & 0 deletions .github/workflows/airflow-plugin.yml
@@ -0,0 +1,85 @@
name: Airflow Plugin
on:
push:
branches:
- master
paths:
- ".github/workflows/airflow-plugin.yml"
- "metadata-ingestion-modules/airflow-plugin/**"
- "metadata-ingestion/**"
- "metadata-models/**"
pull_request:
branches:
- master
paths:
- ".github/**"
- "metadata-ingestion-modules/airflow-plugin/**"
- "metadata-ingestion/**"
- "metadata-models/**"
release:
types: [published]

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
airflow-plugin:
runs-on: ubuntu-latest
env:
SPARK_VERSION: 3.0.3
DATAHUB_TELEMETRY_ENABLED: false
strategy:
matrix:
include:
- python-version: "3.7"
extraPythonRequirement: "apache-airflow~=2.1.0"
- python-version: "3.7"
extraPythonRequirement: "apache-airflow~=2.2.0"
- python-version: "3.10"
extraPythonRequirement: "apache-airflow~=2.4.0"
- python-version: "3.10"
extraPythonRequirement: "apache-airflow~=2.6.0"
- python-version: "3.10"
extraPythonRequirement: "apache-airflow>2.6.0"
fail-fast: false
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
- name: Install dependencies
run: ./metadata-ingestion/scripts/install_deps.sh
- name: Install airflow package and test (extras ${{ matrix.extraPythonRequirement }})
run: ./gradlew -Pextra_pip_requirements='${{ matrix.extraPythonRequirement }}' :metadata-ingestion-modules:airflow-plugin:lint :metadata-ingestion-modules:airflow-plugin:testQuick
- name: pip freeze show list installed
if: always()
run: source metadata-ingestion-modules/airflow-plugin/venv/bin/activate && pip freeze
- uses: actions/upload-artifact@v3
if: ${{ always() && matrix.python-version == '3.10' && matrix.extraPythonRequirement == 'apache-airflow>2.6.0' }}
with:
name: Test Results (Airflow Plugin ${{ matrix.python-version}})
path: |
**/build/reports/tests/test/**
**/build/test-results/test/**
**/junit.*.xml
- name: Upload coverage to Codecov
if: always()
uses: codecov/codecov-action@v3
with:
token: ${{ secrets.CODECOV_TOKEN }}
directory: .
fail_ci_if_error: false
flags: airflow-${{ matrix.python-version }}-${{ matrix.extraPythonRequirement }}
name: pytest-airflow
verbose: true

event-file:
runs-on: ubuntu-latest
steps:
- name: Upload
uses: actions/upload-artifact@v3
with:
name: Event File
path: ${{ github.event_path }}
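
Because the matrix above installs several different Airflow pins, the plugin's tests generally have to gate themselves on the running Airflow version. A minimal sketch of that pattern (the marker, test body, and version threshold are illustrative, not taken from this commit):

```python
import packaging.version
import pytest

import airflow.version

AIRFLOW_VERSION = packaging.version.parse(airflow.version.version)


@pytest.mark.skipif(
    AIRFLOW_VERSION < packaging.version.parse("2.4.0"),
    reason="Datasets were introduced in Airflow 2.4",
)
def test_dataset_uri():
    # Only meaningful on the newer matrix entries (apache-airflow~=2.4.0 and up).
    from airflow.datasets import Dataset

    assert Dataset("s3://bucket/key").uri == "s3://bucket/key"
```

Running a single matrix cell locally mirrors the workflow step above, e.g. `./gradlew -Pextra_pip_requirements='apache-airflow~=2.4.0' :metadata-ingestion-modules:airflow-plugin:testQuick`.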
7 changes: 2 additions & 5 deletions .github/workflows/metadata-ingestion.yml
@@ -42,9 +42,7 @@ jobs:
]
include:
- python-version: "3.7"
extraPythonRequirement: "sqlalchemy==1.3.24 apache-airflow~=2.2.0"
- python-version: "3.10"
extraPythonRequirement: "sqlalchemy~=1.4.0 apache-airflow>=2.4.0"
fail-fast: false
steps:
- uses: actions/checkout@v3
@@ -56,8 +54,8 @@
run: ./metadata-ingestion/scripts/install_deps.sh
- name: Install package
run: ./gradlew :metadata-ingestion:installPackageOnly
- name: Run metadata-ingestion tests (extras ${{ matrix.extraPythonRequirement }})
run: ./gradlew -Pextra_pip_requirements='${{ matrix.extraPythonRequirement }}' :metadata-ingestion:${{ matrix.command }}
- name: Run metadata-ingestion tests
run: ./gradlew :metadata-ingestion:${{ matrix.command }}
- name: pip freeze show list installed
if: always()
run: source metadata-ingestion/venv/bin/activate && pip freeze
@@ -80,7 +78,6 @@
name: pytest-${{ matrix.command }}
verbose: true


event-file:
runs-on: ubuntu-latest
steps:
2 changes: 1 addition & 1 deletion .github/workflows/test-results.yml
@@ -2,7 +2,7 @@ name: Test Results

on:
workflow_run:
workflows: ["build & test", "metadata ingestion"]
workflows: ["build & test", "metadata ingestion", "Airflow Plugin"]
types:
- completed

6 changes: 3 additions & 3 deletions docs/lineage/airflow.md
@@ -65,7 +65,7 @@ lazy_load_plugins = False
| datahub.capture_executions | true | If true, we'll capture task runs in DataHub in addition to DAG definitions. |
| datahub.graceful_exceptions | true | If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions. |

5. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
5. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
6. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.

### How to validate installation
@@ -160,14 +160,14 @@ pip install acryl-datahub[airflow,datahub-kafka]
- `capture_executions` (defaults to false): If true, it captures task runs as DataHub DataProcessInstances.
- `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.

4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
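
A condensed sketch of what those demo DAGs configure (the DAG and dataset names are placeholders, and the `Dataset` entity helper import is assumed to match the demo DAGs):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

from datahub_provider.entities import Dataset

with DAG(
    dag_id="datahub_lineage_backend_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # The lineage backend reads inlets/outlets off each task and emits
    # dataset-level lineage to DataHub when the task runs.
    BashOperator(
        task_id="transform",
        bash_command="echo transform",
        inlets=[Dataset(platform="snowflake", name="mydb.schema.src_table")],
        outlets=[Dataset(platform="snowflake", name="mydb.schema.dst_table")],
    )
```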

## Emitting lineage via a separate operator

Take a look at this sample DAG:

- [`lineage_emission_dag.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.
- [`lineage_emission_dag.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.

In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See step 1 above for details.
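
The core of that sample DAG boils down to something like the following sketch (not the full example; the connection id assumes a DataHub REST hook configured as described in step 1):

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub_provider.operators.datahub import DatahubEmitterOperator

# Emits an upstream->downstream lineage edge as an explicit task, independent
# of the lineage backend.
emit_lineage = DatahubEmitterOperator(
    task_id="emit_lineage",
    datahub_conn_id="datahub_rest_default",
    mces=[
        make_lineage_mce(
            upstream_urns=[make_dataset_urn("snowflake", "mydb.schema.src_table")],
            downstream_urn=make_dataset_urn("snowflake", "mydb.schema.dst_table"),
        )
    ],
)
```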

59 changes: 38 additions & 21 deletions metadata-ingestion-modules/airflow-plugin/build.gradle
@@ -7,47 +7,62 @@ ext {
venv_name = 'venv'
}

if (!project.hasProperty("extra_pip_requirements")) {
ext.extra_pip_requirements = ""
}

def pip_install_command = "${venv_name}/bin/pip install -e ../../metadata-ingestion"

task checkPythonVersion(type: Exec) {
commandLine python_executable, '-c', 'import sys; assert sys.version_info >= (3, 7)'
}

task environmentSetup(type: Exec, dependsOn: checkPythonVersion) {
def sentinel_file = "${venv_name}/.venv_environment_sentinel"
inputs.file file('setup.py')
outputs.dir("${venv_name}")
commandLine 'bash', '-c', "${python_executable} -m venv ${venv_name} && ${venv_name}/bin/python -m pip install --upgrade pip wheel 'setuptools>=63.0.0'"
outputs.file(sentinel_file)
commandLine 'bash', '-c',
"${python_executable} -m venv ${venv_name} &&" +
"${venv_name}/bin/python -m pip install --upgrade pip wheel 'setuptools>=63.0.0' && " +
"touch ${sentinel_file}"
}

task installPackage(type: Exec, dependsOn: environmentSetup) {
task installPackage(type: Exec, dependsOn: [environmentSetup, ':metadata-ingestion:codegen']) {
def sentinel_file = "${venv_name}/.build_install_package_sentinel"
inputs.file file('setup.py')
outputs.dir("${venv_name}")
outputs.file(sentinel_file)
// Workaround for https://github.com/yaml/pyyaml/issues/601.
// See https://github.com/yaml/pyyaml/issues/601#issuecomment-1638509577.
// and https://github.com/datahub-project/datahub/pull/8435.
commandLine 'bash', '-x', '-c',
"${pip_install_command} install 'Cython<3.0' 'PyYAML<6' --no-build-isolation && " +
"${pip_install_command} -e ."
"${pip_install_command} -e . ${extra_pip_requirements} &&" +
"touch ${sentinel_file}"
}

task install(dependsOn: [installPackage])

task installDev(type: Exec, dependsOn: [install]) {
def sentinel_file = "${venv_name}/.build_install_dev_sentinel"
inputs.file file('setup.py')
outputs.dir("${venv_name}")
outputs.file("${venv_name}/.build_install_dev_sentinel")
outputs.file("${sentinel_file}")
commandLine 'bash', '-x', '-c',
"${pip_install_command} -e .[dev] && touch ${venv_name}/.build_install_dev_sentinel"
"${pip_install_command} -e .[dev] ${extra_pip_requirements} && " +
"touch ${sentinel_file}"
}

task lint(type: Exec, dependsOn: installDev) {
/*
The find/sed combo below is a temporary work-around for the following mypy issue with airflow 2.2.0:
"venv/lib/python3.8/site-packages/airflow/_vendor/connexion/spec.py:169: error: invalid syntax".
*/
commandLine 'bash', '-x', '-c',
commandLine 'bash', '-c',
"find ${venv_name}/lib -path *airflow/_vendor/connexion/spec.py -exec sed -i.bak -e '169,169s/ # type: List\\[str\\]//g' {} \\; && " +
"source ${venv_name}/bin/activate && black --check --diff src/ tests/ && isort --check --diff src/ tests/ && flake8 --count --statistics src/ tests/ && mypy src/ tests/"
"source ${venv_name}/bin/activate && set -x && " +
"black --check --diff src/ tests/ && " +
"isort --check --diff src/ tests/ && " +
"flake8 --count --statistics src/ tests/ && " +
"mypy --show-traceback --show-error-codes src/ tests/"
}
task lintFix(type: Exec, dependsOn: installDev) {
commandLine 'bash', '-x', '-c',
@@ -58,21 +73,13 @@ task lintFix(type: Exec, dependsOn: installDev) {
"mypy src/ tests/ "
}

task testQuick(type: Exec, dependsOn: installDev) {
// We can't enforce the coverage requirements if we run a subset of the tests.
inputs.files(project.fileTree(dir: "src/", include: "**/*.py"))
inputs.files(project.fileTree(dir: "tests/"))
outputs.dir("${venv_name}")
commandLine 'bash', '-x', '-c',
"source ${venv_name}/bin/activate && pytest -vv --continue-on-collection-errors --junit-xml=junit.quick.xml"
}

task installDevTest(type: Exec, dependsOn: [installDev]) {
def sentinel_file = "${venv_name}/.build_install_dev_test_sentinel"
inputs.file file('setup.py')
outputs.dir("${venv_name}")
outputs.file("${venv_name}/.build_install_dev_test_sentinel")
outputs.file("${sentinel_file}")
commandLine 'bash', '-x', '-c',
"${pip_install_command} -e .[dev,integration-tests] && touch ${venv_name}/.build_install_dev_test_sentinel"
"${pip_install_command} -e .[dev,integration-tests] && touch ${sentinel_file}"
}

def testFile = hasProperty('testFile') ? testFile : 'unknown'
@@ -89,6 +96,16 @@ task testSingle(dependsOn: [installDevTest]) {
}
}

task testQuick(type: Exec, dependsOn: installDevTest) {
// We can't enforce the coverage requirements if we run a subset of the tests.
inputs.files(project.fileTree(dir: "src/", include: "**/*.py"))
inputs.files(project.fileTree(dir: "tests/"))
outputs.dir("${venv_name}")
commandLine 'bash', '-x', '-c',
"source ${venv_name}/bin/activate && pytest -vv --continue-on-collection-errors --junit-xml=junit.quick.xml"
}


task testFull(type: Exec, dependsOn: [testQuick, installDevTest]) {
commandLine 'bash', '-x', '-c',
"source ${venv_name}/bin/activate && pytest -m 'not slow_integration' -vv --continue-on-collection-errors --junit-xml=junit.full.xml"
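
The sentinel files threaded through these tasks give Gradle a concrete output to timestamp: each task touches its sentinel on success, so Gradle re-runs the task only when a declared input (here, setup.py) is newer. A rough model of that up-to-date check, sketched in Python purely for illustration:

```python
import os


def needs_rerun(sentinel: str, inputs: list) -> bool:
    # The task must run if it has never succeeded (no sentinel yet) or if any
    # declared input changed after the last successful run.
    if not os.path.exists(sentinel):
        return True
    stamp = os.path.getmtime(sentinel)
    return any(
        os.path.exists(path) and os.path.getmtime(path) > stamp for path in inputs
    )


print(needs_rerun("venv/.build_install_package_sentinel", ["setup.py"]))
```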
1 change: 0 additions & 1 deletion metadata-ingestion-modules/airflow-plugin/pyproject.toml
@@ -9,7 +9,6 @@ extend-exclude = '''
^/tmp
'''
include = '\.pyi?$'
target-version = ['py36', 'py37', 'py38']

[tool.isort]
indent = ' '
4 changes: 3 additions & 1 deletion metadata-ingestion-modules/airflow-plugin/setup.cfg
@@ -69,4 +69,6 @@ exclude_lines =
pragma: no cover
@abstract
if TYPE_CHECKING:
#omit =
omit =
# omit example dags
src/datahub_airflow_plugin/example_dags/*
24 changes: 18 additions & 6 deletions metadata-ingestion-modules/airflow-plugin/setup.py
@@ -13,16 +13,21 @@ def get_long_description():
return pathlib.Path(os.path.join(root, "README.md")).read_text()


rest_common = {"requests", "requests_file"}

base_requirements = {
# Compatibility.
"dataclasses>=0.6; python_version < '3.7'",
"typing_extensions>=3.10.0.2",
# Typing extension should be >=3.10.0.2 ideally but we can't restrict due to Airflow 2.0.2 dependency conflict
"typing_extensions>=3.7.4.3 ; python_version < '3.8'",
"typing_extensions>=3.10.0.2,<4.6.0 ; python_version >= '3.8'",
"mypy_extensions>=0.4.3",
# Actual dependencies.
"typing-inspect",
"pydantic>=1.5.1",
"apache-airflow >= 2.0.2",
f"acryl-datahub[airflow] == {package_metadata['__version__']}",
*rest_common,
f"acryl-datahub == {package_metadata['__version__']}",
}


@@ -47,19 +52,18 @@ def get_long_description():
base_dev_requirements = {
*base_requirements,
*mypy_stubs,
"black>=21.12b0",
"black==22.12.0",
"coverage>=5.1",
"flake8>=3.8.3",
"flake8-tidy-imports>=4.3.0",
"isort>=5.7.0",
"mypy>=0.920",
"mypy>=1.4.0",
# pydantic 1.8.2 is incompatible with mypy 0.910.
# See https://github.com/samuelcolvin/pydantic/pull/3175#issuecomment-995382910.
"pydantic>=1.9.0",
"pydantic>=1.10",
"pytest>=6.2.2",
"pytest-asyncio>=0.16.0",
"pytest-cov>=2.8.1",
"pytest-docker>=0.10.3,<0.12",
"tox",
"deepdiff",
"requests-mock",
@@ -127,5 +131,13 @@ def get_long_description():
"datahub-kafka": [
f"acryl-datahub[datahub-kafka] == {package_metadata['__version__']}"
],
"integration-tests": [
f"acryl-datahub[datahub-kafka] == {package_metadata['__version__']}",
# Extra requirements for Airflow.
"apache-airflow[snowflake]>=2.0.2", # snowflake is used in example dags
# Because of https://github.com/snowflakedb/snowflake-sqlalchemy/issues/350 we need to restrict SQLAlchemy's max version.
"SQLAlchemy<1.4.42",
"virtualenv", # needed by PythonVirtualenvOperator
],
},
)
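
The split `typing_extensions` pins rely on PEP 508 environment markers to select the right bound per interpreter. A quick way to sanity-check such a marker (using the `packaging` library; not part of this commit):

```python
from packaging.markers import Marker
from packaging.requirements import Requirement

req = Requirement("typing_extensions>=3.10.0.2,<4.6.0 ; python_version >= '3.8'")
print(req.specifier)          # the version bounds
print(req.marker.evaluate())  # True on Python 3.10, False on Python 3.7
print(Marker("python_version < '3.8'").evaluate())
```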
12 changes: 12 additions & 0 deletions metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_compat.py
@@ -0,0 +1,12 @@
# This module must be imported before any Airflow imports in any of our files.
# The AIRFLOW_PATCHED just helps avoid flake8 errors.

from datahub.utilities._markupsafe_compat import MARKUPSAFE_PATCHED

assert MARKUPSAFE_PATCHED

AIRFLOW_PATCHED = True

__all__ = [
"AIRFLOW_PATCHED",
]
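
The module is import-order sensitive by design: consumers import it before anything from Airflow, then assert the flag so linters don't drop the "unused" import. A sketch of the intended usage in a downstream module:

```python
# This import must come first: it applies the markupsafe compatibility patch
# before Airflow gets imported.
from datahub_airflow_plugin._airflow_compat import AIRFLOW_PATCHED

assert AIRFLOW_PATCHED

from airflow.models import DAG  # noqa: E402  (deliberately after the patch)
```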
29 changes: 29 additions & 0 deletions metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/_airflow_shims.py
@@ -0,0 +1,29 @@
from airflow.models.baseoperator import BaseOperator

from datahub_airflow_plugin._airflow_compat import AIRFLOW_PATCHED

try:
from airflow.models.mappedoperator import MappedOperator
from airflow.models.operator import Operator
from airflow.operators.empty import EmptyOperator
except ModuleNotFoundError:
# Operator isn't a real class, but rather a type alias defined
# as the union of BaseOperator and MappedOperator.
# Since older versions of Airflow don't have MappedOperator, we can just use BaseOperator.
Operator = BaseOperator # type: ignore
MappedOperator = None # type: ignore
from airflow.operators.dummy import DummyOperator as EmptyOperator # type: ignore

try:
from airflow.sensors.external_task import ExternalTaskSensor
except ImportError:
from airflow.sensors.external_task_sensor import ExternalTaskSensor # type: ignore

assert AIRFLOW_PATCHED

__all__ = [
"Operator",
"MappedOperator",
"EmptyOperator",
"ExternalTaskSensor",
]
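
A sketch of how downstream code can stay version-agnostic by consuming these shims instead of importing Airflow classes directly (the helper function is illustrative, not from this commit):

```python
from datahub_airflow_plugin._airflow_shims import MappedOperator, Operator


def is_mapped(task: "Operator") -> bool:
    # On Airflow releases that predate dynamic task mapping, MappedOperator is
    # shimmed to None, so nothing can be an instance of it.
    return MappedOperator is not None and isinstance(task, MappedOperator)
```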