Skip to content

Commit

Permalink
fix(ingestion/prefect-plugin): Prefect plugin (datahub-project#10643)
Browse files Browse the repository at this point in the history
Co-authored-by: shubhamjagtap639 <[email protected]>
Co-authored-by: Tamas Nemeth <[email protected]>
  • Loading branch information
3 people authored Aug 29, 2024
1 parent 1a051b1 commit 6204cba
Show file tree
Hide file tree
Showing 29 changed files with 2,567 additions and 4 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/build-and-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,8 @@ jobs:
-x :metadata-ingestion-modules:airflow-plugin:check \
-x :metadata-ingestion-modules:dagster-plugin:build \
-x :metadata-ingestion-modules:dagster-plugin:check \
-x :metadata-ingestion-modules:prefect-plugin:build \
-x :metadata-ingestion-modules:prefect-plugin:check \
-x :metadata-ingestion-modules:gx-plugin:build \
-x :metadata-ingestion-modules:gx-plugin:check \
-x :datahub-frontend:build \
Expand Down Expand Up @@ -138,4 +140,4 @@ jobs:
uses: actions/upload-artifact@v3
with:
name: Event File
path: ${{ github.event_path }}
path: ${{ github.event_path }}
86 changes: 86 additions & 0 deletions .github/workflows/prefect-plugin.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
name: Prefect Plugin
on:
push:
branches:
- master
paths:
- ".github/workflows/prefect-plugin.yml"
- "metadata-ingestion-modules/prefect-plugin/**"
- "metadata-ingestion/**"
- "metadata-models/**"
pull_request:
branches:
- "**"
paths:
- ".github/workflows/prefect-plugin.yml"
- "metadata-ingestion-modules/prefect-plugin/**"
- "metadata-ingestion/**"
- "metadata-models/**"
release:
types: [published]

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
prefect-plugin:
runs-on: ubuntu-latest
env:
SPARK_VERSION: 3.0.3
DATAHUB_TELEMETRY_ENABLED: false
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10"]
include:
- python-version: "3.8"
- python-version: "3.9"
- python-version: "3.10"
fail-fast: false
steps:
- name: Set up JDK 17
uses: actions/setup-java@v3
with:
distribution: "zulu"
java-version: 17
- uses: gradle/gradle-build-action@v2
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
- name: Install dependencies
run: ./metadata-ingestion/scripts/install_deps.sh
- name: Install prefect package
run: ./gradlew :metadata-ingestion-modules:prefect-plugin:lint :metadata-ingestion-modules:prefect-plugin:testQuick
- name: pip freeze show list installed
if: always()
run: source metadata-ingestion-modules/prefect-plugin/venv/bin/activate && pip freeze
- uses: actions/upload-artifact@v3
if: ${{ always() && matrix.python-version == '3.10'}}
with:
name: Test Results (Prefect Plugin ${{ matrix.python-version}})
path: |
**/build/reports/tests/test/**
**/build/test-results/test/**
**/junit.*.xml
!**/binary/**
- name: Upload coverage to Codecov
if: always()
uses: codecov/codecov-action@v3
with:
token: ${{ secrets.CODECOV_TOKEN }}
directory: .
fail_ci_if_error: false
flags: prefect,prefect-${{ matrix.extra_pip_extras }}
name: pytest-prefect-${{ matrix.python-version }}
verbose: true

event-file:
runs-on: ubuntu-latest
steps:
- name: Upload
uses: actions/upload-artifact@v3
with:
name: Event File
path: ${{ github.event_path }}
2 changes: 1 addition & 1 deletion .github/workflows/test-results.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ name: Test Results

on:
workflow_run:
workflows: ["build & test", "metadata ingestion", "Airflow Plugin", "Dagster Plugin", "GX Plugin"]
workflows: ["build & test", "metadata ingestion", "Airflow Plugin", "Dagster Plugin", "Prefect Plugin", "GX Plugin"]
types:
- completed

Expand Down
1 change: 1 addition & 0 deletions docs-website/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,7 @@ task yarnGenerate(type: YarnTask, dependsOn: [yarnInstall,
':metadata-ingestion:buildWheel',
':metadata-ingestion-modules:airflow-plugin:buildWheel',
':metadata-ingestion-modules:dagster-plugin:buildWheel',
':metadata-ingestion-modules:prefect-plugin:buildWheel',
':metadata-ingestion-modules:gx-plugin:buildWheel',
]) {
inputs.files(projectMdFiles)
Expand Down
13 changes: 12 additions & 1 deletion docs-website/filterTagIndexes.json
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@
"tags": {
"Platform Type": "Orchestrator",
"Connection Type": "Pull",
"Features": "Stateful Ingestion, UI Ingestion, Status Aspect"
"Features": "Status Aspect"
}
},
{
Expand Down Expand Up @@ -429,6 +429,17 @@
"Features": "Stateful Ingestion, Lower Casing, Status Aspect"
}
},
{
"Path": "docs/lineage/prefect",
"imgPath": "img/logos/platforms/prefect.svg",
"Title": "Prefect",
"Description": "Prefect is a modern workflow orchestration for data and ML engineers.",
"tags": {
"Platform Type": "Orchestrator",
"Connection Type": "Pull",
"Features": "Status Aspect"
}
},
{
"Path": "docs/generated/ingestion/sources/presto",
"imgPath": "img/logos/platforms/presto.svg",
Expand Down
1 change: 1 addition & 0 deletions docs-website/generateDocsDir.ts
Original file line number Diff line number Diff line change
Expand Up @@ -573,6 +573,7 @@ function copy_python_wheels(): void {
"../metadata-ingestion/dist",
"../metadata-ingestion-modules/airflow-plugin/dist",
"../metadata-ingestion-modules/dagster-plugin/dist",
"../metadata-ingestion-modules/prefect-plugin/dist",
"../metadata-ingestion-modules/gx-plugin/dist",
];

Expand Down
6 changes: 6 additions & 0 deletions docs-website/sidebars.js
Original file line number Diff line number Diff line change
Expand Up @@ -444,6 +444,11 @@ module.exports = {
id: "docs/lineage/openlineage",
label: "OpenLineage",
},
{
type: "doc",
id: "docs/lineage/prefect",
label: "Prefect",
},
{
type: "doc",
id: "metadata-integration/java/acryl-spark-lineage/README",
Expand Down Expand Up @@ -917,6 +922,7 @@ module.exports = {
// "metadata-integration/java/openlineage-converter/README"
//"metadata-ingestion-modules/airflow-plugin/README"
//"metadata-ingestion-modules/dagster-plugin/README"
//"metadata-ingestion-modules/prefect-plugin/README"
//"metadata-ingestion-modules/gx-plugin/README"
// "metadata-ingestion/schedule_docs/datahub", // we can delete this
// TODO: change the titles of these, removing the "What is..." portion from the sidebar"
Expand Down
2 changes: 2 additions & 0 deletions docs-website/src/pages/_components/Logos/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ const platformLogos = [
name: "CouchBase",
imageUrl: "/img/logos/platforms/couchbase.svg",
},
{ name: "Dagster", imageUrl: "/img/logos/platforms/dagster.png" },
{ name: "Databricks", imageUrl: "/img/logos/platforms/databricks.png" },
{ name: "DBT", imageUrl: "/img/logos/platforms/dbt.svg" },
{ name: "Deltalake", imageUrl: "/img/logos/platforms/deltalake.svg" },
Expand Down Expand Up @@ -87,6 +88,7 @@ const platformLogos = [
{ name: "Pinot", imageUrl: "/img/logos/platforms/pinot.svg" },
{ name: "PostgreSQL", imageUrl: "/img/logos/platforms/postgres.svg" },
{ name: "PowerBI", imageUrl: "/img/logos/platforms/powerbi.png" },
{ name: "Prefect", imageUrl: "/img/logos/platforms/prefect.svg" },
{ name: "Presto", imageUrl: "/img/logos/platforms/presto.svg" },
{ name: "Protobuf", imageUrl: "/img/logos/platforms/protobuf.png" },
{ name: "Pulsar", imageUrl: "/img/logos/platforms/pulsar.png" },
Expand Down
1 change: 1 addition & 0 deletions docs-website/static/img/logos/platforms/prefect.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
137 changes: 137 additions & 0 deletions docs/lineage/prefect.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# Prefect Integration with DataHub

## Overview

DataHub supports integration with Prefect, allowing you to ingest:

- Prefect flow and task metadata
- Flow run and Task run information
- Lineage information (when available)

This integration enables you to track and monitor your Prefect workflows within DataHub, providing a comprehensive view of your data pipeline activities.

## Prefect DataHub Block

### What is a Prefect DataHub Block?

Blocks in Prefect are primitives that enable the storage of configuration and provide an interface for interacting with external systems. The `prefect-datahub` block uses the [DataHub REST](../../metadata-ingestion/sink_docs/datahub.md#datahub-rest) emitter to send metadata events while running Prefect flows.

### Prerequisites

1. Use either Prefect Cloud (recommended) or a self-hosted Prefect server.
2. For Prefect Cloud setup, refer to the [Cloud Quickstart](https://docs.prefect.io/latest/getting-started/quickstart/) guide.
3. For self-hosted Prefect server setup, refer to the [Host Prefect Server](https://docs.prefect.io/latest/guides/host/) guide.
4. Ensure the Prefect API URL is set correctly. Verify using:

```shell
prefect profile inspect
```

5. API URL format:
- Prefect Cloud: `https://api.prefect.cloud/api/accounts/<account_id>/workspaces/<workspace_id>`
- Self-hosted: `http://<host>:<port>/api`

## Setup Instructions

### 1. Installation

Install `prefect-datahub` using pip:

```shell
pip install 'prefect-datahub'
```

Note: Requires Python 3.7+

### 2. Saving Configurations to a Block

Save your configuration to the [Prefect block document store](https://docs.prefect.io/latest/concepts/blocks/#saving-blocks):

```python
from prefect_datahub.datahub_emitter import DatahubEmitter

DatahubEmitter(
datahub_rest_url="http://localhost:8080",
env="PROD",
platform_instance="local_prefect"
).save("MY-DATAHUB-BLOCK")
```

Configuration options:

| Config | Type | Default | Description |
|--------|------|---------|-------------|
| datahub_rest_url | `str` | `http://localhost:8080` | DataHub GMS REST URL |
| env | `str` | `PROD` | Environment for assets (see [FabricType](https://datahubproject.io/docs/graphql/enums/#fabrictype)) |
| platform_instance | `str` | `None` | Platform instance for assets (see [Platform Instances](https://datahubproject.io/docs/platform-instances/)) |

### 3. Using the Block in Prefect Workflows

Load and use the saved block in your Prefect workflows:

```python
from prefect import flow, task
from prefect_datahub.dataset import Dataset
from prefect_datahub.datahub_emitter import DatahubEmitter

datahub_emitter = DatahubEmitter.load("MY-DATAHUB-BLOCK")

@task(name="Transform", description="Transform the data")
def transform(data):
data = data.split(" ")
datahub_emitter.add_task(
inputs=[Dataset("snowflake", "mydb.schema.tableA")],
outputs=[Dataset("snowflake", "mydb.schema.tableC")],
)
return data

@flow(name="ETL flow", description="Extract transform load flow")
def etl():
data = transform("This is data")
datahub_emitter.emit_flow()
```

**Note**: To emit tasks, you must call `emit_flow()`. Otherwise, no metadata will be emitted.

## Concept Mapping

| Prefect Concept | DataHub Concept |
|-----------------|-----------------|
| [Flow](https://docs.prefect.io/latest/concepts/flows/) | [DataFlow](https://datahubproject.io/docs/generated/metamodel/entities/dataflow/) |
| [Flow Run](https://docs.prefect.io/latest/concepts/flows/#flow-runs) | [DataProcessInstance](https://datahubproject.io/docs/generated/metamodel/entities/dataprocessinstance) |
| [Task](https://docs.prefect.io/latest/concepts/tasks/) | [DataJob](https://datahubproject.io/docs/generated/metamodel/entities/datajob/) |
| [Task Run](https://docs.prefect.io/latest/concepts/tasks/#tasks) | [DataProcessInstance](https://datahubproject.io/docs/generated/metamodel/entities/dataprocessinstance) |
| [Task Tag](https://docs.prefect.io/latest/concepts/tasks/#tags) | [Tag](https://datahubproject.io/docs/generated/metamodel/entities/tag/) |

## Validation and Troubleshooting

### Validating the Setup

1. Check the Prefect UI's Blocks menu for the DataHub emitter.
2. Run a Prefect workflow and look for DataHub-related log messages:

```text
Emitting flow to datahub...
Emitting tasks to datahub...
```

### Debugging Common Issues

#### Incorrect Prefect API URL

If the Prefect API URL is incorrect, set it manually:

```shell
prefect config set PREFECT_API_URL='http://127.0.0.1:4200/api'
```

#### DataHub Connection Error

If you encounter a `ConnectionError: HTTPConnectionPool(host='localhost', port=8080)`, ensure that your DataHub GMS service is running.

## Additional Resources

- [Prefect Documentation](https://docs.prefect.io/)
- [DataHub Documentation](https://datahubproject.io/docs/)

For more information or support, please refer to the official Prefect and DataHub documentation or reach out to their respective communities.
Loading

0 comments on commit 6204cba

Please sign in to comment.