forked from datahub-project/datahub
feat(spark/openlineage): Use Openlineage 1.13.1 in Spark Plugin (datahub-project#10433)
- Use Openlineage 1.13.1 in Spark Plugin
- Add retry option to datahub client and Spark Plugin
- Add OpenLineage integration doc
Showing 25 changed files with 785 additions and 1,195 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
# OpenLineage | ||
|
||
DataHub now supports [OpenLineage](https://openlineage.io/) integration. With this support, DataHub can ingest and display lineage information from various data processing frameworks, giving users a comprehensive view of their data pipelines. | ||
|
||
## Features | ||
|
||
- **REST Endpoint Support**: DataHub now includes a REST endpoint that can understand OpenLineage events. This allows users to send lineage information directly to DataHub, enabling easy integration with various data processing frameworks. | ||
|
||
- **[Spark Event Listener Plugin](https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta)**: DataHub provides a Spark Event Listener plugin that seamlessly integrates OpenLineage's Spark plugin. This plugin enhances DataHub's OpenLineage support by offering additional features such as PathSpec support, column-level lineage, patch support and more. | ||
|
||
## OpenLineage Support with DataHub | ||
|
||
### 1. REST Endpoint Support | ||
|
||
DataHub's REST endpoint allows users to send OpenLineage events directly to DataHub. This enables easy integration with various data processing frameworks, providing users with a centralized location for viewing and managing data lineage information. | ||
|
||
For Spark and Airflow, we recommend using DataHub's Spark Lineage plugin or DataHub's Airflow plugin instead of the generic REST endpoint, because they integrate more tightly with DataHub. | ||
|
||
#### How to Use | ||
|
||
To send OpenLineage messages to DataHub using the REST endpoint, simply make a POST request to the following endpoint: | ||
|
||
``` | ||
POST GMS_SERVER_HOST:GMS_PORT/openapi/openlineage/api/v1/lineage | ||
``` | ||
|
||
Include the OpenLineage message in the request body in JSON format. | ||
|
||
Example: | ||
|
||
```json | ||
{ | ||
  "eventType": "START", | ||
  "eventTime": "2020-12-28T19:52:00.001+10:00", | ||
  "run": { | ||
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" | ||
  }, | ||
  "job": { | ||
    "namespace": "workshop", | ||
    "name": "process_taxes" | ||
  }, | ||
  "inputs": [ | ||
    { | ||
      "namespace": "postgres://workshop-db:None", | ||
      "name": "workshop.public.taxes", | ||
      "facets": { | ||
        "dataSource": { | ||
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow", | ||
          "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DataSourceDatasetFacet", | ||
          "name": "postgres://workshop-db:None", | ||
          "uri": "workshop-db" | ||
        } | ||
      } | ||
    } | ||
  ], | ||
  "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client" | ||
} | ||
``` | ||
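
The request described above can be sketched in Java with the JDK's built-in `java.net.http` client. This is a minimal illustration, not DataHub's own client: the class name `OpenLineagePostSketch`, the host and port, and the token are hypothetical, and the bearer-token header assumes your DataHub instance has token authentication enabled. The endpoint URL is passed in as a parameter; the value used in `main` is a placeholder composed from the `url` and `endpoint` fields of the Airflow transport shown in this document, so substitute the endpoint of your own deployment.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

class OpenLineagePostSketch {

  /** Builds (but does not send) a POST request carrying one OpenLineage event as JSON. */
  static HttpRequest buildRequest(String endpointUrl, String token, String eventJson) {
    return HttpRequest.newBuilder()
        .uri(URI.create(endpointUrl))
        .timeout(Duration.ofSeconds(30))
        .header("Content-Type", "application/json")
        // Assumption: the server expects a DataHub access token as a bearer token.
        .header("Authorization", "Bearer " + token)
        .POST(HttpRequest.BodyPublishers.ofString(eventJson))
        .build();
  }

  public static void main(String[] args) {
    // Placeholder endpoint and token; replace with the values for your deployment.
    String endpoint = "http://localhost:8080/openapi/openlineage/api/v1/lineage";
    String event =
        "{\"eventType\":\"START\","
            + "\"eventTime\":\"2020-12-28T19:52:00.001+10:00\","
            + "\"run\":{\"runId\":\"d46e465b-d358-4d32-83d4-df660ff614dd\"},"
            + "\"job\":{\"namespace\":\"workshop\",\"name\":\"process_taxes\"},"
            + "\"producer\":\"https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client\"}";

    HttpRequest request = buildRequest(endpoint, "your-datahub-api-key", event);
    System.out.println(request.method() + " " + request.uri());
    // To actually send it:
    // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```

Building the request separately from sending it keeps the sketch runnable without a DataHub server; the commented-out `send` call is the only network step.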
##### How to set up Airflow | ||
Follow the Airflow OpenLineage provider [user guide](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html) to set up your Airflow DAGs to send lineage information to DataHub. | ||
The transport should look like this: | ||
```json | ||
{ | ||
  "type": "http", | ||
  "url": "https://GMS_SERVER_HOST:GMS_PORT/openapi/openlineage/", | ||
  "endpoint": "api/v1/lineage", | ||
  "auth": { | ||
    "type": "api_key", | ||
    "api_key": "your-datahub-api-key" | ||
  } | ||
} | ||
``` | ||
|
||
#### Known Limitations | ||
For Spark and Airflow, we recommend using DataHub's Spark Lineage plugin or DataHub's Airflow plugin for tighter integration with DataHub. The REST endpoint currently has the following limitations: | ||
|
||
- **[PathSpec](https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta/#configuring-hdfs-based-dataset-urns) Support**: While the REST endpoint supports OpenLineage messages, full [PathSpec](https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta/#configuring-hdfs-based-dataset-urns) support is not yet available. | ||
|
||
- **Column-level Lineage**: DataHub's current OpenLineage support does not provide full column-level lineage tracking. | ||
- etc... | ||
### 2. Spark Event Listener Plugin | ||
|
||
DataHub's Spark Event Listener plugin enhances OpenLineage support by providing additional features such as PathSpec support, column-level lineage, and more. | ||
|
||
#### How to Use | ||
|
||
Follow the guide on the [Spark Lineage plugin page](https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta) to set up the Spark Lineage plugin. | ||
|
||
## References | ||
|
||
- [OpenLineage](https://openlineage.io/) | ||
- [DataHub OpenAPI Guide](../api/openapi/openapi-usage-guide.md) | ||
- [DataHub Spark Lineage Plugin](https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta) |
54 changes: 54 additions & 0 deletions
...ava/datahub-client/src/main/java/datahub/client/rest/DatahubHttpRequestRetryStrategy.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
package datahub.client.rest; | ||
|
||
import java.io.IOException; | ||
import java.io.InterruptedIOException; | ||
import java.net.ConnectException; | ||
import java.net.NoRouteToHostException; | ||
import java.net.UnknownHostException; | ||
import java.util.Arrays; | ||
import javax.net.ssl.SSLException; | ||
import lombok.extern.slf4j.Slf4j; | ||
import org.apache.hc.client5.http.impl.DefaultHttpRequestRetryStrategy; | ||
import org.apache.hc.core5.http.ConnectionClosedException; | ||
import org.apache.hc.core5.http.HttpRequest; | ||
import org.apache.hc.core5.http.HttpResponse; | ||
import org.apache.hc.core5.http.HttpStatus; | ||
import org.apache.hc.core5.http.protocol.HttpContext; | ||
import org.apache.hc.core5.util.TimeValue; | ||
|
||
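/** Retry strategy for the DataHub REST client, built on HttpClient 5's default strategy. */ | ||
@Slf4j | ||
public class DatahubHttpRequestRetryStrategy extends DefaultHttpRequestRetryStrategy { | ||
  /** Defaults to a single retry with a 10-second retry interval. */ | ||
  public DatahubHttpRequestRetryStrategy() { | ||
    this(1, TimeValue.ofSeconds(10)); | ||
  } | ||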
|
||
  public DatahubHttpRequestRetryStrategy(int maxRetries, TimeValue retryInterval) { | ||
    super( | ||
        maxRetries, | ||
        retryInterval, | ||
        // Exception types the parent class treats as non-retriable. | ||
        Arrays.asList( | ||
            InterruptedIOException.class, | ||
            UnknownHostException.class, | ||
            ConnectException.class, | ||
            ConnectionClosedException.class, | ||
            NoRouteToHostException.class, | ||
            SSLException.class), | ||
        // HTTP status codes that trigger a retry (429, 503, 500). | ||
        Arrays.asList( | ||
            HttpStatus.SC_TOO_MANY_REQUESTS, | ||
            HttpStatus.SC_SERVICE_UNAVAILABLE, | ||
            HttpStatus.SC_INTERNAL_SERVER_ERROR)); | ||
  } | ||
|
||
  @Override | ||
  public boolean retryRequest( | ||
      HttpRequest request, IOException exception, int execCount, HttpContext context) { | ||
    log.warn("Checking if retry is needed: {}", execCount); | ||
    return super.retryRequest(request, exception, execCount, context); | ||
  } | ||
|
||
  @Override | ||
  public boolean retryRequest(HttpResponse response, int execCount, HttpContext context) { | ||
    // Logged before the parent decides, so this line also appears when no retry follows. | ||
    log.warn("Retrying request due to error: {}", response); | ||
    return super.retryRequest(response, execCount, context); | ||
  }
} |
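
The response-based retry decision that this class configures can be illustrated with a dependency-free sketch. It is not the HttpClient 5 implementation: `RetryPolicySketch` and `shouldRetry` are hypothetical names, and the rule shown (retry while the attempt count has not exceeded `maxRetries` and the status code is in the retriable set) is an assumption about how the parent class evaluates responses. The status codes mirror the ones passed to `super(...)` above.

```java
import java.time.Duration;
import java.util.Set;

class RetryPolicySketch {
  // Same codes the strategy above registers as retriable: 429, 503, 500.
  private static final Set<Integer> RETRIABLE_CODES = Set.of(429, 503, 500);

  private final int maxRetries;
  private final Duration retryInterval;

  RetryPolicySketch(int maxRetries, Duration retryInterval) {
    this.maxRetries = maxRetries;
    this.retryInterval = retryInterval;
  }

  /**
   * Decides whether a response with the given status should be retried.
   *
   * @param statusCode HTTP status code of the response
   * @param execCount number of attempts made so far (1 after the first attempt)
   */
  boolean shouldRetry(int statusCode, int execCount) {
    return execCount <= maxRetries && RETRIABLE_CODES.contains(statusCode);
  }

  /** Fixed wait between attempts, as configured. */
  Duration retryInterval() {
    return retryInterval;
  }

  public static void main(String[] args) {
    // Same defaults as the no-argument constructor above: 1 retry, 10 seconds.
    RetryPolicySketch policy = new RetryPolicySketch(1, Duration.ofSeconds(10));
    System.out.println("503 after attempt 1: " + policy.shouldRetry(503, 1));
    System.out.println("503 after attempt 2: " + policy.shouldRetry(503, 2));
    System.out.println("404 after attempt 1: " + policy.shouldRetry(404, 1));
  }
}
```

With the defaults, a retriable status is retried once and then given up on, while a non-retriable status such as 404 is never retried.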