-4. Slice and dice the result list by Entity Type, Platfrom, Owner, and more to isolate the relevant dependencies
+4. Slice and dice the result list by Entity Type, Platform, Owner, and more to isolate the relevant dependencies
@@ -92,4 +92,4 @@ We currently limit the list of dependencies to 10,000 records; we suggest applyi
### Related Features
-* [DataHub Lineage](../lineage/lineage-feature-guide.md)
+* [DataHub Lineage](../generated/lineage/lineage-feature-guide.md)
diff --git a/docs/api/tutorials/lineage.md b/docs/api/tutorials/lineage.md
index dc43cb178f949..4baad09099d07 100644
--- a/docs/api/tutorials/lineage.md
+++ b/docs/api/tutorials/lineage.md
@@ -6,7 +6,8 @@ import TabItem from '@theme/TabItem';
## Why Would You Use Lineage?
Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream.
-For more information about lineage, refer to [About DataHub Lineage](/docs/lineage/lineage-feature-guide.md).
+
+For more information about lineage, refer to [About DataHub Lineage](/docs/generated/lineage/lineage-feature-guide.md).
### Goal Of This Guide
diff --git a/docs/browse.md b/docs/browse.md
deleted file mode 100644
index 55a3b16a0a552..0000000000000
--- a/docs/browse.md
+++ /dev/null
@@ -1,56 +0,0 @@
-import FeatureAvailability from '@site/src/components/FeatureAvailability';
-
-# About DataHub Browse
-
-
-
-Browse is one of the primary entrypoints for discovering different Datasets, Dashboards, Charts and other DataHub Entities.
-
-Browsing is useful for finding data entities based on a hierarchical structure set in the source system. Generally speaking, that hierarchy will contain the following levels:
-
-* Entity Type (Dataset, Dashboard, Chart, etc.)
-* Environment (prod vs. dev)
-* Platform Type (Snowflake, dbt, Looker, etc.)
-* Container (Warehouse, Schema, Folder, etc.)
-* Entity Name
-
-For example, a user can easily browse for Datasets within the PROD Snowflake environment, the long_tail_companions warehouse, and the analytics schema:
-
-
-
-
-
-## Using Browse
-
-Browse is accessible by clicking on an Entity Type on the front page of the DataHub UI.
-
-
-
-
-This will take you into the folder explorer view for browse in which you can drill down to your desired sub categories to find the data you are looking for.
-
-
-
-
-## Additional Resources
-
-### GraphQL
-
-* [browse](../graphql/queries.md#browse)
-* [browsePaths](../graphql/queries.md#browsePaths)
-
-## FAQ and Troubleshooting
-
-**How are BrowsePaths created?**
-
-BrowsePaths are automatically created for ingested entities based on separator characters that appear within an Urn.
-
-**How can I customize browse paths?**
-
-BrowsePaths are an Aspect similar to other components of an Entity. They can be customized by ingesting custom paths for specified Urns.
-
-*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!*
-
-### Related Features
-
-* [Search](./how/search.md)
diff --git a/docs/features/feature-guides/ui-lineage.md b/docs/features/feature-guides/ui-lineage.md
new file mode 100644
index 0000000000000..18e4f77e793b2
--- /dev/null
+++ b/docs/features/feature-guides/ui-lineage.md
@@ -0,0 +1,58 @@
+# Managing Lineage via UI
+
+## Viewing Lineage
+The UI shows the latest version of the lineage. The time picker can be used to filter out edges within the latest version to exclude those that were last updated outside of the time window. Selecting time windows in the past will not show you historical lineages. It will only filter the view of the latest version of the lineage.
+
+## Editing from Lineage Graph View
+
+The first place that you can edit lineage for entities is from the Lineage Visualization screen. Click on the "Lineage" button on the top right of an entity's profile to get to this view.
+
+
+
+
+
+Once you find the entity that you want to edit the lineage of, click on the three-dot menu dropdown to select whether you want to edit lineage in the upstream direction or the downstream direction.
+
+
+
+
+
+If you want to edit upstream lineage for entities downstream of the center node or downstream lineage for entities upstream of the center node, you can simply re-center to focus on the node you want to edit. Once focused on the desired node, you can edit lineage in either direction.
+
+
+
+
+
+### Adding Lineage Edges
+
+Once you click "Edit Upstream" or "Edit Downstream," a modal will open that allows you to manage lineage for the selected entity in the chosen direction. In order to add a lineage edge to a new entity, search for it by name in the provided search bar and select it. Once you're satisfied with everything you've added, click "Save Changes." If you change your mind, you can always cancel or exit without saving the changes you've made.
+
+
+
+
+
+### Removing Lineage Edges
+
+You can remove lineage edges from the same modal used to add lineage edges. Find the edge(s) that you want to remove, and click the "X" on the right side. Just like when adding edges, you need to click "Save Changes" to save; if you exit without saving, your changes won't be applied.
+
+
+
+
+
+### Reviewing Changes
+
+Any time lineage is edited manually, we keep track of who made the change and when they made it. You can see this information in the modal where you add and remove edges. If an edge was added manually, a user avatar will be in line with the edge that was added. You can hover over this avatar in order to see who added it and when.
+
+
+
+
+
+## Editing from Lineage Tab
+
+The other place that you can edit lineage for entities is from the Lineage Tab on an entity's profile. Click on the "Lineage" tab in an entity's profile and then find the "Edit" dropdown that allows you to edit upstream or downstream lineage for the given entity.
+
+
+
+
+
+Using the modal from this view will work the same as described above for editing from the Lineage Visualization screen.
\ No newline at end of file
diff --git a/docs/how/add-custom-data-platform.md b/docs/how/add-custom-data-platform.md
index a4ea32af455c1..5dcd423e77569 100644
--- a/docs/how/add-custom-data-platform.md
+++ b/docs/how/add-custom-data-platform.md
@@ -77,7 +77,7 @@ datahub put platform --name MyCustomDataPlatform --display_name "My Custom Data
source:
type: "file"
config:
- filename: "./my-custom-data-platform.json"
+ path: "./my-custom-data-platform.json"
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
diff --git a/docs/how/add-user-data.md b/docs/how/add-user-data.md
index ea76c97163ddd..035821ab75879 100644
--- a/docs/how/add-user-data.md
+++ b/docs/how/add-user-data.md
@@ -57,7 +57,7 @@ Define an [ingestion recipe](https://datahubproject.io/docs/metadata-ingestion/#
source:
type: "file"
config:
- filename: "./my-user.json"
+ path: "./my-user.json"
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
diff --git a/docs/how/updating-datahub.md b/docs/how/updating-datahub.md
index 9b19291ee246a..4df8d435cf1c4 100644
--- a/docs/how/updating-datahub.md
+++ b/docs/how/updating-datahub.md
@@ -5,7 +5,10 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
## Next
### Breaking Changes
+
- #8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
+- #8853 - The Airflow plugin no longer supports Airflow 2.0.x or Python 3.7. See the docs for more details.
+- #8853 - Introduced the Airflow plugin v2. If you're using Airflow 2.3+, the v2 plugin will be enabled by default, and so you'll need to switch your requirements to include `pip install 'acryl-datahub-airflow-plugin[plugin-v2]'`. To continue using the v1 plugin, set the `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN` environment variable to `true`.
### Potential Downtime
diff --git a/docs/lineage/airflow.md b/docs/lineage/airflow.md
index 49de5352f6d58..19ed1598d4c5a 100644
--- a/docs/lineage/airflow.md
+++ b/docs/lineage/airflow.md
@@ -1,74 +1,137 @@
# Airflow Integration
-DataHub supports integration of
+:::note
-- Airflow Pipeline (DAG) metadata
-- DAG and Task run information as well as
-- Lineage information when present
+If you're looking to schedule DataHub ingestion using Airflow, see the guide on [scheduling ingestion with Airflow](../../metadata-ingestion/schedule_docs/airflow.md).
-You can use either the DataHub Airflow lineage plugin (recommended) or the Airflow lineage backend (deprecated).
+:::
-## Using Datahub's Airflow lineage plugin
+The DataHub Airflow plugin supports:
-:::note
+- Automatic column-level lineage extraction from various operators, e.g. `SqlOperator`s (including `MySqlOperator`, `PostgresOperator`, `SnowflakeOperator`, and more), `S3FileTransformOperator`, and a few others.
+- Airflow DAG and tasks, including properties, ownership, and tags.
+- Task run information, including task successes and failures.
+- Manual lineage annotations using `inlets` and `outlets` on Airflow operators.
-The Airflow lineage plugin is only supported with Airflow version >= 2.0.2 or on MWAA with an Airflow version >= 2.0.2.
+There are two actively supported implementations of the plugin, with different Airflow version support.
-If you're using Airflow 1.x, use the Airflow lineage plugin with acryl-datahub-airflow-plugin <= 0.9.1.0.
+| Approach | Airflow Version | Notes |
+| --------- | --------------- | --------------------------------------------------------------------------- |
+| Plugin v2 | 2.3+ | Recommended. Requires Python 3.8+ |
+| Plugin v1 | 2.1+ | No automatic lineage extraction; may not extract lineage if the task fails. |
-:::
+If you're using Airflow older than 2.1, it's possible to use the v1 plugin with older versions of `acryl-datahub-airflow-plugin`. See the [compatibility section](#compatibility) for more details.
-This plugin registers a task success/failure callback on every task with a cluster policy and emits DataHub events from that. This allows this plugin to be able to register both task success as well as failures compared to the older Airflow Lineage Backend which could only support emitting task success.
+
+
-### Setup
+## DataHub Plugin v2
-1. You need to install the required dependency in your airflow.
+### Installation
+
+The v2 plugin requires Airflow 2.3+ and Python 3.8+. If you don't meet these requirements, use the v1 plugin instead.
```shell
-pip install acryl-datahub-airflow-plugin
+pip install 'acryl-datahub-airflow-plugin[plugin-v2]'
```
-:::note
+### Configuration
-The [DataHub Rest](../../metadata-ingestion/sink_docs/datahub.md#datahub-rest) emitter is included in the plugin package by default. To use [DataHub Kafka](../../metadata-ingestion/sink_docs/datahub.md#datahub-kafka) install `pip install acryl-datahub-airflow-plugin[datahub-kafka]`.
+Set up a DataHub connection in Airflow.
-:::
+```shell
+airflow connections add --conn-type 'datahub-rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password ''
+```
+
+No additional configuration is required to use the plugin. However, there are some optional configuration parameters that can be set in the `airflow.cfg` file.
+
+```ini title="airflow.cfg"
+[datahub]
+# Optional - additional config here.
+enabled = True # default
+```
+
+| Name | Default value | Description |
+| -------------------------- | -------------------- | ---------------------------------------------------------------------------------------- |
+| enabled | true | If the plugin should be enabled. |
+| conn_id | datahub_rest_default | The name of the datahub rest connection. |
+| cluster                    | prod                 | Name of the Airflow cluster.                                                               |
+| capture_ownership_info | true | Extract DAG ownership. |
+| capture_tags_info | true | Extract DAG tags. |
+| capture_executions        | true                 | Extract task runs and success/failure statuses. This will show up in the DataHub "Runs" tab. |
+| enable_extractors | true | Enable automatic lineage extraction. |
+| disable_openlineage_plugin | true | Disable the OpenLineage plugin to avoid duplicative processing. |
+| log_level | _no change_ | [debug] Set the log level for the plugin. |
+| debug_emitter | false | [debug] If true, the plugin will log the emitted events. |
+
+### Automatic lineage extraction
+
+To automatically extract lineage information, the v2 plugin builds on top of Airflow's built-in [OpenLineage extractors](https://openlineage.io/docs/integrations/airflow/default-extractors).
-2. Disable lazy plugin loading in your airflow.cfg.
- On MWAA you should add this config to your [Apache Airflow configuration options](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html#configuring-2.0-airflow-override).
+The SQL-related extractors have been updated to use DataHub's SQL parser, which is more robust than the built-in one and uses DataHub's metadata information to generate column-level lineage. We discussed the DataHub SQL parser, including why schema-aware parsing works better and how it performs on benchmarks, during the [June 2023 community town hall](https://youtu.be/1QVcUmRQK5E?si=U27zygR7Gi_KdkzE&t=2309).
+
+## DataHub Plugin v1
+
+### Installation
+
+The v1 plugin requires Airflow 2.1+ and Python 3.8+. If you're on older versions, it's still possible to use an older version of the plugin. See the [compatibility section](#compatibility) for more details.
+
+If you're using Airflow 2.3+, we recommend using the v2 plugin instead. If you need to use the v1 plugin with Airflow 2.3+, you must also set the environment variable `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true`.
+
+```shell
+pip install 'acryl-datahub-airflow-plugin[plugin-v1]'
+
+# The DataHub rest connection type is included by default.
+# To use the DataHub Kafka connection type, install the plugin with the kafka extras.
+pip install 'acryl-datahub-airflow-plugin[plugin-v1,datahub-kafka]'
+```
+
+
+
+### Configuration
+
+#### Disable lazy plugin loading
```ini title="airflow.cfg"
[core]
lazy_load_plugins = False
```
-3. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.
+On MWAA you should add this config to your [Apache Airflow configuration options](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html#configuring-2.0-airflow-override).
+
+#### Setup a DataHub connection
- ```shell
- # For REST-based:
- airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password ''
- # For Kafka-based (standard Kafka sink config can be passed via extras):
- airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
- ```
+You must configure an Airflow connection for Datahub. We support both a Datahub REST connection and a Kafka-based connection, but you only need one.
-4. Add your `datahub_conn_id` and/or `cluster` to your `airflow.cfg` file if it is not align with the default values. See configuration parameters below
+```shell
+# For REST-based:
+airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password ''
+# For Kafka-based (standard Kafka sink config can be passed via extras):
+airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
+```
- **Configuration options:**
+#### Configure the plugin
- | Name | Default value | Description |
- | ------------------------------ | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
- | datahub.enabled | true | If the plugin should be enabled. |
- | datahub.conn_id | datahub_rest_default | The name of the datahub connection you set in step 1. |
- | datahub.cluster | prod | name of the airflow cluster |
- | datahub.capture_ownership_info | true | If true, the owners field of the DAG will be capture as a DataHub corpuser. |
- | datahub.capture_tags_info | true | If true, the tags field of the DAG will be captured as DataHub tags. |
- | datahub.capture_executions | true | If true, we'll capture task runs in DataHub in addition to DAG definitions. |
- | datahub.graceful_exceptions | true | If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions. |
+If your config doesn't align with the default values, you can configure the plugin in your `airflow.cfg` file.
+
+```ini title="airflow.cfg"
+[datahub]
+enabled = true
+conn_id = datahub_rest_default # or datahub_kafka_default
+# etc.
+```
-5. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
-6. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
+| Name | Default value | Description |
+| ---------------------- | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| enabled | true | If the plugin should be enabled. |
+| conn_id                | datahub_rest_default | The name of the datahub connection you set up above.                                                                                                                                      |
+| cluster                | prod                 | Name of the Airflow cluster.                                                                                                                                                              |
+| capture_ownership_info | true                 | If true, the owners field of the DAG will be captured as a DataHub corpuser.                                                                                                              |
+| capture_tags_info | true | If true, the tags field of the DAG will be captured as DataHub tags. |
+| capture_executions | true | If true, we'll capture task runs in DataHub in addition to DAG definitions. |
+| graceful_exceptions | true | If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions. |
-### How to validate installation
+#### Validate that the plugin is working
1. Go and check in Airflow at Admin -> Plugins menu if you can see the DataHub plugin
2. Run an Airflow DAG. In the task logs, you should see Datahub related log messages like:
@@ -77,9 +140,22 @@ lazy_load_plugins = False
Emitting DataHub ...
```
-### Emitting lineage via a custom operator to the Airflow Plugin
+## Manual Lineage Annotation
+
+### Using `inlets` and `outlets`
+
+You can manually annotate lineage by setting `inlets` and `outlets` on your Airflow operators. This is useful if you're using an operator that doesn't support automatic lineage extraction, or if you want to override the automatic lineage extraction.
+
+We have a few code samples that demonstrate how to use `inlets` and `outlets`:
-If you have created a custom Airflow operator [docs](https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html) that inherits from the BaseOperator class,
+- [`lineage_backend_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py)
+- [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py) - uses the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html)
+
+For more information, take a look at the [Airflow lineage docs](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html).
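+
+For illustration, here's a minimal sketch of annotating lineage directly on an operator. The platform, dataset names, and URN below are placeholders - swap in your own.
+
+```python
+from airflow.operators.bash import BashOperator
+
+from datahub_airflow_plugin.entities import Dataset, Urn
+
+transform = BashOperator(
+    task_id="transform",
+    bash_command="echo 'run your transformation here'",
+    # Hypothetical upstream datasets - replace with your own.
+    inlets=[
+        Dataset(platform="snowflake", name="mydb.schema.table_a"),
+        Urn("urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.table_b,PROD)"),
+    ],
+    # Hypothetical downstream dataset.
+    outlets=[Dataset(platform="snowflake", name="mydb.schema.table_c")],
+)
+```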
+
+### Custom Operators
+
+If you have created a [custom Airflow operator](https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html) that inherits from the BaseOperator class,
when overriding the `execute` function, set inlets and outlets via `context['ti'].task.inlets` and `context['ti'].task.outlets`.
The DataHub Airflow plugin will then pick up those inlets and outlets after the task runs.
@@ -90,7 +166,7 @@ class DbtOperator(BaseOperator):
def execute(self, context):
# do something
inlets, outlets = self._get_lineage()
- # inlets/outlets are lists of either datahub_provider.entities.Dataset or datahub_provider.entities.Urn
+ # inlets/outlets are lists of either datahub_airflow_plugin.entities.Dataset or datahub_airflow_plugin.entities.Urn
context['ti'].task.inlets = self.inlets
context['ti'].task.outlets = self.outlets
@@ -100,78 +176,25 @@ class DbtOperator(BaseOperator):
return inlets, outlets
```
-If you override the `pre_execute` and `post_execute` function, ensure they include the `@prepare_lineage` and `@apply_lineage` decorators respectively. [source](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/lineage.html#lineage)
-
-## Using DataHub's Airflow lineage backend (deprecated)
-
-:::caution
-
-The DataHub Airflow plugin (above) is the recommended way to integrate Airflow with DataHub. For managed services like MWAA, the lineage backend is not supported and so you must use the Airflow plugin.
-
-If you're using Airflow 1.x, we recommend using the Airflow lineage backend with acryl-datahub <= 0.9.1.0.
-
-:::
-
-:::note
-
-If you are looking to run Airflow and DataHub using docker locally, follow the guide [here](../../docker/airflow/local_airflow.md). Otherwise proceed to follow the instructions below.
-:::
-
-### Setting up Airflow to use DataHub as Lineage Backend
-
-1. You need to install the required dependency in your airflow. See
-
-```shell
-pip install acryl-datahub[airflow]
-# If you need the Kafka-based emitter/hook:
-pip install acryl-datahub[airflow,datahub-kafka]
-```
-
-2. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.
-
- ```shell
- # For REST-based:
- airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password ''
- # For Kafka-based (standard Kafka sink config can be passed via extras):
- airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
- ```
+If you override the `pre_execute` and `post_execute` function, ensure they include the `@prepare_lineage` and `@apply_lineage` decorators respectively. Reference the [Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/lineage.html#lineage) for more details.
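+
+As a rough sketch, an overridden `pre_execute`/`post_execute` pair with the decorators applied might look like this (the operator itself is illustrative only):
+
+```python
+from airflow.lineage import apply_lineage, prepare_lineage
+from airflow.models.baseoperator import BaseOperator
+
+
+class MyLineageAwareOperator(BaseOperator):
+    def execute(self, context):
+        # Do the actual work here.
+        ...
+
+    @prepare_lineage
+    def pre_execute(self, context):
+        # Custom pre-execution logic; the decorator prepares the task's inlets/outlets.
+        ...
+
+    @apply_lineage
+    def post_execute(self, context, result=None):
+        # Custom post-execution logic; the decorator forwards lineage after the task runs.
+        ...
+```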
-3. Add the following lines to your `airflow.cfg` file.
+## Emit Lineage Directly
- ```ini title="airflow.cfg"
- [lineage]
- backend = datahub_provider.lineage.datahub.DatahubLineageBackend
- datahub_kwargs = {
- "enabled": true,
- "datahub_conn_id": "datahub_rest_default",
- "cluster": "prod",
- "capture_ownership_info": true,
- "capture_tags_info": true,
- "graceful_exceptions": true }
- # The above indentation is important!
- ```
+If you can't use the plugin or annotate inlets/outlets, you can also emit lineage using the `DatahubEmitterOperator`.
- **Configuration options:**
+Reference [`lineage_emission_dag.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_emission_dag.py) for a full example.
- - `datahub_conn_id` (required): Usually `datahub_rest_default` or `datahub_kafka_default`, depending on what you named the connection in step 1.
- - `cluster` (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with.
- - `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be capture as a DataHub corpuser.
- - `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
- - `capture_executions` (defaults to false): If true, it captures task runs as DataHub DataProcessInstances.
- - `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.
+In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See the plugin configuration for examples.
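+
+The core of that example boils down to something like the following sketch. The URNs are placeholders, and the exact import path for `DatahubEmitterOperator` may vary between plugin versions.
+
+```python
+import datahub.emitter.mce_builder as builder
+from datahub_airflow_plugin.operators.datahub import DatahubEmitterOperator
+
+emit_lineage_task = DatahubEmitterOperator(
+    task_id="emit_lineage",
+    datahub_conn_id="datahub_rest_default",
+    mces=[
+        builder.make_lineage_mce(
+            # Hypothetical upstream/downstream datasets - replace with your own.
+            upstream_urns=[builder.make_dataset_urn("snowflake", "mydb.schema.table_a")],
+            downstream_urn=builder.make_dataset_urn("snowflake", "mydb.schema.table_b"),
+        )
+    ],
+)
+```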
-4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
-5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
-
-## Emitting lineage via a separate operator
-
-Take a look at this sample DAG:
+## Debugging
-- [`lineage_emission_dag.py`](../../metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.
+### Missing lineage
-In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See step 1 above for details.
+If you're not seeing lineage in DataHub, check the following:
-## Debugging
+- Validate that the plugin is loaded in Airflow. Go to Admin -> Plugins and check that the DataHub plugin is listed.
+- If using the v2 plugin's automatic lineage, ensure that the `enable_extractors` config is set to true and that automatic lineage is supported for your operator.
+- If using manual lineage annotation, ensure that you're using the `datahub_airflow_plugin.entities.Dataset` or `datahub_airflow_plugin.entities.Urn` classes for your inlets and outlets.
### Incorrect URLs
@@ -179,9 +202,21 @@ If your URLs aren't being generated correctly (usually they'll start with `http:
```ini title="airflow.cfg"
[webserver]
-base_url = http://airflow.example.com
+base_url = http://airflow.mycorp.example.com
```
+## Compatibility
+
+We no longer officially support Airflow <2.1. However, you can use older versions of `acryl-datahub-airflow-plugin` with older versions of Airflow.
+Both of these options support Python 3.7+.
+
+- For Airflow 1.10.x, use DataHub plugin v1 with acryl-datahub-airflow-plugin <= 0.9.1.0.
+- For Airflow 2.0.x, use DataHub plugin v1 with acryl-datahub-airflow-plugin <= 0.11.0.1.
+
+DataHub also previously supported an Airflow [lineage backend](https://airflow.apache.org/docs/apache-airflow/2.2.0/lineage.html#lineage-backend) implementation. While the implementation is still in our codebase, it is deprecated and will be removed in a future release.
+Note that the lineage backend did not support automatic lineage extraction, did not capture task failures, and did not work in AWS MWAA.
+The [documentation for the lineage backend](https://docs-website-1wmaehubl-acryldata.vercel.app/docs/lineage/airflow/#using-datahubs-airflow-lineage-backend-deprecated) has already been archived.
+
## Additional references
Related Datahub videos:
diff --git a/docs/lineage/lineage-feature-guide.md b/docs/lineage/lineage-feature-guide.md
deleted file mode 100644
index 678afce4c46a0..0000000000000
--- a/docs/lineage/lineage-feature-guide.md
+++ /dev/null
@@ -1,222 +0,0 @@
-import FeatureAvailability from '@site/src/components/FeatureAvailability';
-
-# About DataHub Lineage
-
-
-
-Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream.
-
-If you're using an ingestion source that supports extraction of Lineage (e.g. the "Table Lineage Capability"), then lineage information can be extracted automatically. For detailed instructions, refer to the source documentation for the source you are using. If you are not using a Lineage-support ingestion source, you can programmatically emit lineage edges between entities via API.
-
-Alternatively, as of `v0.9.5`, DataHub supports the manual editing of lineage between entities. Data experts are free to add or remove upstream and downstream lineage edges in both the Lineage Visualization screen as well as the Lineage tab on entity pages. Use this feature to supplement automatic lineage extraction or establish important entity relationships in sources that do not support automatic extraction. Editing lineage by hand is supported for Datasets, Charts, Dashboards, and Data Jobs.
-
-:::note
-
-Lineage added by hand and programmatically may conflict with one another to cause unwanted overwrites. It is strongly recommend that lineage is edited manually in cases where lineage information is not also extracted in automated fashion, e.g. by running an ingestion source.
-
-:::
-
-Types of lineage connections supported in DataHub are:
-
-* Dataset-to-dataset
-* Pipeline lineage (dataset-to-job-to-dataset)
-* Dashboard-to-chart lineage
-* Chart-to-dataset lineage
-* Job-to-dataflow (dbt lineage)
-
-## Lineage Setup, Prerequisites, and Permissions
-
-To edit lineage for an entity, you'll need the following [Metadata Privilege](../authorization/policies.md):
-
-* **Edit Lineage** metadata privilege to edit lineage at the entity level
-
-It is important to know that the **Edit Lineage** privilege is required for all entities whose lineage is affected by the changes. For example, in order to add "Dataset B" as an upstream dependency of "Dataset A", you'll need the **Edit Lineage** privilege for both Dataset A and Dataset B.
-
-## Managing Lineage via the DataHub UI
-
-### Viewing lineage on the Datahub UI
-The UI shows the latest version of the lineage. The time picker can be used to filter out edges within the latest version to exclude those that were last updated outside of the time window. Selecting time windows in the patch will not show you historical lineages. It will only filter the view of the latest version of the lineage.
-
-### Editing from Lineage Graph View
-
-The first place that you can edit lineage for entities is from the Lineage Visualization screen. Click on the "Lineage" button on the top right of an entity's profile to get to this view.
-
-
-
-
-
-Once you find the entity that you want to edit the lineage of, click on the three-dot menu dropdown to select whether you want to edit lineage in the upstream direction or the downstream direction.
-
-
-
-
-
-If you want to edit upstream lineage for entities downstream of the center node or downstream lineage for entities upstream of the center node, you can simply re-center to focus on the node you want to edit. Once focused on the desired node, you can edit lineage in either direction.
-
-
-
-
-
-#### Adding Lineage Edges
-
-Once you click "Edit Upstream" or "Edit Downstream," a modal will open that allows you to manage lineage for the selected entity in the chosen direction. In order to add a lineage edge to a new entity, search for it by name in the provided search bar and select it. Once you're satisfied with everything you've added, click "Save Changes." If you change your mind, you can always cancel or exit without saving the changes you've made.
-
-
-
-
-
-#### Removing Lineage Edges
-
-You can remove lineage edges from the same modal used to add lineage edges. Find the edge(s) that you want to remove, and click the "X" on the right side of it. And just like adding, you need to click "Save Changes" to save and if you exit without saving, your changes won't be applied.
-
-
-
-
-
-#### Reviewing Changes
-
-Any time lineage is edited manually, we keep track of who made the change and when they made it. You can see this information in the modal where you add and remove edges. If an edge was added manually, a user avatar will be in line with the edge that was added. You can hover over this avatar in order to see who added it and when.
-
-
-
-
-
-### Editing from Lineage Tab
-
-The other place that you can edit lineage for entities is from the Lineage Tab on an entity's profile. Click on the "Lineage" tab in an entity's profile and then find the "Edit" dropdown that allows you to edit upstream or downstream lineage for the given entity.
-
-
-
-
-
-Using the modal from this view will work the same as described above for editing from the Lineage Visualization screen.
-
-## Managing Lineage via API
-
-:::note
-
- When you emit any lineage aspect, the existing aspect gets completely overwritten, unless specifically using patch semantics.
-This means that the latest version visible in the UI will be your version.
-
-:::
-
-### Using Dataset-to-Dataset Lineage
-
-This relationship model uses dataset -> dataset connection through the UpstreamLineage aspect in the Dataset entity.
-
-Here are a few samples for the usage of this type of lineage:
-
-* [lineage_emitter_mcpw_rest.py](../../metadata-ingestion/examples/library/lineage_emitter_mcpw_rest.py) - emits simple bigquery table-to-table (dataset-to-dataset) lineage via REST as MetadataChangeProposalWrapper.
-* [lineage_emitter_rest.py](../../metadata-ingestion/examples/library/lineage_emitter_rest.py) - emits simple dataset-to-dataset lineage via REST as MetadataChangeEvent.
-* [lineage_emitter_kafka.py](../../metadata-ingestion/examples/library/lineage_emitter_kafka.py) - emits simple dataset-to-dataset lineage via Kafka as MetadataChangeEvent.
-* [lineage_emitter_dataset_finegrained.py](../../metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained.py) - emits fine-grained dataset-dataset lineage via REST as MetadataChangeProposalWrapper.
-* [Datahub Snowflake Lineage](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py) - emits Datahub's Snowflake lineage as MetadataChangeProposalWrapper.
-* [Datahub BigQuery Lineage](https://github.com/datahub-project/datahub/blob/3022c2d12e68d221435c6134362c1a2cba2df6b3/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py#L1028) - emits Datahub's Bigquery lineage as MetadataChangeProposalWrapper. **Use the patch feature to add to rather than overwrite the current lineage.**
-
-### Using dbt Lineage
-
-This model captures dbt specific nodes (tables, views, etc.) and
-
-* uses datasets as the base entity type and
-* extends subclass datasets for each dbt-specific concept, and
-* links them together for dataset-to-dataset lineage
-
-Here is a sample usage of this lineage:
-
-* [Datahub dbt Lineage](https://github.com/datahub-project/datahub/blob/a9754ebe83b6b73bc2bfbf49d9ebf5dbd2ca5a8f/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L625,L630) - emits Datahub's dbt lineage as MetadataChangeEvent.
-
-### Using Pipeline Lineage
-
-The relationship model for this is datajob-to-dataset through the dataJobInputOutput aspect in the DataJob entity.
-
-For Airflow, this lineage is supported using Airflow’s lineage backend which allows you to specify the inputs to and output from that task.
-
-If you annotate that on your task we can pick up that information and push that as lineage edges into datahub automatically. You can install this package from Airflow’s Astronomer marketplace [here](https://registry.astronomer.io/providers/datahub).
-
-Here are a few samples for the usage of this type of lineage:
-
-* [lineage_dataset_job_dataset.py](../../metadata-ingestion/examples/library/lineage_dataset_job_dataset.py) - emits mysql-to-airflow-to-kafka (dataset-to-job-to-dataset) lineage via REST as MetadataChangeProposalWrapper.
-* [lineage_job_dataflow.py](../../metadata-ingestion/examples/library/lineage_job_dataflow.py) - emits the job-to-dataflow lineage via REST as MetadataChangeProposalWrapper.
-
-### Using Dashboard-to-Chart Lineage
-
-This relationship model uses the dashboardInfo aspect of the Dashboard entity and models an explicit edge between a dashboard and a chart (such that charts can be attached to multiple dashboards).
-
-Here is a sample usage of this lineage:
-
-* [lineage_chart_dashboard.py](../../metadata-ingestion/examples/library/lineage_chart_dashboard.py) - emits the chart-to-dashboard lineage via REST as MetadataChangeProposalWrapper.
-
-### Using Chart-to-Dataset Lineage
-
-This relationship model uses the chartInfo aspect of the Chart entity.
-
-Here is a sample usage of this lineage:
-
-* [lineage_dataset_chart.py](../../metadata-ingestion/examples/library/lineage_dataset_chart.py) - emits the dataset-to-chart lineage via REST as MetadataChangeProposalWrapper.
-
-## Additional Resources
-
-### Videos
-
-**DataHub Basics: Lineage 101**
-
-
-
-
-
-**DataHub November 2022 Town Hall - Including Manual Lineage Demo**
-
-
-
-
-
-### GraphQL
-
-* [updateLineage](../../graphql/mutations.md#updatelineage)
-* [searchAcrossLineage](../../graphql/queries.md#searchacrosslineage)
-* [searchAcrossLineageInput](../../graphql/inputObjects.md#searchacrosslineageinput)
-
-#### Examples
-
-**Updating Lineage**
-
-```graphql
-mutation updateLineage {
- updateLineage(input: {
- edgesToAdd: [
- {
- downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)",
- upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:datahub,Dataset,PROD)"
- }
- ],
- edgesToRemove: [
- {
- downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
- upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)"
- }
- ]
- })
-}
-```
-
-### DataHub Blog
-
-* [Acryl Data introduces lineage support and automated propagation of governance information for Snowflake in DataHub](https://blog.datahubproject.io/acryl-data-introduces-lineage-support-and-automated-propagation-of-governance-information-for-339c99536561)
-* [Data in Context: Lineage Explorer in DataHub](https://blog.datahubproject.io/data-in-context-lineage-explorer-in-datahub-a53a9a476dc4)
-* [Harnessing the Power of Data Lineage with DataHub](https://blog.datahubproject.io/harnessing-the-power-of-data-lineage-with-datahub-ad086358dec4)
-
-## FAQ and Troubleshooting
-
-**The Lineage Tab is greyed out - why can’t I click on it?**
-
-This means you have not yet ingested lineage metadata for that entity. Please ingest lineage to proceed.
-
-**Are there any recommended practices for emitting lineage?**
-
-We recommend emitting aspects as MetadataChangeProposalWrapper over emitting them via the MetadataChangeEvent.
-
-*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!*
-
-### Related Features
-
-* [DataHub Lineage Impact Analysis](../act-on-metadata/impact-analysis.md)
diff --git a/docs/ownership/ownership-types.md b/docs/ownership/ownership-types.md
index f1b951871a5a2..dbb08dd71ce6b 100644
--- a/docs/ownership/ownership-types.md
+++ b/docs/ownership/ownership-types.md
@@ -7,7 +7,7 @@ import TabItem from '@theme/TabItem';
**🤝 Version compatibility**
-> Open Source DataHub: **0.10.3** | Acryl: **0.2.8**
+> Open Source DataHub: **0.10.4** | Acryl: **0.2.8**
## What are Custom Ownership Types?
Custom Ownership Types are an improvement on the way to establish ownership relationships between users and the data assets they manage within DataHub.
@@ -85,7 +85,7 @@ source:
type: "file"
config:
# path to json file
- filename: "metadata-ingestion/examples/ownership/ownership_type.json"
+ path: "metadata-ingestion/examples/ownership/ownership_type.json"
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
diff --git a/docs/saas.md b/docs/saas.md
index 35dde5b1ca9a9..de57b5617e062 100644
--- a/docs/saas.md
+++ b/docs/saas.md
@@ -5,10 +5,10 @@ Sign up for fully managed, hassle-free and secure SaaS service for DataHub, prov
+
+The UI shows the latest version of the lineage. The time picker can be used to filter out edges within the latest version to exclude those that were last updated outside of the time window. Selecting time windows in the past will not show you historical lineages. It will only filter the view of the latest version of the lineage.
+
+
+
+
+
+
+:::tip The Lineage Tab is greyed out - why can’t I click on it?
+This means you have not yet ingested lineage metadata for that entity. Please ingest lineage to proceed.
+
+:::
+
+## Adding Lineage
+
+### Ingestion Source
+
+If you're using an ingestion source that supports extraction of Lineage (e.g. **Table Lineage Capability**), then lineage information can be extracted automatically.
+For detailed instructions, refer to the [source documentation](https://datahubproject.io/integrations) for the source you are using.
+
+### UI
+
+As of `v0.9.5`, DataHub supports the manual editing of lineage between entities. Data experts are free to add or remove upstream and downstream lineage edges in both the Lineage Visualization screen and the Lineage tab on entity pages. Use this feature to supplement automatic lineage extraction or establish important entity relationships in sources that do not support automatic extraction. Editing lineage by hand is supported for Datasets, Charts, Dashboards, and Data Jobs.
+Please refer to our [UI Guides on Lineage](../../features/feature-guides/ui-lineage.md) for more information.
+
+:::caution Recommendation on UI-based lineage
+
+Lineage added by hand and programmatically may conflict with one another, causing unwanted overwrites.
+It is strongly recommended that lineage be edited manually in cases where lineage information is not also extracted in an automated fashion, e.g. by running an ingestion source.
+
+:::
+
+### API
+
+If you are not using an ingestion source that supports lineage extraction, you can programmatically emit lineage edges between entities via API.
+Please refer to [API Guides on Lineage](../../api/tutorials/lineage.md) for more information.
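+
+For instance, a single dataset-to-dataset edge can be emitted with the Python SDK roughly as follows (the URNs and server address are placeholders):
+
+```python
+import datahub.emitter.mce_builder as builder
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+
+# Hypothetical upstream/downstream datasets - replace with your own.
+lineage_mce = builder.make_lineage_mce(
+    [builder.make_dataset_urn("bigquery", "upstream_project.dataset.table_a")],
+    builder.make_dataset_urn("bigquery", "downstream_project.dataset.table_b"),
+)
+
+emitter = DatahubRestEmitter("http://localhost:8080")
+emitter.emit_mce(lineage_mce)
+```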
+
+
+## Lineage Support
+
+### Automatic Lineage Extraction Support
+
+This is a summary of automatic lineage extraction support in our data sources. Please refer to the **Important Capabilities** table in the source documentation. Note that even if the source does not support automatic extraction, you can still add lineage manually using our API & SDKs.\n""")
+
+ f.write("\n| Source | Table-Level Lineage | Column-Level Lineage | Related Configs |\n")
+ f.write("| ---------- | ------ | ----- |----- |\n")
+
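+    # For each source platform (sorted by display name) and each of its plugins,
+    # emit one table row summarizing table- and column-level lineage support.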
+ for platform_id, platform_docs in sorted(
+ source_documentation.items(),
+ key=lambda x: (x[1]["name"].casefold(), x[1]["name"])
+ if "name" in x[1]
+ else (x[0].casefold(), x[0]),
+ ):
+ for plugin, plugin_docs in sorted(
+ platform_docs["plugins"].items(),
+ key=lambda x: str(x[1].get("doc_order"))
+ if x[1].get("doc_order")
+ else x[0],
+ ):
+ platform_name = platform_docs['name']
+ if len(platform_docs["plugins"].keys()) > 1:
+ # We only need to show this if there are multiple modules.
+ platform_name = f"{platform_name} `{plugin}`"
+
+ # Initialize variables
+ table_level_supported = "❌"
+ column_level_supported = "❌"
+ config_names = ''
+
+ if "capabilities" in plugin_docs:
+ plugin_capabilities = plugin_docs["capabilities"]
+
+ for cap_setting in plugin_capabilities:
+ capability_text = get_capability_text(cap_setting.capability)
+ capability_supported = get_capability_supported_badge(cap_setting.supported)
+
+ if capability_text == "Table-Level Lineage" and capability_supported == "✅":
+ table_level_supported = "✅"
+
+ if capability_text == "Column-level Lineage" and capability_supported == "✅":
+ column_level_supported = "✅"
+
+ if not (table_level_supported == "❌" and column_level_supported == "❌"):
+ if "config_schema" in plugin_docs:
+ config_properties = json.loads(plugin_docs['config_schema']).get('properties', {})
+ config_names = ' '.join(
+ [f'- {property_name}' for property_name in config_properties if 'lineage' in property_name])
+ lineage_not_applicable_sources = ['azure-ad', 'csv', 'demo-data', 'dynamodb', 'iceberg', 'json-schema', 'ldap', 'openapi', 'pulsar', 'sqlalchemy' ]
+ if platform_id not in lineage_not_applicable_sources :
+ f.write(
+ f"| [{platform_name}](../../generated/ingestion/sources/{platform_id}.md) | {table_level_supported} | {column_level_supported} | {config_names}|\n"
+ )
+
+ f.write("""
+
+### Types of Lineage Connections
+
+The types of lineage connections supported in DataHub, along with example code for each, are as follows.
+
+| Connection | Examples | A.K.A |
+|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|
+| Dataset to Dataset | - [lineage_emitter_mcpw_rest.py](../../../metadata-ingestion/examples/library/lineage_emitter_mcpw_rest.py) - [lineage_emitter_rest.py](../../../metadata-ingestion/examples/library/lineage_emitter_rest.py) - [lineage_emitter_kafka.py](../../../metadata-ingestion/examples/library/lineage_emitter_kafka.py) - [lineage_emitter_dataset_finegrained.py](../../../metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained.py) - [Datahub BigQuery Lineage](https://github.com/datahub-project/datahub/blob/a1bf95307b040074c8d65ebb86b5eb177fdcd591/metadata-ingestion/src/datahub/ingestion/source/sql/bigquery.py#L229) - [Datahub Snowflake Lineage](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/snowflake.py#L249) |
+| DataJob to DataFlow | - [lineage_job_dataflow.py](../../../metadata-ingestion/examples/library/lineage_job_dataflow.py) | |
+| DataJob to Dataset | - [lineage_dataset_job_dataset.py](../../../metadata-ingestion/examples/library/lineage_dataset_job_dataset.py) | Pipeline Lineage |
+| Chart to Dashboard | - [lineage_chart_dashboard.py](../../../metadata-ingestion/examples/library/lineage_chart_dashboard.py) | |
+| Chart to Dataset | - [lineage_dataset_chart.py](../../../metadata-ingestion/examples/library/lineage_dataset_chart.py) | |
+
+
+:::tip Our Roadmap
+We're actively working on expanding lineage support for new data sources.
+Visit our [Official Roadmap](https://feature-requests.datahubproject.io/roadmap) for upcoming updates!
+:::
+
+## References
+
+- [DataHub Basics: Lineage 101](https://www.youtube.com/watch?v=rONGpsndzRw&t=1s)
+- [DataHub November 2022 Town Hall](https://www.youtube.com/watch?v=BlCLhG8lGoY&t=1s) - Including Manual Lineage Demo
+- [Acryl Data introduces lineage support and automated propagation of governance information for Snowflake in DataHub](https://blog.datahubproject.io/acryl-data-introduces-lineage-support-and-automated-propagation-of-governance-information-for-339c99536561)
+- [Data in Context: Lineage Explorer in DataHub](https://blog.datahubproject.io/data-in-context-lineage-explorer-in-datahub-a53a9a476dc4)
+- [Harnessing the Power of Data Lineage with DataHub](https://blog.datahubproject.io/harnessing-the-power-of-data-lineage-with-datahub-ad086358dec4)
+- [DataHub Lineage Impact Analysis](https://datahubproject.io/docs/next/act-on-metadata/impact-analysis)
+ """)
+
+ print("Lineage Documentation Generation Complete")
if __name__ == "__main__":
logger.setLevel("INFO")
diff --git a/metadata-ingestion/setup.cfg b/metadata-ingestion/setup.cfg
index fad55b99ec938..8b78e4d3c9c6f 100644
--- a/metadata-ingestion/setup.cfg
+++ b/metadata-ingestion/setup.cfg
@@ -75,10 +75,11 @@ disallow_untyped_defs = yes
asyncio_mode = auto
addopts = --cov=src --cov-report= --cov-config setup.cfg --strict-markers
markers =
- slow_unit: marks tests to only run slow unit tests (deselect with '-m not slow_unit')
- integration: marks tests to only run in integration (deselect with '-m "not integration"')
- integration_batch_1: mark tests to only run in batch 1 of integration tests. This is done mainly for parallelisation (deselect with '-m not integration_batch_1')
- slow_integration: marks tests that are too slow to even run in integration (deselect with '-m "not slow_integration"')
+ slow: marks tests that are slow to run, including all docker-based tests (deselect with '-m not slow')
+ integration: marks all integration tests, across all batches (deselect with '-m "not integration"')
+ integration_batch_0: mark tests to run in batch 0 of integration tests. This is done mainly for parallelisation in CI. Batch 0 is the default batch.
+ integration_batch_1: mark tests to run in batch 1 of integration tests
+ integration_batch_2: mark tests to run in batch 2 of integration tests
testpaths =
tests/unit
tests/integration
diff --git a/metadata-ingestion/setup.py b/metadata-ingestion/setup.py
index 65deadf16a5b3..34afa8cdb39a4 100644
--- a/metadata-ingestion/setup.py
+++ b/metadata-ingestion/setup.py
@@ -1,4 +1,3 @@
-import os
import sys
from typing import Dict, Set
@@ -9,16 +8,9 @@
exec(fp.read(), package_metadata)
-def get_long_description():
- root = os.path.dirname(__file__)
- with open(os.path.join(root, "README.md")) as f:
- description = f.read()
-
- return description
-
-
base_requirements = {
- "typing_extensions>=3.10.0.2",
+    # Typing extension should be >=3.10.0.2 ideally but we can't restrict due to an Airflow 2.1 dependency conflict.
+ "typing_extensions>=3.7.4.3",
"mypy_extensions>=0.4.3",
# Actual dependencies.
"typing-inspect",
@@ -258,7 +250,7 @@ def get_long_description():
databricks = {
# 0.1.11 appears to have authentication issues with azure databricks
- "databricks-sdk>=0.1.1, <0.1.11",
+ "databricks-sdk>=0.1.1, != 0.1.11",
"pyspark",
"requests",
}
@@ -270,6 +262,7 @@ def get_long_description():
# Sink plugins.
"datahub-kafka": kafka_common,
"datahub-rest": rest_common,
+ "sync-file-emitter": {"filelock"},
"datahub-lite": {
"duckdb",
"fastapi",
@@ -470,6 +463,7 @@ def get_long_description():
*list(
dependency
for plugin in [
+ "athena",
"bigquery",
"clickhouse",
"clickhouse-usage",
@@ -492,6 +486,7 @@ def get_long_description():
"kafka",
"datahub-rest",
"datahub-lite",
+ "great-expectations",
"presto",
"redash",
"redshift",
@@ -530,6 +525,7 @@ def get_long_description():
"clickhouse",
"delta-lake",
"druid",
+ "feast" if sys.version_info >= (3, 8) else None,
"hana",
"hive",
"iceberg" if sys.version_info >= (3, 8) else None,
@@ -634,6 +630,7 @@ def get_long_description():
"simple_add_dataset_properties = datahub.ingestion.transformer.add_dataset_properties:SimpleAddDatasetProperties",
"pattern_add_dataset_schema_terms = datahub.ingestion.transformer.add_dataset_schema_terms:PatternAddDatasetSchemaTerms",
"pattern_add_dataset_schema_tags = datahub.ingestion.transformer.add_dataset_schema_tags:PatternAddDatasetSchemaTags",
+ "extract_owners_from_tags = datahub.ingestion.transformer.extract_ownership_from_tags:ExtractOwnersFromTagsTransformer",
],
"datahub.ingestion.sink.plugins": [
"file = datahub.ingestion.sink.file:FileSink",
@@ -666,7 +663,12 @@ def get_long_description():
},
license="Apache License 2.0",
description="A CLI to work with DataHub metadata",
- long_description=get_long_description(),
+ long_description="""\
+The `acryl-datahub` package contains a CLI and SDK for interacting with DataHub,
+as well as an integration framework for pulling/pushing metadata from external systems.
+
+See the [DataHub docs](https://datahubproject.io/docs/metadata-ingestion).
+""",
long_description_content_type="text/markdown",
classifiers=[
"Development Status :: 5 - Production/Stable",
diff --git a/metadata-ingestion/src/datahub/api/entities/corpgroup/corpgroup.py b/metadata-ingestion/src/datahub/api/entities/corpgroup/corpgroup.py
index 796786beba21b..a898e35bb810e 100644
--- a/metadata-ingestion/src/datahub/api/entities/corpgroup/corpgroup.py
+++ b/metadata-ingestion/src/datahub/api/entities/corpgroup/corpgroup.py
@@ -2,7 +2,7 @@
import logging
from dataclasses import dataclass
-from typing import TYPE_CHECKING, Callable, Iterable, List, Optional, Union
+from typing import Callable, Iterable, List, Optional, Union
import pydantic
from pydantic import BaseModel
@@ -11,9 +11,10 @@
from datahub.api.entities.corpuser.corpuser import CorpUser, CorpUserGenerationConfig
from datahub.configuration.common import ConfigurationError
from datahub.configuration.validate_field_rename import pydantic_renamed_field
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
-from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
+from datahub.ingestion.graph.client import DataHubGraph
from datahub.metadata.schema_classes import (
CorpGroupEditableInfoClass,
CorpGroupInfoClass,
@@ -25,9 +26,6 @@
_Aspect,
)
-if TYPE_CHECKING:
- from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
-
logger = logging.getLogger(__name__)
@@ -194,30 +192,9 @@ def generate_mcp(
entityUrn=urn, aspect=StatusClass(removed=False)
)
- @staticmethod
- def _datahub_graph_from_datahub_rest_emitter(
- rest_emitter: DatahubRestEmitter,
- ) -> DataHubGraph:
- """
- Create a datahub graph instance from a REST Emitter.
- A stop-gap implementation which is expected to be removed after PATCH support is implemented
- for membership updates for users <-> groups
- """
- graph = DataHubGraph(
- config=DatahubClientConfig(
- server=rest_emitter._gms_server,
- token=rest_emitter._token,
- timeout_sec=rest_emitter._connect_timeout_sec,
- retry_status_codes=rest_emitter._retry_status_codes,
- extra_headers=rest_emitter._session.headers,
- disable_ssl_verification=rest_emitter._session.verify is False,
- )
- )
- return graph
-
def emit(
self,
- emitter: Union[DatahubRestEmitter, "DatahubKafkaEmitter"],
+ emitter: Emitter,
callback: Optional[Callable[[Exception, str], None]] = None,
) -> None:
"""
@@ -235,7 +212,7 @@ def emit(
# who are passing in a DataHubRestEmitter today
# we won't need this in the future once PATCH support is implemented as all emitters
# will work
- datahub_graph = self._datahub_graph_from_datahub_rest_emitter(emitter)
+ datahub_graph = emitter.to_graph()
for mcp in self.generate_mcp(
generation_config=CorpGroupGenerationConfig(
override_editable=self.overrideEditable, datahub_graph=datahub_graph
diff --git a/metadata-ingestion/src/datahub/api/entities/corpuser/corpuser.py b/metadata-ingestion/src/datahub/api/entities/corpuser/corpuser.py
index c67eb02a870a5..9fe1ebedafca7 100644
--- a/metadata-ingestion/src/datahub/api/entities/corpuser/corpuser.py
+++ b/metadata-ingestion/src/datahub/api/entities/corpuser/corpuser.py
@@ -1,14 +1,14 @@
from __future__ import annotations
from dataclasses import dataclass
-from typing import TYPE_CHECKING, Callable, Iterable, List, Optional, Union
+from typing import Callable, Iterable, List, Optional
import pydantic
import datahub.emitter.mce_builder as builder
from datahub.configuration.common import ConfigModel
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
-from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
CorpUserEditableInfoClass,
CorpUserInfoClass,
@@ -16,9 +16,6 @@
StatusClass,
)
-if TYPE_CHECKING:
- from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
-
@dataclass
class CorpUserGenerationConfig:
@@ -144,7 +141,7 @@ def generate_mcp(
def emit(
self,
- emitter: Union[DatahubRestEmitter, "DatahubKafkaEmitter"],
+ emitter: Emitter,
callback: Optional[Callable[[Exception, str], None]] = None,
) -> None:
"""
diff --git a/metadata-ingestion/src/datahub/api/entities/datajob/dataflow.py b/metadata-ingestion/src/datahub/api/entities/datajob/dataflow.py
index 8a04768bc0a72..acd708ee81a5c 100644
--- a/metadata-ingestion/src/datahub/api/entities/datajob/dataflow.py
+++ b/metadata-ingestion/src/datahub/api/entities/datajob/dataflow.py
@@ -1,18 +1,9 @@
import logging
from dataclasses import dataclass, field
-from typing import (
- TYPE_CHECKING,
- Callable,
- Dict,
- Iterable,
- List,
- Optional,
- Set,
- Union,
- cast,
-)
+from typing import Callable, Dict, Iterable, List, Optional, Set, cast
import datahub.emitter.mce_builder as builder
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
AuditStampClass,
@@ -29,10 +20,6 @@
)
from datahub.utilities.urns.data_flow_urn import DataFlowUrn
-if TYPE_CHECKING:
- from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
- from datahub.emitter.rest_emitter import DatahubRestEmitter
-
logger = logging.getLogger(__name__)
@@ -170,7 +157,7 @@ def generate_mcp(self) -> Iterable[MetadataChangeProposalWrapper]:
def emit(
self,
- emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ emitter: Emitter,
callback: Optional[Callable[[Exception, str], None]] = None,
) -> None:
"""
diff --git a/metadata-ingestion/src/datahub/api/entities/datajob/datajob.py b/metadata-ingestion/src/datahub/api/entities/datajob/datajob.py
index 7eb6fc8c8d1a9..0face6415bacc 100644
--- a/metadata-ingestion/src/datahub/api/entities/datajob/datajob.py
+++ b/metadata-ingestion/src/datahub/api/entities/datajob/datajob.py
@@ -1,16 +1,16 @@
from dataclasses import dataclass, field
-from typing import TYPE_CHECKING, Callable, Dict, Iterable, List, Optional, Set, Union
+from typing import Callable, Dict, Iterable, List, Optional, Set
import datahub.emitter.mce_builder as builder
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
AuditStampClass,
AzkabanJobTypeClass,
DataJobInfoClass,
DataJobInputOutputClass,
- DataJobSnapshotClass,
+ FineGrainedLineageClass,
GlobalTagsClass,
- MetadataChangeEventClass,
OwnerClass,
OwnershipClass,
OwnershipSourceClass,
@@ -23,10 +23,6 @@
from datahub.utilities.urns.data_job_urn import DataJobUrn
from datahub.utilities.urns.dataset_urn import DatasetUrn
-if TYPE_CHECKING:
- from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
- from datahub.emitter.rest_emitter import DatahubRestEmitter
-
@dataclass
class DataJob:
@@ -59,6 +55,7 @@ class DataJob:
group_owners: Set[str] = field(default_factory=set)
inlets: List[DatasetUrn] = field(default_factory=list)
outlets: List[DatasetUrn] = field(default_factory=list)
+ fine_grained_lineages: List[FineGrainedLineageClass] = field(default_factory=list)
upstream_urns: List[DataJobUrn] = field(default_factory=list)
def __post_init__(self):
@@ -103,31 +100,6 @@ def generate_tags_aspect(self) -> Iterable[GlobalTagsClass]:
)
return [tags]
- def generate_mce(self) -> MetadataChangeEventClass:
- job_mce = MetadataChangeEventClass(
- proposedSnapshot=DataJobSnapshotClass(
- urn=str(self.urn),
- aspects=[
- DataJobInfoClass(
- name=self.name if self.name is not None else self.id,
- type=AzkabanJobTypeClass.COMMAND,
- description=self.description,
- customProperties=self.properties,
- externalUrl=self.url,
- ),
- DataJobInputOutputClass(
- inputDatasets=[str(urn) for urn in self.inlets],
- outputDatasets=[str(urn) for urn in self.outlets],
- inputDatajobs=[str(urn) for urn in self.upstream_urns],
- ),
- *self.generate_ownership_aspect(),
- *self.generate_tags_aspect(),
- ],
- )
- )
-
- return job_mce
-
def generate_mcp(self) -> Iterable[MetadataChangeProposalWrapper]:
mcp = MetadataChangeProposalWrapper(
entityUrn=str(self.urn),
@@ -159,7 +131,7 @@ def generate_mcp(self) -> Iterable[MetadataChangeProposalWrapper]:
def emit(
self,
- emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ emitter: Emitter,
callback: Optional[Callable[[Exception, str], None]] = None,
) -> None:
"""
@@ -179,6 +151,7 @@ def generate_data_input_output_mcp(self) -> Iterable[MetadataChangeProposalWrapp
inputDatasets=[str(urn) for urn in self.inlets],
outputDatasets=[str(urn) for urn in self.outlets],
inputDatajobs=[str(urn) for urn in self.upstream_urns],
+ fineGrainedLineages=self.fine_grained_lineages,
),
)
yield mcp
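# --- Illustrative usage (not part of the patch) ------------------------------
# The new DataJob.fine_grained_lineages field flows into
# DataJobInputOutputClass.fineGrainedLineages above. Sketch assuming a DataJob
# instance `datajob` with inlets/outlets already populated and an Emitter
# `emitter`; the column name "user_id" is an assumption.
import datahub.emitter.mce_builder as builder
from datahub.metadata.schema_classes import (
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
)

datajob.fine_grained_lineages.append(
    FineGrainedLineageClass(
        upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
        upstreams=[builder.make_schema_field_urn(str(datajob.inlets[0]), "user_id")],
        downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
        downstreams=[builder.make_schema_field_urn(str(datajob.outlets[0]), "user_id")],
    )
)
datajob.emit(emitter)  # any Emitter works with the relaxed signature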
diff --git a/metadata-ingestion/src/datahub/api/entities/dataprocess/dataprocess_instance.py b/metadata-ingestion/src/datahub/api/entities/dataprocess/dataprocess_instance.py
index 9ec389c3a0989..cf6080c7072e6 100644
--- a/metadata-ingestion/src/datahub/api/entities/dataprocess/dataprocess_instance.py
+++ b/metadata-ingestion/src/datahub/api/entities/dataprocess/dataprocess_instance.py
@@ -1,9 +1,10 @@
import time
from dataclasses import dataclass, field
from enum import Enum
-from typing import TYPE_CHECKING, Callable, Dict, Iterable, List, Optional, Union, cast
+from typing import Callable, Dict, Iterable, List, Optional, Union, cast
from datahub.api.entities.datajob import DataFlow, DataJob
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.mcp_builder import DatahubKey
from datahub.metadata.com.linkedin.pegasus2avro.dataprocess import (
@@ -26,10 +27,6 @@
from datahub.utilities.urns.data_process_instance_urn import DataProcessInstanceUrn
from datahub.utilities.urns.dataset_urn import DatasetUrn
-if TYPE_CHECKING:
- from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
- from datahub.emitter.rest_emitter import DatahubRestEmitter
-
class DataProcessInstanceKey(DatahubKey):
cluster: str
@@ -106,7 +103,7 @@ def start_event_mcp(
def emit_process_start(
self,
- emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ emitter: Emitter,
start_timestamp_millis: int,
attempt: Optional[int] = None,
emit_template: bool = True,
@@ -197,7 +194,7 @@ def end_event_mcp(
def emit_process_end(
self,
- emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ emitter: Emitter,
end_timestamp_millis: int,
result: InstanceRunResult,
result_type: Optional[str] = None,
@@ -207,7 +204,7 @@ def emit_process_end(
"""
        Generate a DataProcessInstance finish event and emit it
- :param emitter: (Union[DatahubRestEmitter, DatahubKafkaEmitter]) the datahub emitter to emit generated mcps
+ :param emitter: (Emitter) the datahub emitter to emit generated mcps
:param end_timestamp_millis: (int) the end time of the execution in milliseconds
:param result: (InstanceRunResult) The result of the run
:param result_type: (string) It identifies the system where the native result comes from like Airflow, Azkaban
@@ -261,24 +258,24 @@ def generate_mcp(
@staticmethod
def _emit_mcp(
mcp: MetadataChangeProposalWrapper,
- emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ emitter: Emitter,
callback: Optional[Callable[[Exception, str], None]] = None,
) -> None:
"""
- :param emitter: (Union[DatahubRestEmitter, DatahubKafkaEmitter]) the datahub emitter to emit generated mcps
+ :param emitter: (Emitter) the datahub emitter to emit generated mcps
:param callback: (Optional[Callable[[Exception, str], None]]) the callback method for KafkaEmitter if it is used
"""
emitter.emit(mcp, callback)
def emit(
self,
- emitter: Union["DatahubRestEmitter", "DatahubKafkaEmitter"],
+ emitter: Emitter,
callback: Optional[Callable[[Exception, str], None]] = None,
) -> None:
"""
- :param emitter: (Union[DatahubRestEmitter, DatahubKafkaEmitter]) the datahub emitter to emit generated mcps
+ :param emitter: (Emitter) the datahub emitter to emit generated mcps
:param callback: (Optional[Callable[[Exception, str], None]]) the callback method for KafkaEmitter if it is used
"""
for mcp in self.generate_mcp():
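# --- Illustrative usage (not part of the patch) ------------------------------
# The run-event helpers now also take any Emitter. Sketch assuming an existing
# DataProcessInstance `dpi` and an Emitter `emitter`; the result_type value is
# an assumption.
import time

from datahub.api.entities.dataprocess.dataprocess_instance import InstanceRunResult

dpi.emit_process_start(emitter, start_timestamp_millis=int(time.time() * 1000))
# ... run the job ...
dpi.emit_process_end(
    emitter,
    end_timestamp_millis=int(time.time() * 1000),
    result=InstanceRunResult.SUCCESS,
    result_type="airflow",  # illustrative value
)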
diff --git a/metadata-ingestion/src/datahub/api/entities/dataproduct/dataproduct.py b/metadata-ingestion/src/datahub/api/entities/dataproduct/dataproduct.py
index 04f12b4f61d1e..2d9b14ceb2d06 100644
--- a/metadata-ingestion/src/datahub/api/entities/dataproduct/dataproduct.py
+++ b/metadata-ingestion/src/datahub/api/entities/dataproduct/dataproduct.py
@@ -2,25 +2,15 @@
import time
from pathlib import Path
-from typing import (
- TYPE_CHECKING,
- Any,
- Callable,
- Dict,
- Iterable,
- List,
- Optional,
- Tuple,
- Union,
-)
+from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
import pydantic
from ruamel.yaml import YAML
import datahub.emitter.mce_builder as builder
from datahub.configuration.common import ConfigModel
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
-from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DataHubGraph
from datahub.metadata.schema_classes import (
AuditStampClass,
@@ -43,9 +33,6 @@
from datahub.utilities.registries.domain_registry import DomainRegistry
from datahub.utilities.urns.urn import Urn
-if TYPE_CHECKING:
- from datahub.emitter.kafka_emitter import DatahubKafkaEmitter
-
def patch_list(
orig_list: Optional[list],
@@ -225,7 +212,6 @@ def _generate_properties_mcp(
def generate_mcp(
self, upsert: bool
) -> Iterable[Union[MetadataChangeProposalWrapper, MetadataChangeProposalClass]]:
-
if self._resolved_domain_urn is None:
raise Exception(
f"Unable to generate MCP-s because we were unable to resolve the domain {self.domain} to an urn."
@@ -282,7 +268,7 @@ def generate_mcp(
def emit(
self,
- emitter: Union[DatahubRestEmitter, "DatahubKafkaEmitter"],
+ emitter: Emitter,
upsert: bool,
callback: Optional[Callable[[Exception, str], None]] = None,
) -> None:
@@ -440,7 +426,6 @@ def patch_yaml(
original_dataproduct: DataProduct,
output_file: Path,
) -> bool:
-
update_needed = False
if not original_dataproduct._original_yaml_dict:
raise Exception("Original Data Product was not loaded from yaml")
@@ -523,7 +508,6 @@ def to_yaml(
self,
file: Path,
) -> None:
-
with open(file, "w") as fp:
            yaml = YAML(typ="rt") # default, if not specified, is 'rt' (round-trip)
yaml.indent(mapping=2, sequence=4, offset=2)
diff --git a/metadata-ingestion/src/datahub/cli/delete_cli.py b/metadata-ingestion/src/datahub/cli/delete_cli.py
index 7ab7605ef6363..f9e0eb45692d4 100644
--- a/metadata-ingestion/src/datahub/cli/delete_cli.py
+++ b/metadata-ingestion/src/datahub/cli/delete_cli.py
@@ -13,11 +13,8 @@
from datahub.cli import cli_utils
from datahub.configuration.datetimes import ClickDatetime
from datahub.emitter.aspect import ASPECT_MAP, TIMESERIES_ASPECT_MAP
-from datahub.ingestion.graph.client import (
- DataHubGraph,
- RemovedStatusFilter,
- get_default_graph,
-)
+from datahub.ingestion.graph.client import DataHubGraph, get_default_graph
+from datahub.ingestion.graph.filters import RemovedStatusFilter
from datahub.telemetry import telemetry
from datahub.upgrade import upgrade
from datahub.utilities.perf_timer import PerfTimer
diff --git a/metadata-ingestion/src/datahub/cli/docker_cli.py b/metadata-ingestion/src/datahub/cli/docker_cli.py
index 9fde47c82873c..4afccfe711e34 100644
--- a/metadata-ingestion/src/datahub/cli/docker_cli.py
+++ b/metadata-ingestion/src/datahub/cli/docker_cli.py
@@ -426,7 +426,7 @@ def detect_quickstart_arch(arch: Optional[str]) -> Architectures:
return quickstart_arch
-@docker.command()
+@docker.command() # noqa: C901
@click.option(
"--version",
type=str,
@@ -588,7 +588,7 @@ def detect_quickstart_arch(arch: Optional[str]) -> Architectures:
"arch",
]
)
-def quickstart(
+def quickstart( # noqa: C901
version: Optional[str],
build_locally: bool,
pull_images: bool,
@@ -755,14 +755,21 @@ def quickstart(
up_attempts += 1
logger.debug(f"Executing docker compose up command, attempt #{up_attempts}")
+ up_process = subprocess.Popen(
+ base_command + ["up", "-d", "--remove-orphans"],
+ env=_docker_subprocess_env(),
+ )
try:
- subprocess.run(
- base_command + ["up", "-d", "--remove-orphans"],
- env=_docker_subprocess_env(),
- timeout=_QUICKSTART_UP_TIMEOUT.total_seconds(),
- )
+ up_process.wait(timeout=_QUICKSTART_UP_TIMEOUT.total_seconds())
except subprocess.TimeoutExpired:
- logger.debug("docker compose up timed out, will retry")
+ logger.debug("docker compose up timed out, sending SIGTERM")
+ up_process.terminate()
+ try:
+ up_process.wait(timeout=3)
+ except subprocess.TimeoutExpired:
+ logger.debug("docker compose up still running, sending SIGKILL")
+ up_process.kill()
+ up_process.wait()
# Check docker health every few seconds.
status = check_docker_quickstart()
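# --- Illustrative sketch (not part of the patch) ------------------------------
# General form of the timeout handling introduced above: run the compose command
# with a deadline and escalate SIGTERM -> SIGKILL if it does not exit. The
# command and timeout values are illustrative.
import subprocess

proc = subprocess.Popen(["docker", "compose", "up", "-d", "--remove-orphans"])
try:
    proc.wait(timeout=600)
except subprocess.TimeoutExpired:
    proc.terminate()  # polite SIGTERM first
    try:
        proc.wait(timeout=3)
    except subprocess.TimeoutExpired:
        proc.kill()  # force SIGKILL if still running
        proc.wait()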
diff --git a/metadata-ingestion/src/datahub/emitter/generic_emitter.py b/metadata-ingestion/src/datahub/emitter/generic_emitter.py
new file mode 100644
index 0000000000000..28138c6182758
--- /dev/null
+++ b/metadata-ingestion/src/datahub/emitter/generic_emitter.py
@@ -0,0 +1,31 @@
+from typing import Any, Callable, Optional, Union
+
+from typing_extensions import Protocol
+
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.metadata.com.linkedin.pegasus2avro.mxe import (
+ MetadataChangeEvent,
+ MetadataChangeProposal,
+)
+
+
+class Emitter(Protocol):
+ def emit(
+ self,
+ item: Union[
+ MetadataChangeEvent,
+ MetadataChangeProposal,
+ MetadataChangeProposalWrapper,
+ ],
+ # NOTE: This signature should have the exception be optional rather than
+ # required. However, this would be a breaking change that may need
+ # more careful consideration.
+ callback: Optional[Callable[[Exception, str], None]] = None,
+ # TODO: The rest emitter returns timestamps as the return type. For now
+ # we smooth over that detail using Any, but eventually we should
+ # standardize on a return type.
+ ) -> Any:
+ raise NotImplementedError
+
+ def flush(self) -> None:
+ pass
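# --- Illustrative usage (not part of the patch) ------------------------------
# Emitter is a typing Protocol, so it is satisfied structurally: both
# DataHubRestEmitter and DatahubKafkaEmitter type-check against it (and now also
# subclass it explicitly). A helper written against the protocol, with
# illustrative names:
from typing import Iterable

from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper

def emit_all(emitter: Emitter, mcps: Iterable[MetadataChangeProposalWrapper]) -> None:
    for mcp in mcps:
        emitter.emit(mcp)
    emitter.flush()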
diff --git a/metadata-ingestion/src/datahub/emitter/kafka_emitter.py b/metadata-ingestion/src/datahub/emitter/kafka_emitter.py
index ec0c8f3418a4a..781930011b78f 100644
--- a/metadata-ingestion/src/datahub/emitter/kafka_emitter.py
+++ b/metadata-ingestion/src/datahub/emitter/kafka_emitter.py
@@ -10,6 +10,7 @@
from datahub.configuration.common import ConfigModel
from datahub.configuration.kafka import KafkaProducerConnectionConfig
from datahub.configuration.validate_field_rename import pydantic_renamed_field
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.api.closeable import Closeable
from datahub.metadata.schema_classes import (
@@ -55,7 +56,7 @@ def validate_topic_routes(cls, v: Dict[str, str]) -> Dict[str, str]:
return v
-class DatahubKafkaEmitter(Closeable):
+class DatahubKafkaEmitter(Closeable, Emitter):
def __init__(self, config: KafkaEmitterConfig):
self.config = config
schema_registry_conf = {
diff --git a/metadata-ingestion/src/datahub/emitter/rest_emitter.py b/metadata-ingestion/src/datahub/emitter/rest_emitter.py
index 937e0902d6d8c..afb19df9791af 100644
--- a/metadata-ingestion/src/datahub/emitter/rest_emitter.py
+++ b/metadata-ingestion/src/datahub/emitter/rest_emitter.py
@@ -4,7 +4,7 @@
import logging
import os
from json.decoder import JSONDecodeError
-from typing import Any, Callable, Dict, List, Optional, Tuple, Union
+from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union
import requests
from deprecated import deprecated
@@ -13,6 +13,7 @@
from datahub.cli.cli_utils import get_system_auth
from datahub.configuration.common import ConfigurationError, OperationalError
+from datahub.emitter.generic_emitter import Emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.request_helper import make_curl_command
from datahub.emitter.serialization_helper import pre_json_transform
@@ -23,6 +24,9 @@
)
from datahub.metadata.com.linkedin.pegasus2avro.usage import UsageAggregation
+if TYPE_CHECKING:
+ from datahub.ingestion.graph.client import DataHubGraph
+
logger = logging.getLogger(__name__)
_DEFAULT_CONNECT_TIMEOUT_SEC = 30 # 30 seconds should be plenty to connect
@@ -42,7 +46,7 @@
)
-class DataHubRestEmitter(Closeable):
+class DataHubRestEmitter(Closeable, Emitter):
_gms_server: str
_token: Optional[str]
_session: requests.Session
@@ -190,6 +194,11 @@ def test_connection(self) -> dict:
message += "\nPlease check your configuration and make sure you are talking to the DataHub GMS (usually :8080) or Frontend GMS API (usually :9002/api/gms)."
raise ConfigurationError(message)
+ def to_graph(self) -> "DataHubGraph":
+ from datahub.ingestion.graph.client import DataHubGraph
+
+ return DataHubGraph.from_emitter(self)
+
def emit(
self,
item: Union[
@@ -198,9 +207,6 @@ def emit(
MetadataChangeProposalWrapper,
UsageAggregation,
],
- # NOTE: This signature should have the exception be optional rather than
- # required. However, this would be a breaking change that may need
- # more careful consideration.
callback: Optional[Callable[[Exception, str], None]] = None,
) -> Tuple[datetime.datetime, datetime.datetime]:
start_time = datetime.datetime.now()
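# --- Illustrative usage (not part of the patch) ------------------------------
# to_graph() upgrades a REST emitter into a DataHubGraph client by reusing its
# server, token, timeouts and session headers (see DataHubGraph.from_emitter
# below). The server URL is an assumption.
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
graph = emitter.to_graph()
# `graph` can now be passed anywhere a DataHubGraph is expected, without
# hand-building a DatahubClientConfig as the removed helper above used to do.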
diff --git a/metadata-ingestion/src/datahub/emitter/synchronized_file_emitter.py b/metadata-ingestion/src/datahub/emitter/synchronized_file_emitter.py
new file mode 100644
index 0000000000000..f82882f1a87cc
--- /dev/null
+++ b/metadata-ingestion/src/datahub/emitter/synchronized_file_emitter.py
@@ -0,0 +1,60 @@
+import logging
+import pathlib
+from typing import Callable, Optional, Union
+
+import filelock
+
+from datahub.emitter.generic_emitter import Emitter
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.ingestion.api.closeable import Closeable
+from datahub.ingestion.sink.file import write_metadata_file
+from datahub.ingestion.source.file import read_metadata_file
+from datahub.metadata.com.linkedin.pegasus2avro.mxe import (
+ MetadataChangeEvent,
+ MetadataChangeProposal,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class SynchronizedFileEmitter(Closeable, Emitter):
+ """
+ A multiprocessing-safe emitter that writes to a file.
+
+ This emitter is intended for testing purposes only. It is not performant
+ because it reads and writes the full file on every emit call to ensure
+ that the file is always valid JSON.
+ """
+
+ def __init__(self, filename: str) -> None:
+ self._filename = pathlib.Path(filename)
+ self._lock = filelock.FileLock(self._filename.with_suffix(".lock"))
+
+ def emit(
+ self,
+ item: Union[
+ MetadataChangeEvent, MetadataChangeProposal, MetadataChangeProposalWrapper
+ ],
+ callback: Optional[Callable[[Exception, str], None]] = None,
+ ) -> None:
+ with self._lock:
+ if self._filename.exists():
+ metadata = list(read_metadata_file(self._filename))
+ else:
+ metadata = []
+
+ logger.debug("Emitting metadata: %s", item)
+ metadata.append(item)
+
+ write_metadata_file(self._filename, metadata)
+
+ def __repr__(self) -> str:
+ return f"SynchronizedFileEmitter('{self._filename}')"
+
+ def flush(self) -> None:
+ # No-op.
+ pass
+
+ def close(self) -> None:
+ # No-op.
+ pass
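# --- Illustrative usage (not part of the patch) ------------------------------
# Test-only sketch: several processes can append to the same JSON file because
# every emit() takes the file lock and rewrites the file. The path and URN are
# assumptions.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.synchronized_file_emitter import SynchronizedFileEmitter
from datahub.metadata.schema_classes import StatusClass

emitter = SynchronizedFileEmitter("/tmp/test_mcps.json")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:corpuser:datahub",
        aspect=StatusClass(removed=False),
    )
)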
diff --git a/metadata-ingestion/src/datahub/ingestion/graph/client.py b/metadata-ingestion/src/datahub/ingestion/graph/client.py
index 38e965f7f6587..ccff677c3a471 100644
--- a/metadata-ingestion/src/datahub/ingestion/graph/client.py
+++ b/metadata-ingestion/src/datahub/ingestion/graph/client.py
@@ -7,7 +7,7 @@
from dataclasses import dataclass
from datetime import datetime
from json.decoder import JSONDecodeError
-from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Set, Tuple, Type
+from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Tuple, Type
from avro.schema import RecordSchema
from deprecated import deprecated
@@ -16,15 +16,15 @@
from datahub.cli.cli_utils import get_url_and_token
from datahub.configuration.common import ConfigModel, GraphError, OperationalError
from datahub.emitter.aspect import TIMESERIES_ASPECT_MAP
-from datahub.emitter.mce_builder import (
- DEFAULT_ENV,
- Aspect,
- make_data_platform_urn,
- make_dataplatform_instance_urn,
-)
+from datahub.emitter.mce_builder import DEFAULT_ENV, Aspect
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.emitter.serialization_helper import post_json_transform
+from datahub.ingestion.graph.filters import (
+ RemovedStatusFilter,
+ SearchFilterRule,
+ generate_filter,
+)
from datahub.ingestion.source.state.checkpoint import Checkpoint
from datahub.metadata.schema_classes import (
ASPECT_NAME_MAP,
@@ -59,8 +59,6 @@
logger = logging.getLogger(__name__)
-SearchFilterRule = Dict[str, Any]
-
class DatahubClientConfig(ConfigModel):
"""Configuration class for holding connectivity to datahub gms"""
@@ -81,19 +79,6 @@ class DatahubClientConfig(ConfigModel):
DataHubGraphConfig = DatahubClientConfig
-class RemovedStatusFilter(enum.Enum):
- """Filter for the status of entities during search."""
-
- NOT_SOFT_DELETED = "NOT_SOFT_DELETED"
- """Search only entities that have not been marked as deleted."""
-
- ALL = "ALL"
- """Search all entities, including deleted entities."""
-
- ONLY_SOFT_DELETED = "ONLY_SOFT_DELETED"
- """Search only soft-deleted entities."""
-
-
@dataclass
class RelatedEntity:
urn: str
@@ -153,6 +138,23 @@ def __init__(self, config: DatahubClientConfig) -> None:
self.server_id = "missing"
logger.debug(f"Failed to get server id due to {e}")
+ @classmethod
+ def from_emitter(cls, emitter: DatahubRestEmitter) -> "DataHubGraph":
+ return cls(
+ DatahubClientConfig(
+ server=emitter._gms_server,
+ token=emitter._token,
+ timeout_sec=emitter._read_timeout_sec,
+ retry_status_codes=emitter._retry_status_codes,
+ retry_max_times=emitter._retry_max_times,
+ extra_headers=emitter._session.headers,
+ disable_ssl_verification=emitter._session.verify is False,
+ # TODO: Support these headers.
+ # ca_certificate_path=emitter._ca_certificate_path,
+ # client_certificate_path=emitter._client_certificate_path,
+ )
+ )
+
def _send_restli_request(self, method: str, url: str, **kwargs: Any) -> Dict:
try:
response = self._session.request(method, url, **kwargs)
@@ -567,7 +569,7 @@ def _bulk_fetch_schema_info_by_filter(
# Add the query default of * if no query is specified.
query = query or "*"
- orFilters = self.generate_filter(
+ orFilters = generate_filter(
platform, platform_instance, env, container, status, extraFilters
)
@@ -621,54 +623,6 @@ def _bulk_fetch_schema_info_by_filter(
if entity.get("schemaMetadata"):
yield entity["urn"], entity["schemaMetadata"]
- def generate_filter(
- self,
- platform: Optional[str],
- platform_instance: Optional[str],
- env: Optional[str],
- container: Optional[str],
- status: RemovedStatusFilter,
- extraFilters: Optional[List[SearchFilterRule]],
- ) -> List[Dict[str, List[SearchFilterRule]]]:
- andFilters: List[SearchFilterRule] = []
-
- # Platform filter.
- if platform:
- andFilters.append(self._get_platform_filter(platform))
-
- # Platform instance filter.
- if platform_instance:
- andFilters.append(
- self._get_platform_instance_filter(platform, platform_instance)
- )
-
- # Browse path v2 filter.
- if container:
- andFilters.append(self._get_container_filter(container))
-
- # Status filter.
- status_filter = self._get_status_filer(status)
- if status_filter:
- andFilters.append(status_filter)
-
- # Extra filters.
- if extraFilters:
- andFilters += extraFilters
-
- orFilters: List[Dict[str, List[SearchFilterRule]]] = [{"and": andFilters}]
-
- # Env filter
- if env:
- envOrConditions = self._get_env_or_conditions(env)
- # This matches ALL of the andFilters and at least one of the envOrConditions.
- orFilters = [
- {"and": andFilters["and"] + [extraCondition]}
- for extraCondition in envOrConditions
- for andFilters in orFilters
- ]
-
- return orFilters
-
def get_urns_by_filter(
self,
*,
@@ -709,7 +663,7 @@ def get_urns_by_filter(
query = query or "*"
# Env filter.
- orFilters = self.generate_filter(
+ orFilters = generate_filter(
platform, platform_instance, env, container, status, extraFilters
)
@@ -778,98 +732,6 @@ def _scroll_across_entities(
f"Scrolling to next scrollAcrossEntities page: {scroll_id}"
)
- def _get_env_or_conditions(self, env: str) -> List[SearchFilterRule]:
- # The env filter is a bit more tricky since it's not always stored
- # in the same place in ElasticSearch.
- return [
- # For most entity types, we look at the origin field.
- {
- "field": "origin",
- "value": env,
- "condition": "EQUAL",
- },
- # For containers, we look at the customProperties field.
- # For any containers created after https://github.com/datahub-project/datahub/pull/8027,
- # we look for the "env" property. Otherwise, we use the "instance" property.
- {
- "field": "customProperties",
- "value": f"env={env}",
- },
- {
- "field": "customProperties",
- "value": f"instance={env}",
- },
- # Note that not all entity types have an env (e.g. dashboards / charts).
- # If the env filter is specified, these will be excluded.
- ]
-
- def _get_status_filer(
- self, status: RemovedStatusFilter
- ) -> Optional[SearchFilterRule]:
- if status == RemovedStatusFilter.NOT_SOFT_DELETED:
- # Subtle: in some cases (e.g. when the dataset doesn't have a status aspect), the
- # removed field is simply not present in the ElasticSearch document. Ideally this
- # would be a "removed" : "false" filter, but that doesn't work. Instead, we need to
- # use a negated filter.
- return {
- "field": "removed",
- "values": ["true"],
- "condition": "EQUAL",
- "negated": True,
- }
-
- elif status == RemovedStatusFilter.ONLY_SOFT_DELETED:
- return {
- "field": "removed",
- "values": ["true"],
- "condition": "EQUAL",
- }
-
- elif status == RemovedStatusFilter.ALL:
- # We don't need to add a filter for this case.
- return None
- else:
- raise ValueError(f"Invalid status filter: {status}")
-
- def _get_container_filter(self, container: str) -> SearchFilterRule:
- # Warn if container is not a fully qualified urn.
- # TODO: Change this once we have a first-class container urn type.
- if guess_entity_type(container) != "container":
- raise ValueError(f"Invalid container urn: {container}")
-
- return {
- "field": "browsePathV2",
- "values": [container],
- "condition": "CONTAIN",
- }
-
- def _get_platform_instance_filter(
- self, platform: Optional[str], platform_instance: str
- ) -> SearchFilterRule:
- if platform:
- # Massage the platform instance into a fully qualified urn, if necessary.
- platform_instance = make_dataplatform_instance_urn(
- platform, platform_instance
- )
-
- # Warn if platform_instance is not a fully qualified urn.
- # TODO: Change this once we have a first-class data platform instance urn type.
- if guess_entity_type(platform_instance) != "dataPlatformInstance":
- raise ValueError(f"Invalid data platform instance urn: {platform_instance}")
-
- return {
- "field": "platformInstance",
- "values": [platform_instance],
- "condition": "EQUAL",
- }
-
- def _get_platform_filter(self, platform: str) -> SearchFilterRule:
- return {
- "field": "platform.keyword",
- "values": [make_data_platform_urn(platform)],
- "condition": "EQUAL",
- }
-
def _get_types(self, entity_types: Optional[List[str]]) -> Optional[List[str]]:
types: Optional[List[str]] = None
if entity_types is not None:
@@ -960,7 +822,7 @@ def get_related_entities(
url=relationship_endpoint,
params={
"urn": entity_urn,
- "direction": direction,
+ "direction": direction.value,
"relationshipTypes": relationship_types,
"start": start,
},
@@ -1148,14 +1010,13 @@ def _make_schema_resolver(
def initialize_schema_resolver_from_datahub(
self, platform: str, platform_instance: Optional[str], env: str
- ) -> Tuple["SchemaResolver", Set[str]]:
+ ) -> "SchemaResolver":
logger.info("Initializing schema resolver")
schema_resolver = self._make_schema_resolver(
platform, platform_instance, env, include_graph=False
)
logger.info(f"Fetching schemas for platform {platform}, env {env}")
- urns = []
count = 0
with PerfTimer() as timer:
for urn, schema_info in self._bulk_fetch_schema_info_by_filter(
@@ -1164,7 +1025,6 @@ def initialize_schema_resolver_from_datahub(
env=env,
):
try:
- urns.append(urn)
schema_resolver.add_graphql_schema_metadata(urn, schema_info)
count += 1
except Exception:
@@ -1179,7 +1039,7 @@ def initialize_schema_resolver_from_datahub(
)
logger.info("Finished initializing schema resolver")
- return schema_resolver, set(urns)
+ return schema_resolver
def parse_sql_lineage(
self,
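# --- Illustrative usage (not part of the patch) ------------------------------
# initialize_schema_resolver_from_datahub now returns only the SchemaResolver;
# the old (resolver, urns) tuple unpacking is gone (see the bigquery source
# change below). Assumes a reachable DataHub instance; the server URL is an
# assumption.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
schema_resolver = graph.initialize_schema_resolver_from_datahub(
    platform="bigquery",
    platform_instance=None,
    env="PROD",
)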
diff --git a/metadata-ingestion/src/datahub/ingestion/graph/filters.py b/metadata-ingestion/src/datahub/ingestion/graph/filters.py
new file mode 100644
index 0000000000000..1a63aea835729
--- /dev/null
+++ b/metadata-ingestion/src/datahub/ingestion/graph/filters.py
@@ -0,0 +1,162 @@
+import enum
+from typing import Any, Dict, List, Optional
+
+from datahub.emitter.mce_builder import (
+ make_data_platform_urn,
+ make_dataplatform_instance_urn,
+)
+from datahub.utilities.urns.urn import guess_entity_type
+
+SearchFilterRule = Dict[str, Any]
+
+
+class RemovedStatusFilter(enum.Enum):
+ """Filter for the status of entities during search."""
+
+ NOT_SOFT_DELETED = "NOT_SOFT_DELETED"
+ """Search only entities that have not been marked as deleted."""
+
+ ALL = "ALL"
+ """Search all entities, including deleted entities."""
+
+ ONLY_SOFT_DELETED = "ONLY_SOFT_DELETED"
+ """Search only soft-deleted entities."""
+
+
+def generate_filter(
+ platform: Optional[str],
+ platform_instance: Optional[str],
+ env: Optional[str],
+ container: Optional[str],
+ status: RemovedStatusFilter,
+ extra_filters: Optional[List[SearchFilterRule]],
+) -> List[Dict[str, List[SearchFilterRule]]]:
+ and_filters: List[SearchFilterRule] = []
+
+ # Platform filter.
+ if platform:
+ and_filters.append(_get_platform_filter(platform))
+
+ # Platform instance filter.
+ if platform_instance:
+ and_filters.append(_get_platform_instance_filter(platform, platform_instance))
+
+ # Browse path v2 filter.
+ if container:
+ and_filters.append(_get_container_filter(container))
+
+ # Status filter.
+ status_filter = _get_status_filter(status)
+ if status_filter:
+ and_filters.append(status_filter)
+
+ # Extra filters.
+ if extra_filters:
+ and_filters += extra_filters
+
+ or_filters: List[Dict[str, List[SearchFilterRule]]] = [{"and": and_filters}]
+
+ # Env filter
+ if env:
+ env_filters = _get_env_filters(env)
+        # This matches ALL of the and_filters and at least one of the env_filters.
+ or_filters = [
+ {"and": and_filter["and"] + [extraCondition]}
+ for extraCondition in env_filters
+ for and_filter in or_filters
+ ]
+
+ return or_filters
+
+
+def _get_env_filters(env: str) -> List[SearchFilterRule]:
+ # The env filter is a bit more tricky since it's not always stored
+ # in the same place in ElasticSearch.
+ return [
+ # For most entity types, we look at the origin field.
+ {
+ "field": "origin",
+ "value": env,
+ "condition": "EQUAL",
+ },
+ # For containers, we look at the customProperties field.
+ # For any containers created after https://github.com/datahub-project/datahub/pull/8027,
+ # we look for the "env" property. Otherwise, we use the "instance" property.
+ {
+ "field": "customProperties",
+ "value": f"env={env}",
+ },
+ {
+ "field": "customProperties",
+ "value": f"instance={env}",
+ },
+ # Note that not all entity types have an env (e.g. dashboards / charts).
+ # If the env filter is specified, these will be excluded.
+ ]
+
+
+def _get_status_filter(status: RemovedStatusFilter) -> Optional[SearchFilterRule]:
+ if status == RemovedStatusFilter.NOT_SOFT_DELETED:
+ # Subtle: in some cases (e.g. when the dataset doesn't have a status aspect), the
+ # removed field is simply not present in the ElasticSearch document. Ideally this
+ # would be a "removed" : "false" filter, but that doesn't work. Instead, we need to
+ # use a negated filter.
+ return {
+ "field": "removed",
+ "values": ["true"],
+ "condition": "EQUAL",
+ "negated": True,
+ }
+
+ elif status == RemovedStatusFilter.ONLY_SOFT_DELETED:
+ return {
+ "field": "removed",
+ "values": ["true"],
+ "condition": "EQUAL",
+ }
+
+ elif status == RemovedStatusFilter.ALL:
+ # We don't need to add a filter for this case.
+ return None
+ else:
+ raise ValueError(f"Invalid status filter: {status}")
+
+
+def _get_container_filter(container: str) -> SearchFilterRule:
+ # Warn if container is not a fully qualified urn.
+ # TODO: Change this once we have a first-class container urn type.
+ if guess_entity_type(container) != "container":
+ raise ValueError(f"Invalid container urn: {container}")
+
+ return {
+ "field": "browsePathV2",
+ "values": [container],
+ "condition": "CONTAIN",
+ }
+
+
+def _get_platform_instance_filter(
+ platform: Optional[str], platform_instance: str
+) -> SearchFilterRule:
+ if platform:
+ # Massage the platform instance into a fully qualified urn, if necessary.
+ platform_instance = make_dataplatform_instance_urn(platform, platform_instance)
+
+ # Warn if platform_instance is not a fully qualified urn.
+ # TODO: Change this once we have a first-class data platform instance urn type.
+ if guess_entity_type(platform_instance) != "dataPlatformInstance":
+ raise ValueError(f"Invalid data platform instance urn: {platform_instance}")
+
+ return {
+ "field": "platformInstance",
+ "values": [platform_instance],
+ "condition": "EQUAL",
+ }
+
+
+def _get_platform_filter(platform: str) -> SearchFilterRule:
+ return {
+ "field": "platform.keyword",
+ "values": [make_data_platform_urn(platform)],
+ "condition": "EQUAL",
+ }
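# --- Illustrative usage (not part of the patch) ------------------------------
# The filter construction extracted from DataHubGraph is now a plain function.
# With an env set, the single "and" clause is fanned out into one clause per env
# condition (origin / customProperties), matching the previous behaviour.
from datahub.ingestion.graph.filters import RemovedStatusFilter, generate_filter

or_filters = generate_filter(
    platform="snowflake",
    platform_instance=None,
    env="PROD",
    container=None,
    status=RemovedStatusFilter.NOT_SOFT_DELETED,
    extra_filters=None,
)
# or_filters is a list of {"and": [...]} clauses; each clause combines the
# platform and status rules with one of the env conditions.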
diff --git a/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py b/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py
index 8a16b1a4a5f6b..f6adbcf033bcc 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py
@@ -458,7 +458,7 @@ def _init_schema_resolver(self) -> SchemaResolver:
platform=self.platform,
platform_instance=self.config.platform_instance,
env=self.config.env,
- )[0]
+ )
else:
logger.warning(
"Failed to load schema info from DataHub as DataHubGraph is missing.",
diff --git a/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries.py b/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries.py
index 5be7a0a7f6b2f..a87cb8c1cbfa5 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries.py
@@ -43,14 +43,14 @@ class BigqueryQuery:
t.creation_time as created,
ts.last_modified_time as last_altered,
tos.OPTION_VALUE as comment,
- is_insertable_into,
- ddl,
- row_count,
- size_bytes as bytes,
- num_partitions,
- max_partition_id,
- active_billable_bytes,
- long_term_billable_bytes,
+ t.is_insertable_into,
+ t.ddl,
+ ts.row_count,
+ ts.size_bytes as bytes,
+ p.num_partitions,
+ p.max_partition_id,
+ p.active_billable_bytes,
+ p.long_term_billable_bytes,
REGEXP_EXTRACT(t.table_name, r".*_(\\d+)$") as table_suffix,
REGEXP_REPLACE(t.table_name, r"_(\\d+)$", "") as table_base
@@ -90,8 +90,8 @@ class BigqueryQuery:
t.table_type as table_type,
t.creation_time as created,
tos.OPTION_VALUE as comment,
- is_insertable_into,
- ddl,
+ t.is_insertable_into,
+ t.ddl,
REGEXP_EXTRACT(t.table_name, r".*_(\\d+)$") as table_suffix,
REGEXP_REPLACE(t.table_name, r"_(\\d+)$", "") as table_base
@@ -118,10 +118,10 @@ class BigqueryQuery:
t.creation_time as created,
ts.last_modified_time as last_altered,
tos.OPTION_VALUE as comment,
- is_insertable_into,
- ddl as view_definition,
- row_count,
- size_bytes
+ t.is_insertable_into,
+ t.ddl as view_definition,
+ ts.row_count,
+ ts.size_bytes
FROM
`{{project_id}}`.`{{dataset_name}}`.INFORMATION_SCHEMA.TABLES t
join `{{project_id}}`.`{{dataset_name}}`.__TABLES__ as ts on ts.table_id = t.TABLE_NAME
@@ -143,8 +143,8 @@ class BigqueryQuery:
t.table_type as table_type,
t.creation_time as created,
tos.OPTION_VALUE as comment,
- is_insertable_into,
- ddl as view_definition
+ t.is_insertable_into,
+ t.ddl as view_definition
FROM
`{{project_id}}`.`{{dataset_name}}`.INFORMATION_SCHEMA.TABLES t
left join `{{project_id}}`.`{{dataset_name}}`.INFORMATION_SCHEMA.TABLE_OPTIONS as tos on t.table_schema = tos.table_schema
diff --git a/metadata-ingestion/src/datahub/ingestion/source/delta_lake/source.py b/metadata-ingestion/src/datahub/ingestion/source/delta_lake/source.py
index 180ef00459214..c4d01be52ae7d 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/delta_lake/source.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/delta_lake/source.py
@@ -296,7 +296,8 @@ def get_storage_options(self) -> Dict[str, str]:
"AWS_SECRET_ACCESS_KEY": creds.get("aws_secret_access_key") or "",
"AWS_SESSION_TOKEN": creds.get("aws_session_token") or "",
# Allow http connections, this is required for minio
- "AWS_STORAGE_ALLOW_HTTP": "true",
+ "AWS_STORAGE_ALLOW_HTTP": "true", # for delta-lake < 0.11.0
+ "AWS_ALLOW_HTTP": "true", # for delta-lake >= 0.11.0
}
if aws_config.aws_region:
opts["AWS_REGION"] = aws_config.aws_region
diff --git a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect.py b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect.py
index f3344782917ab..5fae0ee5215a3 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/kafka_connect.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/kafka_connect.py
@@ -28,7 +28,9 @@
)
from datahub.ingestion.api.source import MetadataWorkUnitProcessor, Source
from datahub.ingestion.api.workunit import MetadataWorkUnit
-from datahub.ingestion.source.sql.sql_common import get_platform_from_sqlalchemy_uri
+from datahub.ingestion.source.sql.sqlalchemy_uri_mapper import (
+ get_platform_from_sqlalchemy_uri,
+)
from datahub.ingestion.source.state.stale_entity_removal_handler import (
StaleEntityRemovalHandler,
StaleEntityRemovalSourceReport,
diff --git a/metadata-ingestion/src/datahub/ingestion/source/powerbi/config.py b/metadata-ingestion/src/datahub/ingestion/source/powerbi/config.py
index ffa685fb25826..a8c7e48f3785c 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/powerbi/config.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/powerbi/config.py
@@ -397,6 +397,42 @@ class PowerBiDashboardSourceConfig(
"as this option generates the upstream datasets URN in lowercase.",
)
+ # Enable CLL extraction
+ extract_column_level_lineage: bool = pydantic.Field(
+ default=False,
+ description="Whether to extract column level lineage. "
+ "Works only if configs `native_query_parsing`, `enable_advance_lineage_sql_construct` & `extract_lineage` are enabled. "
+ "Works for M-Query where native SQL is used for transformation.",
+ )
+
+ @root_validator
+ @classmethod
+ def validate_extract_column_level_lineage(cls, values: Dict) -> Dict:
+ flags = [
+ "native_query_parsing",
+ "enable_advance_lineage_sql_construct",
+ "extract_lineage",
+ ]
+
+ if (
+ "extract_column_level_lineage" in values
+ and values["extract_column_level_lineage"] is False
+ ):
+            # Flag is not set; skip validation
+ return values
+
+ logger.debug(f"Validating additional flags: {flags}")
+
+ is_flag_enabled: bool = True
+ for flag in flags:
+ if flag not in values or values[flag] is False:
+ is_flag_enabled = False
+
+ if not is_flag_enabled:
+ raise ValueError(f"Enable all these flags in recipe: {flags} ")
+
+ return values
+
@validator("dataset_type_mapping")
@classmethod
def map_data_platform(cls, value):
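# --- Illustrative usage (not part of the patch) ------------------------------
# extract_column_level_lineage only passes the validator above when
# native_query_parsing, enable_advance_lineage_sql_construct and extract_lineage
# are also enabled. Credential values are placeholders; other settings keep
# their defaults.
from datahub.ingestion.source.powerbi.config import PowerBiDashboardSourceConfig

config = PowerBiDashboardSourceConfig.parse_obj(
    {
        "tenant_id": "00000000-0000-0000-0000-000000000000",  # placeholder
        "client_id": "00000000-0000-0000-0000-000000000000",  # placeholder
        "client_secret": "placeholder-secret",
        "extract_lineage": True,
        "native_query_parsing": True,
        "enable_advance_lineage_sql_construct": True,
        "extract_column_level_lineage": True,
    }
)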
diff --git a/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/native_sql_parser.py b/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/native_sql_parser.py
index 021c429c3c633..0afa8e7ff4564 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/native_sql_parser.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/native_sql_parser.py
@@ -9,7 +9,7 @@
SPECIAL_CHARACTERS = ["#(lf)", "(lf)"]
-logger = logging.getLogger()
+logger = logging.getLogger(__name__)
def remove_special_characters(native_query: str) -> str:
@@ -21,7 +21,7 @@ def remove_special_characters(native_query: str) -> str:
def get_tables(native_query: str) -> List[str]:
native_query = remove_special_characters(native_query)
- logger.debug(f"Processing query = {native_query}")
+ logger.debug(f"Processing native query = {native_query}")
tables: List[str] = []
parsed = sqlparse.parse(native_query)[0]
tokens: List[sqlparse.sql.Token] = list(parsed.tokens)
@@ -65,7 +65,7 @@ def parse_custom_sql(
sql_query = remove_special_characters(query)
- logger.debug(f"Parsing sql={sql_query}")
+ logger.debug(f"Processing native query = {sql_query}")
return sqlglot_l.create_lineage_sql_parsed_result(
query=sql_query,
diff --git a/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/parser.py b/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/parser.py
index 8cc38c366c42a..9134932c39fe0 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/parser.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/parser.py
@@ -56,7 +56,7 @@ def get_upstream_tables(
ctx: PipelineContext,
config: PowerBiDashboardSourceConfig,
parameters: Dict[str, str] = {},
-) -> List[resolver.DataPlatformTable]:
+) -> List[resolver.Lineage]:
if table.expression is None:
logger.debug(f"Expression is none for table {table.full_name}")
return []
diff --git a/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/resolver.py b/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/resolver.py
index 479f1decff903..e200ff41f71c2 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/resolver.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/resolver.py
@@ -27,7 +27,7 @@
IdentifierAccessor,
)
from datahub.ingestion.source.powerbi.rest_api_wrapper.data_classes import Table
-from datahub.utilities.sqlglot_lineage import SqlParsingResult
+from datahub.utilities.sqlglot_lineage import ColumnLineageInfo, SqlParsingResult
logger = logging.getLogger(__name__)
@@ -38,6 +38,16 @@ class DataPlatformTable:
urn: str
+@dataclass
+class Lineage:
+ upstreams: List[DataPlatformTable]
+ column_lineage: List[ColumnLineageInfo]
+
+ @staticmethod
+ def empty() -> "Lineage":
+ return Lineage(upstreams=[], column_lineage=[])
+
+
def urn_to_lowercase(value: str, flag: bool) -> str:
if flag is True:
return value.lower()
@@ -120,9 +130,9 @@ def __init__(
self.platform_instance_resolver = platform_instance_resolver
@abstractmethod
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
pass
@abstractmethod
@@ -147,7 +157,7 @@ def get_db_detail_from_argument(
def parse_custom_sql(
self, query: str, server: str, database: Optional[str], schema: Optional[str]
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
dataplatform_tables: List[DataPlatformTable] = []
@@ -174,7 +184,7 @@ def parse_custom_sql(
if parsed_result is None:
logger.debug("Failed to parse query")
- return dataplatform_tables
+ return Lineage.empty()
for urn in parsed_result.in_tables:
dataplatform_tables.append(
@@ -184,9 +194,15 @@ def parse_custom_sql(
)
)
+ logger.debug(f"Native Query parsed result={parsed_result}")
logger.debug(f"Generated dataplatform_tables={dataplatform_tables}")
- return dataplatform_tables
+ return Lineage(
+ upstreams=dataplatform_tables,
+ column_lineage=parsed_result.column_lineage
+ if parsed_result.column_lineage is not None
+ else [],
+ )
class AbstractDataAccessMQueryResolver(ABC):
@@ -215,7 +231,7 @@ def resolve_to_data_platform_table_list(
ctx: PipelineContext,
config: PowerBiDashboardSourceConfig,
platform_instance_resolver: AbstractDataPlatformInstanceResolver,
- ) -> List[DataPlatformTable]:
+ ) -> List[Lineage]:
pass
@@ -471,8 +487,8 @@ def resolve_to_data_platform_table_list(
ctx: PipelineContext,
config: PowerBiDashboardSourceConfig,
platform_instance_resolver: AbstractDataPlatformInstanceResolver,
- ) -> List[DataPlatformTable]:
- data_platform_tables: List[DataPlatformTable] = []
+ ) -> List[Lineage]:
+ lineage: List[Lineage] = []
# Find out output variable as we are doing backtracking in M-Query
output_variable: Optional[str] = tree_function.get_output_variable(
@@ -484,7 +500,7 @@ def resolve_to_data_platform_table_list(
f"{self.table.full_name}-output-variable",
"output-variable not found in table expression",
)
- return data_platform_tables
+ return lineage
# Parse M-Query and use output_variable as root of tree and create instance of DataAccessFunctionDetail
table_links: List[
@@ -509,7 +525,7 @@ def resolve_to_data_platform_table_list(
# From supported_resolver enum get respective resolver like AmazonRedshift or Snowflake or Oracle or NativeQuery and create instance of it
# & also pass additional information that will be need to generate urn
- table_full_name_creator: AbstractDataPlatformTableCreator = (
+ table_qualified_name_creator: AbstractDataPlatformTableCreator = (
supported_resolver.get_table_full_name_creator()(
ctx=ctx,
config=config,
@@ -517,11 +533,9 @@ def resolve_to_data_platform_table_list(
)
)
- data_platform_tables.extend(
- table_full_name_creator.create_dataplatform_tables(f_detail)
- )
+ lineage.append(table_qualified_name_creator.create_lineage(f_detail))
- return data_platform_tables
+ return lineage
class DefaultTwoStepDataAccessSources(AbstractDataPlatformTableCreator, ABC):
@@ -536,7 +550,7 @@ class DefaultTwoStepDataAccessSources(AbstractDataPlatformTableCreator, ABC):
def two_level_access_pattern(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
logger.debug(
f"Processing {self.get_platform_pair().powerbi_data_platform_name} data-access function detail {data_access_func_detail}"
)
@@ -545,7 +559,7 @@ def two_level_access_pattern(
data_access_func_detail.arg_list
)
if server is None or db_name is None:
- return [] # Return empty list
+ return Lineage.empty() # Return empty list
schema_name: str = cast(
IdentifierAccessor, data_access_func_detail.identifier_accessor
@@ -568,19 +582,21 @@ def two_level_access_pattern(
server=server,
qualified_table_name=qualified_table_name,
)
-
- return [
- DataPlatformTable(
- data_platform_pair=self.get_platform_pair(),
- urn=urn,
- )
- ]
+ return Lineage(
+ upstreams=[
+ DataPlatformTable(
+ data_platform_pair=self.get_platform_pair(),
+ urn=urn,
+ )
+ ],
+ column_lineage=[],
+ )
class PostgresDataPlatformTableCreator(DefaultTwoStepDataAccessSources):
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
return self.two_level_access_pattern(data_access_func_detail)
def get_platform_pair(self) -> DataPlatformPair:
@@ -630,10 +646,10 @@ def create_urn_using_old_parser(
return dataplatform_tables
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
- dataplatform_tables: List[DataPlatformTable] = []
+ ) -> Lineage:
+
arguments: List[str] = tree_function.strip_char_from_list(
values=tree_function.remove_whitespaces_from_list(
tree_function.token_values(data_access_func_detail.arg_list)
@@ -647,14 +663,17 @@ def create_dataplatform_tables(
if len(arguments) >= 4 and arguments[2] != "Query":
logger.debug("Unsupported case is found. Second index is not the Query")
- return dataplatform_tables
+ return Lineage.empty()
if self.config.enable_advance_lineage_sql_construct is False:
# Use previous parser to generate URN to keep backward compatibility
- return self.create_urn_using_old_parser(
- query=arguments[3],
- db_name=arguments[1],
- server=arguments[0],
+ return Lineage(
+ upstreams=self.create_urn_using_old_parser(
+ query=arguments[3],
+ db_name=arguments[1],
+ server=arguments[0],
+ ),
+ column_lineage=[],
)
return self.parse_custom_sql(
@@ -684,9 +703,9 @@ def _get_server_and_db_name(value: str) -> Tuple[Optional[str], Optional[str]]:
return tree_function.strip_char_from_list([splitter_result[0]])[0], db_name
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
logger.debug(
f"Processing Oracle data-access function detail {data_access_func_detail}"
)
@@ -698,7 +717,7 @@ def create_dataplatform_tables(
server, db_name = self._get_server_and_db_name(arguments[0])
if db_name is None or server is None:
- return []
+ return Lineage.empty()
schema_name: str = cast(
IdentifierAccessor, data_access_func_detail.identifier_accessor
@@ -719,18 +738,21 @@ def create_dataplatform_tables(
qualified_table_name=qualified_table_name,
)
- return [
- DataPlatformTable(
- data_platform_pair=self.get_platform_pair(),
- urn=urn,
- )
- ]
+ return Lineage(
+ upstreams=[
+ DataPlatformTable(
+ data_platform_pair=self.get_platform_pair(),
+ urn=urn,
+ )
+ ],
+ column_lineage=[],
+ )
class DatabrickDataPlatformTableCreator(AbstractDataPlatformTableCreator):
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
logger.debug(
f"Processing Databrick data-access function detail {data_access_func_detail}"
)
@@ -749,7 +771,7 @@ def create_dataplatform_tables(
logger.debug(
"expecting instance to be IdentifierAccessor, please check if parsing is done properly"
)
- return []
+ return Lineage.empty()
db_name: str = value_dict["Database"]
schema_name: str = value_dict["Schema"]
@@ -762,7 +784,7 @@ def create_dataplatform_tables(
logger.info(
f"server information is not available for {qualified_table_name}. Skipping upstream table"
)
- return []
+ return Lineage.empty()
urn = urn_creator(
config=self.config,
@@ -772,12 +794,15 @@ def create_dataplatform_tables(
qualified_table_name=qualified_table_name,
)
- return [
- DataPlatformTable(
- data_platform_pair=self.get_platform_pair(),
- urn=urn,
- )
- ]
+ return Lineage(
+ upstreams=[
+ DataPlatformTable(
+ data_platform_pair=self.get_platform_pair(),
+ urn=urn,
+ )
+ ],
+ column_lineage=[],
+ )
def get_platform_pair(self) -> DataPlatformPair:
return SupportedDataPlatform.DATABRICK_SQL.value
@@ -789,9 +814,9 @@ def get_datasource_server(
) -> str:
return tree_function.strip_char_from_list([arguments[0]])[0]
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
logger.debug(
f"Processing {self.get_platform_pair().datahub_data_platform_name} function detail {data_access_func_detail}"
)
@@ -826,12 +851,15 @@ def create_dataplatform_tables(
qualified_table_name=qualified_table_name,
)
- return [
- DataPlatformTable(
- data_platform_pair=self.get_platform_pair(),
- urn=urn,
- )
- ]
+ return Lineage(
+ upstreams=[
+ DataPlatformTable(
+ data_platform_pair=self.get_platform_pair(),
+ urn=urn,
+ )
+ ],
+ column_lineage=[],
+ )
class SnowflakeDataPlatformTableCreator(DefaultThreeStepDataAccessSources):
@@ -859,9 +887,9 @@ class AmazonRedshiftDataPlatformTableCreator(AbstractDataPlatformTableCreator):
def get_platform_pair(self) -> DataPlatformPair:
return SupportedDataPlatform.AMAZON_REDSHIFT.value
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
+ ) -> Lineage:
logger.debug(
f"Processing AmazonRedshift data-access function detail {data_access_func_detail}"
)
@@ -870,7 +898,7 @@ def create_dataplatform_tables(
data_access_func_detail.arg_list
)
if db_name is None or server is None:
- return [] # Return empty list
+ return Lineage.empty() # Return empty list
schema_name: str = cast(
IdentifierAccessor, data_access_func_detail.identifier_accessor
@@ -891,12 +919,15 @@ def create_dataplatform_tables(
qualified_table_name=qualified_table_name,
)
- return [
- DataPlatformTable(
- data_platform_pair=self.get_platform_pair(),
- urn=urn,
- )
- ]
+ return Lineage(
+ upstreams=[
+ DataPlatformTable(
+ data_platform_pair=self.get_platform_pair(),
+ urn=urn,
+ )
+ ],
+ column_lineage=[],
+ )
class NativeQueryDataPlatformTableCreator(AbstractDataPlatformTableCreator):
@@ -916,9 +947,7 @@ def is_native_parsing_supported(data_access_function_name: str) -> bool:
in NativeQueryDataPlatformTableCreator.SUPPORTED_NATIVE_QUERY_DATA_PLATFORM
)
- def create_urn_using_old_parser(
- self, query: str, server: str
- ) -> List[DataPlatformTable]:
+ def create_urn_using_old_parser(self, query: str, server: str) -> Lineage:
dataplatform_tables: List[DataPlatformTable] = []
tables: List[str] = native_sql_parser.get_tables(query)
@@ -947,12 +976,14 @@ def create_urn_using_old_parser(
logger.debug(f"Generated dataplatform_tables {dataplatform_tables}")
- return dataplatform_tables
+ return Lineage(
+ upstreams=dataplatform_tables,
+ column_lineage=[],
+ )
- def create_dataplatform_tables(
+ def create_lineage(
self, data_access_func_detail: DataAccessFunctionDetail
- ) -> List[DataPlatformTable]:
- dataplatform_tables: List[DataPlatformTable] = []
+ ) -> Lineage:
t1: Tree = cast(
Tree, tree_function.first_arg_list_func(data_access_func_detail.arg_list)
)
@@ -963,7 +994,7 @@ def create_dataplatform_tables(
f"Expecting 2 argument, actual argument count is {len(flat_argument_list)}"
)
logger.debug(f"Flat argument list = {flat_argument_list}")
- return dataplatform_tables
+ return Lineage.empty()
data_access_tokens: List[str] = tree_function.remove_whitespaces_from_list(
tree_function.token_values(flat_argument_list[0])
)
@@ -981,7 +1012,7 @@ def create_dataplatform_tables(
f"Server is not available in argument list for data-platform {data_access_tokens[0]}. Returning empty "
"list"
)
- return dataplatform_tables
+ return Lineage.empty()
self.current_data_platform = self.SUPPORTED_NATIVE_QUERY_DATA_PLATFORM[
data_access_tokens[0]
diff --git a/metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py b/metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py
index 5d477ee090e7e..52bcef66658c8 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py
@@ -44,6 +44,11 @@
StatefulIngestionSourceBase,
)
from datahub.metadata.com.linkedin.pegasus2avro.common import ChangeAuditStamps
+from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
+ FineGrainedLineage,
+ FineGrainedLineageDownstreamType,
+ FineGrainedLineageUpstreamType,
+)
from datahub.metadata.schema_classes import (
BrowsePathsClass,
ChangeTypeClass,
@@ -71,6 +76,7 @@
ViewPropertiesClass,
)
from datahub.utilities.dedup_list import deduplicate_list
+from datahub.utilities.sqlglot_lineage import ColumnLineageInfo
# Logger instance
logger = logging.getLogger(__name__)
@@ -165,6 +171,48 @@ def extract_dataset_schema(
)
return [schema_mcp]
+ def make_fine_grained_lineage_class(
+ self, lineage: resolver.Lineage, dataset_urn: str
+ ) -> List[FineGrainedLineage]:
+ fine_grained_lineages: List[FineGrainedLineage] = []
+
+ if (
+ self.__config.extract_column_level_lineage is False
+ or self.__config.extract_lineage is False
+ ):
+ return fine_grained_lineages
+
+ if lineage is None:
+ return fine_grained_lineages
+
+ logger.info("Extracting column level lineage")
+
+ cll: List[ColumnLineageInfo] = lineage.column_lineage
+
+ for cll_info in cll:
+ downstream = (
+ [builder.make_schema_field_urn(dataset_urn, cll_info.downstream.column)]
+ if cll_info.downstream is not None
+ and cll_info.downstream.column is not None
+ else []
+ )
+
+ upstreams = [
+ builder.make_schema_field_urn(column_ref.table, column_ref.column)
+ for column_ref in cll_info.upstreams
+ ]
+
+ fine_grained_lineages.append(
+ FineGrainedLineage(
+ downstreamType=FineGrainedLineageDownstreamType.FIELD,
+ downstreams=downstream,
+ upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
+ upstreams=upstreams,
+ )
+ )
+
+ return fine_grained_lineages
+
def extract_lineage(
self, table: powerbi_data_classes.Table, ds_urn: str
) -> List[MetadataChangeProposalWrapper]:
@@ -174,8 +222,9 @@ def extract_lineage(
parameters = table.dataset.parameters if table.dataset else {}
upstream: List[UpstreamClass] = []
+ cll_lineage: List[FineGrainedLineage] = []
- upstream_dpts: List[resolver.DataPlatformTable] = parser.get_upstream_tables(
+ upstream_lineage: List[resolver.Lineage] = parser.get_upstream_tables(
table=table,
reporter=self.__reporter,
platform_instance_resolver=self.__dataplatform_instance_resolver,
@@ -185,34 +234,49 @@ def extract_lineage(
)
logger.debug(
- f"PowerBI virtual table {table.full_name} and it's upstream dataplatform tables = {upstream_dpts}"
+            f"PowerBI virtual table {table.full_name} and its upstream dataplatform tables = {upstream_lineage}"
)
- for upstream_dpt in upstream_dpts:
- if (
- upstream_dpt.data_platform_pair.powerbi_data_platform_name
- not in self.__config.dataset_type_mapping.keys()
- ):
- logger.debug(
- f"Skipping upstream table for {ds_urn}. The platform {upstream_dpt.data_platform_pair.powerbi_data_platform_name} is not part of dataset_type_mapping",
+ for lineage in upstream_lineage:
+ for upstream_dpt in lineage.upstreams:
+ if (
+ upstream_dpt.data_platform_pair.powerbi_data_platform_name
+ not in self.__config.dataset_type_mapping.keys()
+ ):
+ logger.debug(
+ f"Skipping upstream table for {ds_urn}. The platform {upstream_dpt.data_platform_pair.powerbi_data_platform_name} is not part of dataset_type_mapping",
+ )
+ continue
+
+ upstream_table_class = UpstreamClass(
+ upstream_dpt.urn,
+ DatasetLineageTypeClass.TRANSFORMED,
)
- continue
- upstream_table_class = UpstreamClass(
- upstream_dpt.urn,
- DatasetLineageTypeClass.TRANSFORMED,
- )
+ upstream.append(upstream_table_class)
- upstream.append(upstream_table_class)
+ # Add column level lineage if any
+ cll_lineage.extend(
+ self.make_fine_grained_lineage_class(
+ lineage=lineage,
+ dataset_urn=ds_urn,
+ )
+ )
if len(upstream) > 0:
- upstream_lineage = UpstreamLineageClass(upstreams=upstream)
+
+ upstream_lineage_class: UpstreamLineageClass = UpstreamLineageClass(
+ upstreams=upstream,
+ fineGrainedLineages=cll_lineage or None,
+ )
+
logger.debug(f"Dataset urn = {ds_urn} and its lineage = {upstream_lineage}")
+
mcp = MetadataChangeProposalWrapper(
entityType=Constant.DATASET,
changeType=ChangeTypeClass.UPSERT,
entityUrn=ds_urn,
- aspect=upstream_lineage,
+ aspect=upstream_lineage_class,
)
mcps.append(mcp)
@@ -1075,6 +1139,10 @@ def report_to_datahub_work_units(
SourceCapability.OWNERSHIP,
"Disabled by default, configured using `extract_ownership`",
)
+@capability(
+ SourceCapability.LINEAGE_FINE,
+    "Disabled by default, configured using `extract_column_level_lineage`",
+)
class PowerBiDashboardSource(StatefulIngestionSourceBase):
"""
This plugin extracts the following:
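A minimal sketch of the `FineGrainedLineage` entries that the new `make_fine_grained_lineage_class` helper assembles, assuming `builder` is `datahub.emitter.mce_builder` (as in this module) and using placeholder dataset and column names:

```python
# Sketch only: placeholder urns, mirroring the construction in
# make_fine_grained_lineage_class above.
import datahub.emitter.mce_builder as builder
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    FineGrainedLineage,
    FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamType,
)

upstream_urn = builder.make_dataset_urn("snowflake", "db.schema.src_table")
downstream_urn = builder.make_dataset_urn("powerbi", "library-dataset.some_table")

cll_edge = FineGrainedLineage(
    # A single downstream PowerBI column ...
    downstreamType=FineGrainedLineageDownstreamType.FIELD,
    downstreams=[builder.make_schema_field_urn(downstream_urn, "agent_key")],
    # ... derived from a set of upstream Snowflake columns.
    upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
    upstreams=[
        builder.make_schema_field_urn(upstream_urn, "seller"),
        builder.make_schema_field_urn(upstream_urn, "monthid"),
    ],
)
```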
diff --git a/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py b/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py
index 95f6444384408..032bdef178fdf 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py
@@ -101,8 +101,8 @@ class SnowflakeV2Config(
)
include_view_column_lineage: bool = Field(
- default=False,
- description="Populates view->view and table->view column lineage.",
+ default=True,
+ description="Populates view->view and table->view column lineage using DataHub's sql parser.",
)
_check_role_grants_removed = pydantic_removed_field("check_role_grants")
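With `include_view_column_lineage` now defaulting to `True`, a Snowflake recipe no longer needs to enable it explicitly. A hedged sketch of the relevant config fragment, written as a Python dict; every key except `include_view_column_lineage` is a placeholder assumption about a typical recipe:

```python
# Sketch only: placeholder account and credentials.
snowflake_source_config = {
    "account_id": "my_snowflake_account",  # assumed placeholder
    "username": "datahub_reader",          # assumed placeholder
    "password": "replace-me",              # assumed placeholder
    # View -> view and table -> view column lineage is now on by default;
    # uncomment the next line only to opt out:
    # "include_view_column_lineage": False,
}
```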
diff --git a/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py b/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py
index 240e0ffa1a0b6..215116b4c33fb 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py
@@ -301,14 +301,11 @@ def __init__(self, ctx: PipelineContext, config: SnowflakeV2Config):
# Caches tables for a single database. Consider moving to disk or S3 when possible.
self.db_tables: Dict[str, List[SnowflakeTable]] = {}
- self.sql_parser_schema_resolver = SchemaResolver(
- platform=self.platform,
- platform_instance=self.config.platform_instance,
- env=self.config.env,
- )
self.view_definitions: FileBackedDict[str] = FileBackedDict()
self.add_config_to_report()
+ self.sql_parser_schema_resolver = self._init_schema_resolver()
+
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Source":
config = SnowflakeV2Config.parse_obj(config_dict)
@@ -493,6 +490,24 @@ def query(query):
return _report
+ def _init_schema_resolver(self) -> SchemaResolver:
+ if not self.config.include_technical_schema and self.config.parse_view_ddl:
+ if self.ctx.graph:
+ return self.ctx.graph.initialize_schema_resolver_from_datahub(
+ platform=self.platform,
+ platform_instance=self.config.platform_instance,
+ env=self.config.env,
+ )
+ else:
+ logger.warning(
+ "Failed to load schema info from DataHub as DataHubGraph is missing.",
+ )
+ return SchemaResolver(
+ platform=self.platform,
+ platform_instance=self.config.platform_instance,
+ env=self.config.env,
+ )
+
def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
return [
*super().get_workunit_processors(),
@@ -764,7 +779,7 @@ def _process_schema(
)
self.db_tables[schema_name] = tables
- if self.config.include_technical_schema or self.config.parse_view_ddl:
+ if self.config.include_technical_schema:
for table in tables:
yield from self._process_table(table, schema_name, db_name)
@@ -776,7 +791,7 @@ def _process_schema(
if view.view_definition:
self.view_definitions[key] = view.view_definition
- if self.config.include_technical_schema or self.config.parse_view_ddl:
+ if self.config.include_technical_schema:
for view in views:
yield from self._process_view(view, schema_name, db_name)
@@ -892,8 +907,6 @@ def _process_table(
yield from self._process_tag(tag)
yield from self.gen_dataset_workunits(table, schema_name, db_name)
- elif self.config.parse_view_ddl:
- self.gen_schema_metadata(table, schema_name, db_name)
def fetch_sample_data_for_classification(
self, table: SnowflakeTable, schema_name: str, db_name: str, dataset_name: str
@@ -1004,8 +1017,6 @@ def _process_view(
yield from self._process_tag(tag)
yield from self.gen_dataset_workunits(view, schema_name, db_name)
- elif self.config.parse_view_ddl:
- self.gen_schema_metadata(view, schema_name, db_name)
def _process_tag(self, tag: SnowflakeTag) -> Iterable[MetadataWorkUnit]:
tag_identifier = tag.identifier()
diff --git a/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py b/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py
index 112defe76d957..056be6c2e50ac 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py
@@ -1,12 +1,10 @@
import datetime
import logging
import traceback
-from collections import OrderedDict
from dataclasses import dataclass, field
from typing import (
TYPE_CHECKING,
Any,
- Callable,
Dict,
Iterable,
List,
@@ -103,52 +101,6 @@
MISSING_COLUMN_INFO = "missing column information"
-def _platform_alchemy_uri_tester_gen(
- platform: str, opt_starts_with: Optional[str] = None
-) -> Tuple[str, Callable[[str], bool]]:
- return platform, lambda x: x.startswith(
- platform if not opt_starts_with else opt_starts_with
- )
-
-
-PLATFORM_TO_SQLALCHEMY_URI_TESTER_MAP: Dict[str, Callable[[str], bool]] = OrderedDict(
- [
- _platform_alchemy_uri_tester_gen("athena", "awsathena"),
- _platform_alchemy_uri_tester_gen("bigquery"),
- _platform_alchemy_uri_tester_gen("clickhouse"),
- _platform_alchemy_uri_tester_gen("druid"),
- _platform_alchemy_uri_tester_gen("hana"),
- _platform_alchemy_uri_tester_gen("hive"),
- _platform_alchemy_uri_tester_gen("mongodb"),
- _platform_alchemy_uri_tester_gen("mssql"),
- _platform_alchemy_uri_tester_gen("mysql"),
- _platform_alchemy_uri_tester_gen("oracle"),
- _platform_alchemy_uri_tester_gen("pinot"),
- _platform_alchemy_uri_tester_gen("presto"),
- (
- "redshift",
- lambda x: (
- x.startswith(("jdbc:postgres:", "postgresql"))
- and x.find("redshift.amazonaws") > 0
- )
- or x.startswith("redshift"),
- ),
- # Don't move this before redshift.
- _platform_alchemy_uri_tester_gen("postgres", "postgresql"),
- _platform_alchemy_uri_tester_gen("snowflake"),
- _platform_alchemy_uri_tester_gen("trino"),
- _platform_alchemy_uri_tester_gen("vertica"),
- ]
-)
-
-
-def get_platform_from_sqlalchemy_uri(sqlalchemy_uri: str) -> str:
- for platform, tester in PLATFORM_TO_SQLALCHEMY_URI_TESTER_MAP.items():
- if tester(sqlalchemy_uri):
- return platform
- return "external"
-
-
@dataclass
class SQLSourceReport(StaleEntityRemovalSourceReport):
tables_scanned: int = 0
diff --git a/metadata-ingestion/src/datahub/ingestion/source/sql/sqlalchemy_uri_mapper.py b/metadata-ingestion/src/datahub/ingestion/source/sql/sqlalchemy_uri_mapper.py
new file mode 100644
index 0000000000000..b6a463837228d
--- /dev/null
+++ b/metadata-ingestion/src/datahub/ingestion/source/sql/sqlalchemy_uri_mapper.py
@@ -0,0 +1,47 @@
+from collections import OrderedDict
+from typing import Callable, Dict, Optional, Tuple
+
+
+def _platform_alchemy_uri_tester_gen(
+ platform: str, opt_starts_with: Optional[str] = None
+) -> Tuple[str, Callable[[str], bool]]:
+ return platform, lambda x: x.startswith(opt_starts_with or platform)
+
+
+PLATFORM_TO_SQLALCHEMY_URI_TESTER_MAP: Dict[str, Callable[[str], bool]] = OrderedDict(
+ [
+ _platform_alchemy_uri_tester_gen("athena", "awsathena"),
+ _platform_alchemy_uri_tester_gen("bigquery"),
+ _platform_alchemy_uri_tester_gen("clickhouse"),
+ _platform_alchemy_uri_tester_gen("druid"),
+ _platform_alchemy_uri_tester_gen("hana"),
+ _platform_alchemy_uri_tester_gen("hive"),
+ _platform_alchemy_uri_tester_gen("mongodb"),
+ _platform_alchemy_uri_tester_gen("mssql"),
+ _platform_alchemy_uri_tester_gen("mysql"),
+ _platform_alchemy_uri_tester_gen("oracle"),
+ _platform_alchemy_uri_tester_gen("pinot"),
+ _platform_alchemy_uri_tester_gen("presto"),
+ (
+ "redshift",
+ lambda x: (
+ x.startswith(("jdbc:postgres:", "postgresql"))
+ and x.find("redshift.amazonaws") > 0
+ )
+ or x.startswith("redshift"),
+ ),
+ # Don't move this before redshift.
+ _platform_alchemy_uri_tester_gen("postgres", "postgresql"),
+ _platform_alchemy_uri_tester_gen("snowflake"),
+ _platform_alchemy_uri_tester_gen("sqlite"),
+ _platform_alchemy_uri_tester_gen("trino"),
+ _platform_alchemy_uri_tester_gen("vertica"),
+ ]
+)
+
+
+def get_platform_from_sqlalchemy_uri(sqlalchemy_uri: str) -> str:
+ for platform, tester in PLATFORM_TO_SQLALCHEMY_URI_TESTER_MAP.items():
+ if tester(sqlalchemy_uri):
+ return platform
+ return "external"
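A few sanity checks for the relocated helper; the URIs are made-up examples, and the expected platforms follow directly from the tester table above:

```python
# Illustrative usage of get_platform_from_sqlalchemy_uri; URIs are made-up examples.
from datahub.ingestion.source.sql.sqlalchemy_uri_mapper import (
    get_platform_from_sqlalchemy_uri,
)

assert get_platform_from_sqlalchemy_uri("awsathena+rest://@athena.us-east-1.amazonaws.com") == "athena"
assert get_platform_from_sqlalchemy_uri("postgresql://user@host:5432/db") == "postgres"
assert (
    get_platform_from_sqlalchemy_uri(
        "postgresql://user@my-cluster.redshift.amazonaws.com:5439/db"
    )
    == "redshift"
)
assert get_platform_from_sqlalchemy_uri("sqlite:///tmp/example.db") == "sqlite"
assert get_platform_from_sqlalchemy_uri("unknown-scheme://whatever") == "external"
```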
diff --git a/metadata-ingestion/src/datahub/ingestion/source/sql_queries.py b/metadata-ingestion/src/datahub/ingestion/source/sql_queries.py
index 2fcc93292c2ef..bce4d1ec76e6e 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/sql_queries.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/sql_queries.py
@@ -103,13 +103,12 @@ def __init__(self, ctx: PipelineContext, config: SqlQueriesSourceConfig):
self.builder = SqlParsingBuilder(usage_config=self.config.usage)
if self.config.use_schema_resolver:
- schema_resolver, urns = self.graph.initialize_schema_resolver_from_datahub(
+ self.schema_resolver = self.graph.initialize_schema_resolver_from_datahub(
platform=self.config.platform,
platform_instance=self.config.platform_instance,
env=self.config.env,
)
- self.schema_resolver = schema_resolver
- self.urns = urns
+ self.urns = self.schema_resolver.get_urns()
else:
self.schema_resolver = self.graph._make_schema_resolver(
platform=self.config.platform,
diff --git a/metadata-ingestion/src/datahub/ingestion/source/superset.py b/metadata-ingestion/src/datahub/ingestion/source/superset.py
index 2a4563439b6ba..14bc4242d2a91 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/superset.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/superset.py
@@ -21,7 +21,9 @@
)
from datahub.ingestion.api.source import MetadataWorkUnitProcessor, Source
from datahub.ingestion.api.workunit import MetadataWorkUnit
-from datahub.ingestion.source.sql import sql_common
+from datahub.ingestion.source.sql.sqlalchemy_uri_mapper import (
+ get_platform_from_sqlalchemy_uri,
+)
from datahub.ingestion.source.state.stale_entity_removal_handler import (
StaleEntityRemovalHandler,
StaleEntityRemovalSourceReport,
@@ -202,7 +204,7 @@ def get_platform_from_database_id(self, database_id):
sqlalchemy_uri = database_response.get("result", {}).get("sqlalchemy_uri")
if sqlalchemy_uri is None:
return database_response.get("result", {}).get("backend", "external")
- return sql_common.get_platform_from_sqlalchemy_uri(sqlalchemy_uri)
+ return get_platform_from_sqlalchemy_uri(sqlalchemy_uri)
@lru_cache(maxsize=None)
def get_datasource_urn_from_id(self, datasource_id):
diff --git a/metadata-ingestion/src/datahub/ingestion/source/tableau.py b/metadata-ingestion/src/datahub/ingestion/source/tableau.py
index 4cc00a66116e9..6214cba342622 100644
--- a/metadata-ingestion/src/datahub/ingestion/source/tableau.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/tableau.py
@@ -1179,8 +1179,6 @@ def get_upstream_fields_of_field_in_datasource(
def get_upstream_fields_from_custom_sql(
self, datasource: dict, datasource_urn: str
) -> List[FineGrainedLineage]:
- fine_grained_lineages: List[FineGrainedLineage] = []
-
parsed_result = self.parse_custom_sql(
datasource=datasource,
datasource_urn=datasource_urn,
@@ -1194,13 +1192,20 @@ def get_upstream_fields_from_custom_sql(
logger.info(
f"Failed to extract column level lineage from datasource {datasource_urn}"
)
- return fine_grained_lineages
+ return []
+ if parsed_result.debug_info.error:
+ logger.info(
+ f"Failed to extract column level lineage from datasource {datasource_urn}: {parsed_result.debug_info.error}"
+ )
+ return []
cll: List[ColumnLineageInfo] = (
parsed_result.column_lineage
if parsed_result.column_lineage is not None
else []
)
+
+ fine_grained_lineages: List[FineGrainedLineage] = []
for cll_info in cll:
downstream = (
[
diff --git a/metadata-ingestion/src/datahub/ingestion/transformer/extract_ownership_from_tags.py b/metadata-ingestion/src/datahub/ingestion/transformer/extract_ownership_from_tags.py
new file mode 100644
index 0000000000000..64f70988ea3a7
--- /dev/null
+++ b/metadata-ingestion/src/datahub/ingestion/transformer/extract_ownership_from_tags.py
@@ -0,0 +1,91 @@
+import re
+from functools import lru_cache
+from typing import List, Optional, cast
+
+from datahub.configuration.common import TransformerSemanticsConfigModel
+from datahub.emitter.mce_builder import Aspect
+from datahub.ingestion.api.common import PipelineContext
+from datahub.ingestion.transformer.dataset_transformer import DatasetTagsTransformer
+from datahub.metadata.schema_classes import (
+ GlobalTagsClass,
+ OwnerClass,
+ OwnershipClass,
+ OwnershipTypeClass,
+)
+from datahub.utilities.urns.corp_group_urn import CorpGroupUrn
+from datahub.utilities.urns.corpuser_urn import CorpuserUrn
+from datahub.utilities.urns.tag_urn import TagUrn
+
+
+class ExtractOwnersFromTagsConfig(TransformerSemanticsConfigModel):
+ tag_prefix: str
+ is_user: bool = True
+ email_domain: Optional[str] = None
+ owner_type: str = "TECHNICAL_OWNER"
+ owner_type_urn: Optional[str] = None
+
+
+@lru_cache(maxsize=10)
+def get_owner_type(owner_type_str: str) -> str:
+ for item in dir(OwnershipTypeClass):
+ if str(item) == owner_type_str:
+ return item
+ return OwnershipTypeClass.CUSTOM
+
+
+class ExtractOwnersFromTagsTransformer(DatasetTagsTransformer):
+    """Transformer that can be used to extract ownership from entity tags (currently does not support column-level tags)"""
+
+ ctx: PipelineContext
+ config: ExtractOwnersFromTagsConfig
+
+ def __init__(self, config: ExtractOwnersFromTagsConfig, ctx: PipelineContext):
+ super().__init__()
+ self.ctx = ctx
+ self.config = config
+
+ @classmethod
+ def create(
+ cls, config_dict: dict, ctx: PipelineContext
+ ) -> "ExtractOwnersFromTagsTransformer":
+ config = ExtractOwnersFromTagsConfig.parse_obj(config_dict)
+ return cls(config, ctx)
+
+ def get_owner_urn(self, owner_str: str) -> str:
+ if self.config.email_domain is not None:
+ return owner_str + "@" + self.config.email_domain
+ return owner_str
+
+ def transform_aspect(
+ self, entity_urn: str, aspect_name: str, aspect: Optional[Aspect]
+ ) -> Optional[Aspect]:
+ in_tags_aspect: Optional[GlobalTagsClass] = cast(GlobalTagsClass, aspect)
+ if in_tags_aspect is None:
+ return None
+ tags = in_tags_aspect.tags
+ owners: List[OwnerClass] = []
+ for tag_class in tags:
+ tag_urn = TagUrn.create_from_string(tag_class.tag)
+ tag_str = tag_urn.get_entity_id()[0]
+ re_match = re.search(self.config.tag_prefix, tag_str)
+ if re_match:
+ owner_str = tag_str[re_match.end() :].strip()
+ owner_urn_str = self.get_owner_urn(owner_str)
+ if self.config.is_user:
+ owner_urn = str(CorpuserUrn.create_from_id(owner_urn_str))
+ else:
+ owner_urn = str(CorpGroupUrn.create_from_id(owner_urn_str))
+ owner_type = get_owner_type(self.config.owner_type)
+ if owner_type == OwnershipTypeClass.CUSTOM:
+ assert (
+ self.config.owner_type_urn is not None
+ ), "owner_type_urn must be set if owner_type is CUSTOM"
+ owner = OwnerClass(
+ owner=owner_urn,
+ type=owner_type,
+ typeUrn=self.config.owner_type_urn,
+ )
+ owners.append(owner)
+
+ owner_aspect = OwnershipClass(owners=owners)
+ return cast(Aspect, owner_aspect)
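To make the tag-to-owner derivation concrete, here is a small sketch that mirrors the core of `transform_aspect` for a single tag; the tag name and prefix are made-up examples:

```python
# Sketch only: made-up tag name and prefix, mirroring transform_aspect above.
import re

from datahub.utilities.urns.corpuser_urn import CorpuserUrn
from datahub.utilities.urns.tag_urn import TagUrn

tag_prefix = "data_owner_"  # config.tag_prefix
tag_urn = TagUrn.create_from_string("urn:li:tag:data_owner_alice")

tag_str = tag_urn.get_entity_id()[0]          # "data_owner_alice"
re_match = re.search(tag_prefix, tag_str)
assert re_match is not None

owner_str = tag_str[re_match.end():].strip()  # "alice"
# With email_domain set (e.g. "example.com"), get_owner_urn would append
# "@example.com" before building the corpuser urn.
owner_urn = str(CorpuserUrn.create_from_id(owner_str))  # "urn:li:corpuser:alice"
```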
diff --git a/metadata-ingestion/src/datahub/integrations/great_expectations/action.py b/metadata-ingestion/src/datahub/integrations/great_expectations/action.py
index eabf62a4cda2b..f116550328819 100644
--- a/metadata-ingestion/src/datahub/integrations/great_expectations/action.py
+++ b/metadata-ingestion/src/datahub/integrations/great_expectations/action.py
@@ -35,7 +35,9 @@
from datahub.cli.cli_utils import get_boolean_env_variable
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
-from datahub.ingestion.source.sql.sql_common import get_platform_from_sqlalchemy_uri
+from datahub.ingestion.source.sql.sqlalchemy_uri_mapper import (
+ get_platform_from_sqlalchemy_uri,
+)
from datahub.metadata.com.linkedin.pegasus2avro.assertion import (
AssertionInfo,
AssertionResult,
diff --git a/metadata-ingestion/src/datahub/testing/compare_metadata_json.py b/metadata-ingestion/src/datahub/testing/compare_metadata_json.py
index 5c52e1ab4f0b3..54f6a6e984c00 100644
--- a/metadata-ingestion/src/datahub/testing/compare_metadata_json.py
+++ b/metadata-ingestion/src/datahub/testing/compare_metadata_json.py
@@ -40,6 +40,7 @@ def assert_metadata_files_equal(
update_golden: bool,
copy_output: bool,
ignore_paths: Sequence[str] = (),
+ ignore_order: bool = True,
) -> None:
golden_exists = os.path.isfile(golden_path)
@@ -65,7 +66,7 @@ def assert_metadata_files_equal(
write_metadata_file(pathlib.Path(temp.name), golden_metadata)
golden = load_json_file(temp.name)
- diff = diff_metadata_json(output, golden, ignore_paths)
+ diff = diff_metadata_json(output, golden, ignore_paths, ignore_order=ignore_order)
if diff and update_golden:
if isinstance(diff, MCPDiff):
diff.apply_delta(golden)
@@ -91,16 +92,19 @@ def diff_metadata_json(
output: MetadataJson,
golden: MetadataJson,
ignore_paths: Sequence[str] = (),
+ ignore_order: bool = True,
) -> Union[DeepDiff, MCPDiff]:
ignore_paths = (*ignore_paths, *default_exclude_paths, r"root\[\d+].delta_info")
try:
- golden_map = get_aspects_by_urn(golden)
- output_map = get_aspects_by_urn(output)
- return MCPDiff.create(
- golden=golden_map,
- output=output_map,
- ignore_paths=ignore_paths,
- )
+ if ignore_order:
+ golden_map = get_aspects_by_urn(golden)
+ output_map = get_aspects_by_urn(output)
+ return MCPDiff.create(
+ golden=golden_map,
+ output=output_map,
+ ignore_paths=ignore_paths,
+ )
+ # if ignore_order is False, always use DeepDiff
except CannotCompareMCPs as e:
logger.info(f"{e}, falling back to MCE diff")
except AssertionError as e:
@@ -111,5 +115,5 @@ def diff_metadata_json(
golden,
output,
exclude_regex_paths=ignore_paths,
- ignore_order=True,
+ ignore_order=ignore_order,
)
diff --git a/metadata-ingestion/src/datahub/utilities/sqlglot_lineage.py b/metadata-ingestion/src/datahub/utilities/sqlglot_lineage.py
index f18235af3d1fd..81c43884fdf7d 100644
--- a/metadata-ingestion/src/datahub/utilities/sqlglot_lineage.py
+++ b/metadata-ingestion/src/datahub/utilities/sqlglot_lineage.py
@@ -231,6 +231,13 @@ def _table_level_lineage(
# In some cases like "MERGE ... then INSERT (col1, col2) VALUES (col1, col2)",
# the `this` on the INSERT part isn't a table.
if isinstance(expr.this, sqlglot.exp.Table)
+ } | {
+ # For CREATE DDL statements, the table name is nested inside
+ # a Schema object.
+ _TableName.from_sqlglot_table(expr.this.this)
+ for expr in statement.find_all(sqlglot.exp.Create)
+ if isinstance(expr.this, sqlglot.exp.Schema)
+ and isinstance(expr.this.this, sqlglot.exp.Table)
}
tables = (
@@ -242,7 +249,7 @@ def _table_level_lineage(
- modified
# ignore CTEs created in this statement
- {
- _TableName(database=None, schema=None, table=cte.alias_or_name)
+ _TableName(database=None, db_schema=None, table=cte.alias_or_name)
for cte in statement.find_all(sqlglot.exp.CTE)
}
)
@@ -276,6 +283,9 @@ def __init__(
shared_connection=shared_conn,
)
+ def get_urns(self) -> Set[str]:
+ return set(self._schema_cache.keys())
+
def get_urn_for_table(self, table: _TableName, lower: bool = False) -> str:
# TODO: Validate that this is the correct 2/3 layer hierarchy for the platform.
@@ -390,8 +400,6 @@ def convert_graphql_schema_metadata_to_info(
)
}
- # TODO add a method to load all from graphql
-
def close(self) -> None:
self._schema_cache.close()
@@ -906,32 +914,39 @@ def create_lineage_sql_parsed_result(
env: str,
schema: Optional[str] = None,
graph: Optional[DataHubGraph] = None,
-) -> Optional["SqlParsingResult"]:
- parsed_result: Optional["SqlParsingResult"] = None
+) -> SqlParsingResult:
+ needs_close = False
try:
- schema_resolver = (
- graph._make_schema_resolver(
+ if graph:
+ schema_resolver = graph._make_schema_resolver(
platform=platform,
platform_instance=platform_instance,
env=env,
)
- if graph is not None
- else SchemaResolver(
+ else:
+ needs_close = True
+ schema_resolver = SchemaResolver(
platform=platform,
platform_instance=platform_instance,
env=env,
graph=None,
)
- )
- parsed_result = sqlglot_lineage(
+ return sqlglot_lineage(
query,
schema_resolver=schema_resolver,
default_db=database,
default_schema=schema,
)
except Exception as e:
- logger.debug(f"Fail to prase query {query}", exc_info=e)
- logger.warning("Fail to parse custom SQL")
-
- return parsed_result
+ return SqlParsingResult(
+ in_tables=[],
+ out_tables=[],
+ column_lineage=None,
+ debug_info=SqlParsingDebugInfo(
+ table_error=e,
+ ),
+ )
+ finally:
+ if needs_close:
+ schema_resolver.close()
diff --git a/metadata-ingestion/tests/conftest.py b/metadata-ingestion/tests/conftest.py
index 0eb9ab250339c..0f278ab1e1311 100644
--- a/metadata-ingestion/tests/conftest.py
+++ b/metadata-ingestion/tests/conftest.py
@@ -1,6 +1,8 @@
import logging
import os
+import pathlib
import time
+from typing import List
import pytest
@@ -49,3 +51,40 @@ def pytest_addoption(parser):
default=False,
)
parser.addoption("--copy-output-files", action="store_true", default=False)
+
+
+def pytest_collection_modifyitems(
+ config: pytest.Config, items: List[pytest.Item]
+) -> None:
+ # https://docs.pytest.org/en/latest/reference/reference.html#pytest.hookspec.pytest_collection_modifyitems
+ # Adapted from https://stackoverflow.com/a/57046943/5004662.
+
+ root = pathlib.Path(config.rootpath)
+ integration_path = root / "tests/integration"
+
+ for item in items:
+ test_path = pathlib.Path(item.fspath)
+
+ if (
+ "docker_compose_runner" in item.fixturenames # type: ignore[attr-defined]
+ or any(
+ marker.name == "integration_batch_2" for marker in item.iter_markers()
+ )
+ ):
+ item.add_marker(pytest.mark.slow)
+
+ is_already_integration = any(
+ marker.name == "integration" for marker in item.iter_markers()
+ )
+
+ if integration_path in test_path.parents or is_already_integration:
+ # If it doesn't have a marker yet, put it in integration_batch_0.
+ if not any(
+ marker.name.startswith("integration_batch_")
+ for marker in item.iter_markers()
+ ):
+ item.add_marker(pytest.mark.integration_batch_0)
+
+ # Mark everything as an integration test.
+ if not is_already_integration:
+ item.add_marker(pytest.mark.integration)
diff --git a/metadata-ingestion/tests/integration/business-glossary/test_business_glossary.py b/metadata-ingestion/tests/integration/business-glossary/test_business_glossary.py
index 11fed2a805565..b6e1aca4d4fed 100644
--- a/metadata-ingestion/tests/integration/business-glossary/test_business_glossary.py
+++ b/metadata-ingestion/tests/integration/business-glossary/test_business_glossary.py
@@ -1,4 +1,4 @@
-from typing import Any, Dict, List
+from typing import Any, Dict
import pytest
from freezegun import freeze_time
@@ -45,14 +45,6 @@ def test_glossary_ingest(
):
test_resources_dir = pytestconfig.rootpath / "tests/integration/business-glossary"
- # These paths change from one instance run of the clickhouse docker to the other,
- # and the FROZEN_TIME does not apply to these.
- ignore_paths: List[str] = [
- r"root\[\d+\]\['proposedSnapshot'\].+\['aspects'\].+\['customProperties'\]\['metadata_modification_time'\]",
- r"root\[\d+\]\['proposedSnapshot'\].+\['aspects'\].+\['customProperties'\]\['data_paths'\]",
- r"root\[\d+\]\['proposedSnapshot'\].+\['aspects'\].+\['customProperties'\]\['metadata_path'\]",
- ]
-
output_mces_path: str = f"{tmp_path}/glossary_events.json"
golden_mces_path: str = f"{test_resources_dir}/{golden_file}"
@@ -72,7 +64,6 @@ def test_glossary_ingest(
# Verify the output.
mce_helpers.check_golden_file(
pytestconfig,
- ignore_paths=ignore_paths,
output_path=output_mces_path,
golden_path=golden_mces_path,
)
diff --git a/metadata-ingestion/tests/integration/delta_lake/test_delta_lake_minio.py b/metadata-ingestion/tests/integration/delta_lake/test_delta_lake_minio.py
index 36ec1d317fec4..6146c6d1a948c 100644
--- a/metadata-ingestion/tests/integration/delta_lake/test_delta_lake_minio.py
+++ b/metadata-ingestion/tests/integration/delta_lake/test_delta_lake_minio.py
@@ -9,6 +9,8 @@
from tests.test_helpers import mce_helpers
from tests.test_helpers.docker_helpers import wait_for_port
+pytestmark = pytest.mark.integration_batch_2
+
FROZEN_TIME = "2020-04-14 07:00:00"
MINIO_PORT = 9000
@@ -64,7 +66,7 @@ def populate_minio(pytestconfig, s3_bkt):
pytestconfig.rootpath / "tests/integration/delta_lake/test_data/"
)
- for root, dirs, files in os.walk(test_resources_dir):
+ for root, _dirs, files in os.walk(test_resources_dir):
for file in files:
full_path = os.path.join(root, file)
rel_path = os.path.relpath(full_path, test_resources_dir)
@@ -72,7 +74,6 @@ def populate_minio(pytestconfig, s3_bkt):
yield
-@pytest.mark.slow_integration
@freezegun.freeze_time("2023-01-01 00:00:00+00:00")
def test_delta_lake_ingest(pytestconfig, tmp_path, test_resources_dir):
# Run the metadata ingestion pipeline.
diff --git a/metadata-ingestion/tests/integration/hana/test_hana.py b/metadata-ingestion/tests/integration/hana/test_hana.py
index 0fa234d059e5e..726f8744167db 100644
--- a/metadata-ingestion/tests/integration/hana/test_hana.py
+++ b/metadata-ingestion/tests/integration/hana/test_hana.py
@@ -7,12 +7,12 @@
from tests.test_helpers.click_helpers import run_datahub_cmd
from tests.test_helpers.docker_helpers import wait_for_port
+pytestmark = pytest.mark.integration_batch_2
FROZEN_TIME = "2020-04-14 07:00:00"
@freeze_time(FROZEN_TIME)
@pytest.mark.xfail # TODO: debug the flakes for this test
-@pytest.mark.slow_integration
@pytest.mark.skipif(
platform.machine().lower() == "aarch64",
reason="The hdbcli dependency is not available for aarch64",
diff --git a/metadata-ingestion/tests/integration/hive/test_hive.py b/metadata-ingestion/tests/integration/hive/test_hive.py
index ce166c3b336ac..caffb761380dd 100644
--- a/metadata-ingestion/tests/integration/hive/test_hive.py
+++ b/metadata-ingestion/tests/integration/hive/test_hive.py
@@ -12,6 +12,8 @@
data_platform = "hive"
+pytestmark = pytest.mark.integration_batch_1
+
@pytest.fixture(scope="module")
def hive_runner(docker_compose_runner, pytestconfig):
@@ -54,7 +56,6 @@ def base_pipeline_config(events_file, db=None):
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
def test_hive_ingest(
loaded_hive, pytestconfig, test_resources_dir, tmp_path, mock_time
):
@@ -110,7 +111,6 @@ def test_hive_ingest_all_db(
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
def test_hive_instance_check(loaded_hive, test_resources_dir, tmp_path, pytestconfig):
instance: str = "production_warehouse"
diff --git a/metadata-ingestion/tests/integration/iceberg/test_iceberg.py b/metadata-ingestion/tests/integration/iceberg/test_iceberg.py
index e2a86480672e5..65ede11c3f1c0 100644
--- a/metadata-ingestion/tests/integration/iceberg/test_iceberg.py
+++ b/metadata-ingestion/tests/integration/iceberg/test_iceberg.py
@@ -8,22 +8,31 @@
from tests.test_helpers import mce_helpers
from tests.test_helpers.click_helpers import run_datahub_cmd
-from tests.test_helpers.docker_helpers import wait_for_port
+from tests.test_helpers.docker_helpers import cleanup_image, wait_for_port
from tests.test_helpers.state_helpers import (
get_current_checkpoint_from_pipeline,
run_and_get_pipeline,
validate_all_providers_have_committed_successfully,
)
+pytestmark = [
+ pytest.mark.integration_batch_1,
+ # Skip tests if not on Python 3.8 or higher.
+ pytest.mark.skipif(
+ sys.version_info < (3, 8), reason="Requires python 3.8 or higher"
+ ),
+]
FROZEN_TIME = "2020-04-14 07:00:00"
GMS_PORT = 8080
GMS_SERVER = f"http://localhost:{GMS_PORT}"
-@pytest.fixture(autouse=True)
-def skip_tests_if_python_before_3_8():
- if sys.version_info < (3, 8):
- pytest.skip("Requires python 3.8 or higher")
+@pytest.fixture(autouse=True, scope="module")
+def remove_docker_image():
+ yield
+
+ # The tabulario/spark-iceberg image is pretty large, so we remove it after the test.
+ cleanup_image("tabulario/spark-iceberg")
def spark_submit(file_path: str, args: str = "") -> None:
@@ -36,7 +45,6 @@ def spark_submit(file_path: str, args: str = "") -> None:
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration
def test_iceberg_ingest(docker_compose_runner, pytestconfig, tmp_path, mock_time):
test_resources_dir = pytestconfig.rootpath / "tests/integration/iceberg/"
@@ -69,7 +77,6 @@ def test_iceberg_ingest(docker_compose_runner, pytestconfig, tmp_path, mock_time
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration
def test_iceberg_stateful_ingest(
docker_compose_runner, pytestconfig, tmp_path, mock_time, mock_datahub_graph
):
@@ -189,7 +196,6 @@ def test_iceberg_stateful_ingest(
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration
def test_iceberg_profiling(docker_compose_runner, pytestconfig, tmp_path, mock_time):
test_resources_dir = pytestconfig.rootpath / "tests/integration/iceberg/"
diff --git a/metadata-ingestion/tests/integration/kafka-connect/test_kafka_connect.py b/metadata-ingestion/tests/integration/kafka-connect/test_kafka_connect.py
index 48063908e624f..8cf76cfb26af7 100644
--- a/metadata-ingestion/tests/integration/kafka-connect/test_kafka_connect.py
+++ b/metadata-ingestion/tests/integration/kafka-connect/test_kafka_connect.py
@@ -1,5 +1,5 @@
import subprocess
-from typing import Any, Dict, List, cast
+from typing import Any, Dict, List, Optional, cast
from unittest import mock
import pytest
@@ -16,6 +16,7 @@
validate_all_providers_have_committed_successfully,
)
+pytestmark = pytest.mark.integration_batch_1
FROZEN_TIME = "2021-10-25 13:00:00"
GMS_PORT = 8080
GMS_SERVER = f"http://localhost:{GMS_PORT}"
@@ -345,7 +346,6 @@ def loaded_kafka_connect(kafka_connect_runner):
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
def test_kafka_connect_ingest(
loaded_kafka_connect, pytestconfig, tmp_path, test_resources_dir
):
@@ -363,7 +363,6 @@ def test_kafka_connect_ingest(
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
def test_kafka_connect_mongosourceconnect_ingest(
loaded_kafka_connect, pytestconfig, tmp_path, test_resources_dir
):
@@ -381,7 +380,6 @@ def test_kafka_connect_mongosourceconnect_ingest(
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
def test_kafka_connect_s3sink_ingest(
loaded_kafka_connect, pytestconfig, tmp_path, test_resources_dir
):
@@ -399,7 +397,6 @@ def test_kafka_connect_s3sink_ingest(
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
def test_kafka_connect_ingest_stateful(
loaded_kafka_connect, pytestconfig, tmp_path, mock_datahub_graph, test_resources_dir
):
@@ -536,7 +533,7 @@ def test_kafka_connect_ingest_stateful(
assert sorted(deleted_job_urns) == sorted(difference_job_urns)
-def register_mock_api(request_mock: Any, override_data: dict = {}) -> None:
+def register_mock_api(request_mock: Any, override_data: Optional[dict] = None) -> None:
api_vs_response = {
"http://localhost:28083": {
"method": "GET",
@@ -549,7 +546,7 @@ def register_mock_api(request_mock: Any, override_data: dict = {}) -> None:
},
}
- api_vs_response.update(override_data)
+ api_vs_response.update(override_data or {})
for url in api_vs_response.keys():
request_mock.register_uri(
diff --git a/metadata-ingestion/tests/integration/nifi/test_nifi.py b/metadata-ingestion/tests/integration/nifi/test_nifi.py
index 58efd32c6deb3..bf17ee7472258 100644
--- a/metadata-ingestion/tests/integration/nifi/test_nifi.py
+++ b/metadata-ingestion/tests/integration/nifi/test_nifi.py
@@ -7,7 +7,9 @@
from datahub.ingestion.run.pipeline import Pipeline
from tests.test_helpers import fs_helpers, mce_helpers
-from tests.test_helpers.docker_helpers import wait_for_port
+from tests.test_helpers.docker_helpers import cleanup_image, wait_for_port
+
+pytestmark = pytest.mark.integration_batch_2
FROZEN_TIME = "2021-12-03 12:00:00"
@@ -48,9 +50,11 @@ def loaded_nifi(docker_compose_runner, test_resources_dir):
)
yield docker_services
+ # The nifi image is pretty large, so we remove it after the test.
+ cleanup_image("apache/nifi")
+
@freeze_time(FROZEN_TIME)
-@pytest.mark.slow_integration
def test_nifi_ingest_standalone(
loaded_nifi, pytestconfig, tmp_path, test_resources_dir
):
@@ -106,7 +110,6 @@ def test_nifi_ingest_standalone(
@freeze_time(FROZEN_TIME)
-@pytest.mark.slow_integration
def test_nifi_ingest_cluster(loaded_nifi, pytestconfig, tmp_path, test_resources_dir):
# Wait for nifi cluster to execute all lineage processors, max wait time 120 seconds
url = "http://localhost:9080/nifi-api/flow/process-groups/root"
diff --git a/metadata-ingestion/tests/integration/powerbi/golden_test_cll.json b/metadata-ingestion/tests/integration/powerbi/golden_test_cll.json
new file mode 100644
index 0000000000000..5f92cdcfb5bde
--- /dev/null
+++ b/metadata-ingestion/tests/integration/powerbi/golden_test_cll.json
@@ -0,0 +1,1357 @@
+[
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.public_issue_history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "dummy",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.public_issue_history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details",
+ "name": "public issue_history",
+ "description": "Library dataset description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.public_issue_history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.public_issue_history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.SNOWFLAKE_TESTTABLE,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = Snowflake.Databases(\"hp123rt5.ap-southeast-2.fakecomputing.com\",\"PBI_TEST_WAREHOUSE_PROD\",[Role=\"PBI_TEST_MEMBER\"]),\n PBI_TEST_Database = Source{[Name=\"PBI_TEST\",Kind=\"Database\"]}[Data],\n TEST_Schema = PBI_TEST_Database{[Name=\"TEST\",Kind=\"Schema\"]}[Data],\n TESTTABLE_Table = TEST_Schema{[Name=\"TESTTABLE\",Kind=\"Table\"]}[Data]\nin\n TESTTABLE_Table",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.SNOWFLAKE_TESTTABLE,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details",
+ "name": "SNOWFLAKE_TESTTABLE",
+ "description": "Library dataset description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.SNOWFLAKE_TESTTABLE,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.SNOWFLAKE_TESTTABLE,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.SNOWFLAKE_TESTTABLE,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "upstreamLineage",
+ "aspect": {
+ "json": {
+ "upstreams": [
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:snowflake,PBI_TEST.TEST.TESTTABLE,PROD)",
+ "type": "TRANSFORMED"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = Value.NativeQuery(Snowflake.Databases(\"bu20658.ap-southeast-2.snowflakecomputing.com\",\"operations_analytics_warehouse_prod\",[Role=\"OPERATIONS_ANALYTICS_MEMBER\"]){[Name=\"OPERATIONS_ANALYTICS\"]}[Data], \"SELECT#(lf)concat((UPPER(REPLACE(SELLER,'-',''))), MONTHID) as AGENT_KEY,#(lf)concat((UPPER(REPLACE(CLIENT_DIRECTOR,'-',''))), MONTHID) as CD_AGENT_KEY,#(lf) *#(lf)FROM#(lf)OPERATIONS_ANALYTICS.TRANSFORMED_PROD.V_APS_SME_UNITS_V4\", null, [EnableFolding=true]),\n #\"Added Conditional Column\" = Table.AddColumn(Source, \"SME Units ENT\", each if [DEAL_TYPE] = \"SME Unit\" then [UNIT] else 0),\n #\"Added Conditional Column1\" = Table.AddColumn(#\"Added Conditional Column\", \"Banklink Units\", each if [DEAL_TYPE] = \"Banklink\" then [UNIT] else 0),\n #\"Removed Columns\" = Table.RemoveColumns(#\"Added Conditional Column1\",{\"Banklink Units\"}),\n #\"Added Custom\" = Table.AddColumn(#\"Removed Columns\", \"Banklink Units\", each if [DEAL_TYPE] = \"Banklink\" and [SALES_TYPE] = \"3 - Upsell\"\nthen [UNIT]\n\nelse if [SALES_TYPE] = \"Adjusted BL Migration\"\nthen [UNIT]\n\nelse 0),\n #\"Added Custom1\" = Table.AddColumn(#\"Added Custom\", \"SME Units in $ (*$361)\", each if [DEAL_TYPE] = \"SME Unit\" \nand [SALES_TYPE] <> \"4 - Renewal\"\n then [UNIT] * 361\nelse 0),\n #\"Added Custom2\" = Table.AddColumn(#\"Added Custom1\", \"Banklink in $ (*$148)\", each [Banklink Units] * 148)\nin\n #\"Added Custom2\"",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details",
+ "name": "snowflake native-query",
+ "description": "Library dataset description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "upstreamLineage",
+ "aspect": {
+ "json": {
+ "upstreams": [
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:snowflake,operations_analytics.transformed_prod.v_aps_sme_units_v4,PROD)",
+ "type": "TRANSFORMED"
+ }
+ ],
+ "fineGrainedLineages": [
+ {
+ "upstreamType": "FIELD_SET",
+ "upstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,operations_analytics.transformed_prod.v_aps_sme_units_v4,PROD),monthid)",
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,operations_analytics.transformed_prod.v_aps_sme_units_v4,PROD),seller)"
+ ],
+ "downstreamType": "FIELD",
+ "downstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV),agent_key)"
+ ],
+ "confidenceScore": 1.0
+ },
+ {
+ "upstreamType": "FIELD_SET",
+ "upstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,operations_analytics.transformed_prod.v_aps_sme_units_v4,PROD),client_director)",
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,operations_analytics.transformed_prod.v_aps_sme_units_v4,PROD),monthid)"
+ ],
+ "downstreamType": "FIELD",
+ "downstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV),cd_agent_key)"
+ ],
+ "confidenceScore": 1.0
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.big-query-with-parameter,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = GoogleBigQuery.Database([BillingProject = #\"Parameter - Source\"]),\n#\"gcp-project\" = Source{[Name=#\"Parameter - Source\"]}[Data],\nuniversal_Schema = #\"gcp-project\"{[Name=\"universal\",Kind=\"Schema\"]}[Data],\nD_WH_DATE_Table = universal_Schema{[Name=\"D_WH_DATE\",Kind=\"Table\"]}[Data],\n#\"Filtered Rows\" = Table.SelectRows(D_WH_DATE_Table, each [D_DATE] > #datetime(2019, 9, 10, 0, 0, 0)),\n#\"Filtered Rows1\" = Table.SelectRows(#\"Filtered Rows\", each DateTime.IsInPreviousNHours([D_DATE], 87600))\n in \n#\"Filtered Rows1\"",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.big-query-with-parameter,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details",
+ "name": "big-query-with-parameter",
+ "description": "Library dataset description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.big-query-with-parameter,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.big-query-with-parameter,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = Value.NativeQuery(Snowflake.Databases(\"xaa48144.snowflakecomputing.com\",\"GSL_TEST_WH\",[Role=\"ACCOUNTADMIN\"]){[Name=\"GSL_TEST_DB\"]}[Data], \"select A.name from GSL_TEST_DB.PUBLIC.SALES_ANALYST as A inner join GSL_TEST_DB.PUBLIC.SALES_FORECAST as B on A.name = B.name where startswith(A.name, 'mo')\", null, [EnableFolding=true])\nin\n Source",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details",
+ "name": "snowflake native-query-with-join",
+ "description": "Library dataset description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.big-query-with-parameter,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "upstreamLineage",
+ "aspect": {
+ "json": {
+ "upstreams": [
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-test-project.universal.D_WH_DATE,PROD)",
+ "type": "TRANSFORMED"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.job-history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = Oracle.Database(\"localhost:1521/salesdb.GSLAB.COM\", [HierarchicalNavigation=true]), HR = Source{[Schema=\"HR\"]}[Data], EMPLOYEES1 = HR{[Name=\"EMPLOYEES\"]}[Data] \n in EMPLOYEES1",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "upstreamLineage",
+ "aspect": {
+ "json": {
+ "upstreams": [
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:snowflake,gsl_test_db.public.sales_analyst,PROD)",
+ "type": "TRANSFORMED"
+ },
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:snowflake,gsl_test_db.public.sales_forecast,PROD)",
+ "type": "TRANSFORMED"
+ }
+ ],
+ "fineGrainedLineages": [
+ {
+ "upstreamType": "FIELD_SET",
+ "upstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,gsl_test_db.public.sales_analyst,PROD),name)"
+ ],
+ "downstreamType": "FIELD",
+ "downstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV),name)"
+ ],
+ "confidenceScore": 1.0
+ },
+ {
+ "upstreamType": "FIELD_SET",
+ "upstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,gsl_test_db.public.sales_analyst,PROD),name)"
+ ],
+ "downstreamType": "FIELD",
+ "downstreams": [
+ "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV),name)"
+ ],
+ "confidenceScore": 1.0
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.job-history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details",
+ "name": "job-history",
+ "description": "Library dataset description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.job-history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.job-history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.job-history,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "upstreamLineage",
+ "aspect": {
+ "json": {
+ "upstreams": [
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:oracle,salesdb.HR.EMPLOYEES,PROD)",
+ "type": "TRANSFORMED"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.postgres_test_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = PostgreSQL.Database(\"localhost\" , \"mics\" ),\n public_order_date = Source{[Schema=\"public\",Item=\"order_date\"]}[Data] \n in \n public_order_date",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.postgres_test_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details",
+ "name": "postgres_test_table",
+ "description": "Library dataset description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.postgres_test_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.postgres_test_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.postgres_test_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "upstreamLineage",
+ "aspect": {
+ "json": {
+ "upstreams": [
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:postgres,mics.public.order_date,PROD)",
+ "type": "TRANSFORMED"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.dbo_book_issue,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = Sql.Database(\"localhost\", \"library\"),\n dbo_book_issue = Source{[Schema=\"dbo\",Item=\"book_issue\"]}[Data]\n in dbo_book_issue",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.dbo_book_issue,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "ba0130a1-5b03-40de-9535-b34e778ea6ed"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/ba0130a1-5b03-40de-9535-b34e778ea6ed/details",
+ "name": "dbo_book_issue",
+ "description": "hr pbi test description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.dbo_book_issue,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.dbo_book_issue,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.ms_sql_native_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "let\n Source = Sql.Database(\"AUPRDWHDB\", \"COMMOPSDB\", [Query=\"select *,#(lf)concat((UPPER(REPLACE(CLIENT_DIRECTOR,'-',''))), MONTH_WID) as CD_AGENT_KEY,#(lf)concat((UPPER(REPLACE(CLIENT_MANAGER_CLOSING_MONTH,'-',''))), MONTH_WID) as AGENT_KEY#(lf)#(lf)from V_PS_CD_RETENTION\", CommandTimeout=#duration(0, 1, 30, 0)]),\n #\"Changed Type\" = Table.TransformColumnTypes(Source,{{\"mth_date\", type date}}),\n #\"Added Custom\" = Table.AddColumn(#\"Changed Type\", \"Month\", each Date.Month([mth_date])),\n #\"Added Custom1\" = Table.AddColumn(#\"Added Custom\", \"TPV Opening\", each if [Month] = 1 then [TPV_AMV_OPENING]\nelse if [Month] = 2 then 0\nelse if [Month] = 3 then 0\nelse if [Month] = 4 then [TPV_AMV_OPENING]\nelse if [Month] = 5 then 0\nelse if [Month] = 6 then 0\nelse if [Month] = 7 then [TPV_AMV_OPENING]\nelse if [Month] = 8 then 0\nelse if [Month] = 9 then 0\nelse if [Month] = 10 then [TPV_AMV_OPENING]\nelse if [Month] = 11 then 0\nelse if [Month] = 12 then 0\n\nelse 0)\nin\n #\"Added Custom1\"",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.ms_sql_native_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "ba0130a1-5b03-40de-9535-b34e778ea6ed"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/ba0130a1-5b03-40de-9535-b34e778ea6ed/details",
+ "name": "ms_sql_native_table",
+ "description": "hr pbi test description",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.dbo_book_issue,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "upstreamLineage",
+ "aspect": {
+ "json": {
+ "upstreams": [
+ {
+ "auditStamp": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "dataset": "urn:li:dataset:(urn:li:dataPlatform:mssql,library.dbo.book_issue,PROD)",
+ "type": "TRANSFORMED"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.ms_sql_native_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.ms_sql_native_table,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "corpuser",
+ "entityUrn": "urn:li:corpuser:users.User1@foo.com",
+ "changeType": "UPSERT",
+ "aspectName": "corpUserKey",
+ "aspect": {
+ "json": {
+ "username": "User1@foo.com"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "corpuser",
+ "entityUrn": "urn:li:corpuser:users.User2@foo.com",
+ "changeType": "UPSERT",
+ "aspectName": "corpUserKey",
+ "aspect": {
+ "json": {
+ "username": "User2@foo.com"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.B8E293DC-0C83-4AA0-9BB9-0A8738DF24A0)",
+ "changeType": "UPSERT",
+ "aspectName": "chartInfo",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "createdFrom": "Dataset",
+ "datasetId": "05169CD2-E713-41E6-9600-1D8066D95445",
+ "datasetWebUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/05169CD2-E713-41E6-9600-1D8066D95445/details"
+ },
+ "title": "test_tile",
+ "description": "test_tile",
+ "lastModified": {
+ "created": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "lastModified": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ }
+ },
+ "inputs": [
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.public_issue_history,DEV)"
+ },
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.SNOWFLAKE_TESTTABLE,DEV)"
+ },
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query,DEV)"
+ },
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.big-query-with-parameter,DEV)"
+ },
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.snowflake_native-query-with-join,DEV)"
+ },
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.job-history,DEV)"
+ },
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,library-dataset.postgres_test_table,DEV)"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.B8E293DC-0C83-4AA0-9BB9-0A8738DF24A0)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.B8E293DC-0C83-4AA0-9BB9-0A8738DF24A0)",
+ "changeType": "UPSERT",
+ "aspectName": "chartKey",
+ "aspect": {
+ "json": {
+ "dashboardTool": "powerbi",
+ "chartId": "powerbi.linkedin.com/charts/B8E293DC-0C83-4AA0-9BB9-0A8738DF24A0"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.B8E293DC-0C83-4AA0-9BB9-0A8738DF24A0)",
+ "changeType": "UPSERT",
+ "aspectName": "browsePaths",
+ "aspect": {
+ "json": {
+ "paths": [
+ "/powerbi/demo-workspace"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.B8E293DC-0C83-4AA0-9BB9-0A8738DF24A0)",
+ "changeType": "UPSERT",
+ "aspectName": "browsePathsV2",
+ "aspect": {
+ "json": {
+ "path": [
+ {
+ "id": "demo-workspace"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.23212598-23b5-4980-87cc-5fc0ecd84385)",
+ "changeType": "UPSERT",
+ "aspectName": "chartInfo",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "createdFrom": "Dataset",
+ "datasetId": "ba0130a1-5b03-40de-9535-b34e778ea6ed",
+ "datasetWebUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/ba0130a1-5b03-40de-9535-b34e778ea6ed/details"
+ },
+ "title": "yearly_sales",
+ "description": "yearly_sales",
+ "lastModified": {
+ "created": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "lastModified": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ }
+ },
+ "inputs": [
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.dbo_book_issue,DEV)"
+ },
+ {
+ "string": "urn:li:dataset:(urn:li:dataPlatform:powerbi,hr_pbi_test.ms_sql_native_table,DEV)"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.23212598-23b5-4980-87cc-5fc0ecd84385)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.23212598-23b5-4980-87cc-5fc0ecd84385)",
+ "changeType": "UPSERT",
+ "aspectName": "chartKey",
+ "aspect": {
+ "json": {
+ "dashboardTool": "powerbi",
+ "chartId": "powerbi.linkedin.com/charts/23212598-23b5-4980-87cc-5fc0ecd84385"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.23212598-23b5-4980-87cc-5fc0ecd84385)",
+ "changeType": "UPSERT",
+ "aspectName": "browsePaths",
+ "aspect": {
+ "json": {
+ "paths": [
+ "/powerbi/demo-workspace"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "chart",
+ "entityUrn": "urn:li:chart:(powerbi,charts.23212598-23b5-4980-87cc-5fc0ecd84385)",
+ "changeType": "UPSERT",
+ "aspectName": "browsePathsV2",
+ "aspect": {
+ "json": {
+ "path": [
+ {
+ "id": "demo-workspace"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dashboard",
+ "entityUrn": "urn:li:dashboard:(powerbi,dashboards.7D668CAD-7FFC-4505-9215-655BCA5BEBAE)",
+ "changeType": "UPSERT",
+ "aspectName": "browsePaths",
+ "aspect": {
+ "json": {
+ "paths": [
+ "/powerbi/demo-workspace"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dashboard",
+ "entityUrn": "urn:li:dashboard:(powerbi,dashboards.7D668CAD-7FFC-4505-9215-655BCA5BEBAE)",
+ "changeType": "UPSERT",
+ "aspectName": "dashboardInfo",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "chartCount": "2",
+ "workspaceName": "demo-workspace",
+ "workspaceId": "64ED5CAD-7C10-4684-8180-826122881108"
+ },
+ "title": "test_dashboard",
+ "description": "Description of test dashboard",
+ "charts": [
+ "urn:li:chart:(powerbi,charts.B8E293DC-0C83-4AA0-9BB9-0A8738DF24A0)",
+ "urn:li:chart:(powerbi,charts.23212598-23b5-4980-87cc-5fc0ecd84385)"
+ ],
+ "datasets": [],
+ "lastModified": {
+ "created": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ },
+ "lastModified": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ }
+ },
+ "dashboardUrl": "https://localhost/dashboards/web/1"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dashboard",
+ "entityUrn": "urn:li:dashboard:(powerbi,dashboards.7D668CAD-7FFC-4505-9215-655BCA5BEBAE)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dashboard",
+ "entityUrn": "urn:li:dashboard:(powerbi,dashboards.7D668CAD-7FFC-4505-9215-655BCA5BEBAE)",
+ "changeType": "UPSERT",
+ "aspectName": "dashboardKey",
+ "aspect": {
+ "json": {
+ "dashboardTool": "powerbi",
+ "dashboardId": "powerbi.linkedin.com/dashboards/7D668CAD-7FFC-4505-9215-655BCA5BEBAE"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dashboard",
+ "entityUrn": "urn:li:dashboard:(powerbi,dashboards.7D668CAD-7FFC-4505-9215-655BCA5BEBAE)",
+ "changeType": "UPSERT",
+ "aspectName": "ownership",
+ "aspect": {
+ "json": {
+ "owners": [
+ {
+ "owner": "urn:li:corpuser:users.User1@foo.com",
+ "type": "NONE"
+ },
+ {
+ "owner": "urn:li:corpuser:users.User2@foo.com",
+ "type": "NONE"
+ }
+ ],
+ "lastModified": {
+ "time": 0,
+ "actor": "urn:li:corpuser:unknown"
+ }
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dashboard",
+ "entityUrn": "urn:li:dashboard:(powerbi,dashboards.7D668CAD-7FFC-4505-9215-655BCA5BEBAE)",
+ "changeType": "UPSERT",
+ "aspectName": "browsePathsV2",
+ "aspect": {
+ "json": {
+ "path": [
+ {
+ "id": "demo-workspace"
+ }
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,employee-dataset.employee_ctc,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "viewProperties",
+ "aspect": {
+ "json": {
+ "materialized": false,
+ "viewLogic": "dummy",
+ "viewLanguage": "m_query"
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "corpuser",
+ "entityUrn": "urn:li:corpuser:users.User1@foo.com",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,employee-dataset.employee_ctc,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,employee-dataset.employee_ctc,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "subTypes",
+ "aspect": {
+ "json": {
+ "typeNames": [
+ "PowerBI Dataset Table",
+ "View"
+ ]
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "dataset",
+ "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:powerbi,employee-dataset.employee_ctc,DEV)",
+ "changeType": "UPSERT",
+ "aspectName": "datasetProperties",
+ "aspect": {
+ "json": {
+ "customProperties": {
+ "datasetId": "91580e0e-1680-4b1c-bbf9-4f6764d7a5ff"
+ },
+ "externalUrl": "http://localhost/groups/64ED5CAD-7C10-4684-8180-826122881108/datasets/91580e0e-1680-4b1c-bbf9-4f6764d7a5ff/details",
+ "name": "employee_ctc",
+ "description": "Employee Management",
+ "tags": []
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+},
+{
+ "entityType": "corpuser",
+ "entityUrn": "urn:li:corpuser:users.User2@foo.com",
+ "changeType": "UPSERT",
+ "aspectName": "status",
+ "aspect": {
+ "json": {
+ "removed": false
+ }
+ },
+ "systemMetadata": {
+ "lastObserved": 1643871600000,
+ "runId": "powerbi-test"
+ }
+}
+]
\ No newline at end of file
diff --git a/metadata-ingestion/tests/integration/powerbi/test_m_parser.py b/metadata-ingestion/tests/integration/powerbi/test_m_parser.py
index e77a12aa4088e..2e9c02ef759a5 100644
--- a/metadata-ingestion/tests/integration/powerbi/test_m_parser.py
+++ b/metadata-ingestion/tests/integration/powerbi/test_m_parser.py
@@ -15,8 +15,11 @@
AbstractDataPlatformInstanceResolver,
create_dataplatform_instance_resolver,
)
-from datahub.ingestion.source.powerbi.m_query import parser, tree_function
-from datahub.ingestion.source.powerbi.m_query.resolver import DataPlatformTable
+from datahub.ingestion.source.powerbi.m_query import parser, resolver, tree_function
+from datahub.ingestion.source.powerbi.m_query.resolver import DataPlatformTable, Lineage
+from datahub.utilities.sqlglot_lineage import ColumnLineageInfo, DownstreamColumnRef
+
+pytestmark = pytest.mark.slow
M_QUERIES = [
'let\n Source = Snowflake.Databases("bu10758.ap-unknown-2.fakecomputing.com","PBI_TEST_WAREHOUSE_PROD",[Role="PBI_TEST_MEMBER"]),\n PBI_TEST_Database = Source{[Name="PBI_TEST",Kind="Database"]}[Data],\n TEST_Schema = PBI_TEST_Database{[Name="TEST",Kind="Schema"]}[Data],\n TESTTABLE_Table = TEST_Schema{[Name="TESTTABLE",Kind="Table"]}[Data]\nin\n TESTTABLE_Table',
@@ -68,6 +71,15 @@ def get_default_instances(
return PipelineContext(run_id="fake"), config, platform_instance_resolver
+def combine_upstreams_from_lineage(lineage: List[Lineage]) -> List[DataPlatformTable]:
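+    """Flatten the upstream tables from every Lineage entry into a single list."""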
+ data_platforms: List[DataPlatformTable] = []
+
+ for item in lineage:
+ data_platforms.extend(item.upstreams)
+
+ return data_platforms
+
+
@pytest.mark.integration
def test_parse_m_query1():
expression: str = M_QUERIES[0]
@@ -180,7 +192,7 @@ def test_snowflake_regular_case():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -210,7 +222,7 @@ def test_postgres_regular_case():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -240,7 +252,7 @@ def test_databricks_regular_case():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -270,7 +282,7 @@ def test_oracle_regular_case():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -300,7 +312,7 @@ def test_mssql_regular_case():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -346,7 +358,7 @@ def test_mssql_with_query():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert data_platform_tables[0].urn == expected_tables[index]
@@ -386,7 +398,7 @@ def test_snowflake_native_query():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert data_platform_tables[0].urn == expected_tables[index]
@@ -408,7 +420,7 @@ def test_google_bigquery_1():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -440,7 +452,7 @@ def test_google_bigquery_2():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -470,7 +482,7 @@ def test_for_each_expression_1():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -499,7 +511,7 @@ def test_for_each_expression_2():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -521,15 +533,15 @@ def test_native_query_disabled():
reporter = PowerBiDashboardSourceReport()
ctx, config, platform_instance_resolver = get_default_instances()
- config.native_query_parsing = False
- data_platform_tables: List[DataPlatformTable] = parser.get_upstream_tables(
+ config.native_query_parsing = False # Disable native query parsing
+ lineage: List[Lineage] = parser.get_upstream_tables(
table,
reporter,
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
)
- assert len(data_platform_tables) == 0
+ assert len(lineage) == 0
@pytest.mark.integration
@@ -546,12 +558,14 @@ def test_multi_source_table():
ctx, config, platform_instance_resolver = get_default_instances()
- data_platform_tables: List[DataPlatformTable] = parser.get_upstream_tables(
- table,
- reporter,
- ctx=ctx,
- config=config,
- platform_instance_resolver=platform_instance_resolver,
+ data_platform_tables: List[DataPlatformTable] = combine_upstreams_from_lineage(
+ parser.get_upstream_tables(
+ table,
+ reporter,
+ ctx=ctx,
+ config=config,
+ platform_instance_resolver=platform_instance_resolver,
+ )
)
assert len(data_platform_tables) == 2
@@ -579,12 +593,14 @@ def test_table_combine():
ctx, config, platform_instance_resolver = get_default_instances()
- data_platform_tables: List[DataPlatformTable] = parser.get_upstream_tables(
- table,
- reporter,
- ctx=ctx,
- config=config,
- platform_instance_resolver=platform_instance_resolver,
+ data_platform_tables: List[DataPlatformTable] = combine_upstreams_from_lineage(
+ parser.get_upstream_tables(
+ table,
+ reporter,
+ ctx=ctx,
+ config=config,
+ platform_instance_resolver=platform_instance_resolver,
+ )
)
assert len(data_platform_tables) == 2
@@ -622,7 +638,7 @@ def test_expression_is_none():
ctx, config, platform_instance_resolver = get_default_instances()
- data_platform_tables: List[DataPlatformTable] = parser.get_upstream_tables(
+ lineage: List[Lineage] = parser.get_upstream_tables(
table,
reporter,
ctx=ctx,
@@ -630,7 +646,7 @@ def test_expression_is_none():
platform_instance_resolver=platform_instance_resolver,
)
- assert len(data_platform_tables) == 0
+ assert len(lineage) == 0
def test_redshift_regular_case():
@@ -649,7 +665,7 @@ def test_redshift_regular_case():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -676,7 +692,7 @@ def test_redshift_native_query():
ctx=ctx,
config=config,
platform_instance_resolver=platform_instance_resolver,
- )
+ )[0].upstreams
assert len(data_platform_tables) == 1
assert (
@@ -706,7 +722,7 @@ def test_sqlglot_parser():
}
)
- data_platform_tables: List[DataPlatformTable] = parser.get_upstream_tables(
+ lineage: List[resolver.Lineage] = parser.get_upstream_tables(
table,
reporter,
ctx=ctx,
@@ -714,6 +730,8 @@ def test_sqlglot_parser():
platform_instance_resolver=platform_instance_resolver,
)
+ data_platform_tables: List[DataPlatformTable] = lineage[0].upstreams
+
assert len(data_platform_tables) == 2
assert (
data_platform_tables[0].urn
@@ -723,3 +741,76 @@ def test_sqlglot_parser():
data_platform_tables[1].urn
== "urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_deployment.operations_analytics.transformed_prod.v_sme_unit_targets,PROD)"
)
+
+ assert lineage[0].column_lineage == [
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="client_director"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="tier"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column='upper("manager")'),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="team_type"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="date_target"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="monthid"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="target_team"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="seller_email"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="agent_key"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="sme_quota"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="revenue_quota"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="service_quota"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="bl_target"),
+ upstreams=[],
+ logic=None,
+ ),
+ ColumnLineageInfo(
+ downstream=DownstreamColumnRef(table=None, column="software_quota"),
+ upstreams=[],
+ logic=None,
+ ),
+ ]
diff --git a/metadata-ingestion/tests/integration/powerbi/test_powerbi.py b/metadata-ingestion/tests/integration/powerbi/test_powerbi.py
index 5036f758a7de9..b0695e3ea9954 100644
--- a/metadata-ingestion/tests/integration/powerbi/test_powerbi.py
+++ b/metadata-ingestion/tests/integration/powerbi/test_powerbi.py
@@ -1,4 +1,5 @@
import logging
+import re
import sys
from typing import Any, Dict, List, cast
from unittest import mock
@@ -20,6 +21,7 @@
)
from tests.test_helpers import mce_helpers
+pytestmark = pytest.mark.slow
FROZEN_TIME = "2022-02-03 07:00:00"
@@ -1126,7 +1128,7 @@ def test_dataset_type_mapping_error(
"""
register_mock_api(request_mock=requests_mock)
- try:
+ with pytest.raises(Exception, match=r"dataset_type_mapping is deprecated"):
Pipeline.create(
{
"run_id": "powerbi-test",
@@ -1149,11 +1151,6 @@ def test_dataset_type_mapping_error(
},
}
)
- except Exception as e:
- assert (
- "dataset_type_mapping is deprecated. Use server_to_platform_instance only."
- in str(e)
- )
@freeze_time(FROZEN_TIME)
@@ -1505,3 +1502,90 @@ def test_independent_datasets_extraction(
output_path=tmp_path / "powerbi_independent_mces.json",
golden_path=f"{test_resources_dir}/{golden_file}",
)
+
+
+@freeze_time(FROZEN_TIME)
+@mock.patch("msal.ConfidentialClientApplication", side_effect=mock_msal_cca)
+def test_cll_extraction(mock_msal, pytestconfig, tmp_path, mock_time, requests_mock):
+
+ test_resources_dir = pytestconfig.rootpath / "tests/integration/powerbi"
+
+ register_mock_api(
+ request_mock=requests_mock,
+ )
+
+ default_conf: dict = default_source_config()
+
+ del default_conf[
+ "dataset_type_mapping"
+    ]  # delete this key so that the connector falls back to the default (all data platforms)
+
+ pipeline = Pipeline.create(
+ {
+ "run_id": "powerbi-test",
+ "source": {
+ "type": "powerbi",
+ "config": {
+ **default_conf,
+ "extract_lineage": True,
+ "extract_column_level_lineage": True,
+ "enable_advance_lineage_sql_construct": True,
+ "native_query_parsing": True,
+ "extract_independent_datasets": True,
+ },
+ },
+ "sink": {
+ "type": "file",
+ "config": {
+ "filename": f"{tmp_path}/powerbi_cll_mces.json",
+ },
+ },
+ }
+ )
+
+ pipeline.run()
+ pipeline.raise_from_status()
+ golden_file = "golden_test_cll.json"
+
+ mce_helpers.check_golden_file(
+ pytestconfig,
+ output_path=tmp_path / "powerbi_cll_mces.json",
+ golden_path=f"{test_resources_dir}/{golden_file}",
+ )
+
+
+@freeze_time(FROZEN_TIME)
+@mock.patch("msal.ConfidentialClientApplication", side_effect=mock_msal_cca)
+def test_cll_extraction_flags(
+ mock_msal, pytestconfig, tmp_path, mock_time, requests_mock
+):
+
+ register_mock_api(
+ request_mock=requests_mock,
+ )
+
+ default_conf: dict = default_source_config()
+ pattern: str = re.escape(
+ "Enable all these flags in recipe: ['native_query_parsing', 'enable_advance_lineage_sql_construct', 'extract_lineage']"
+ )
+
+ with pytest.raises(Exception, match=pattern):
+
+ Pipeline.create(
+ {
+ "run_id": "powerbi-test",
+ "source": {
+ "type": "powerbi",
+ "config": {
+ **default_conf,
+ "extract_column_level_lineage": True,
+ },
+ },
+ "sink": {
+ "type": "file",
+ "config": {
+ "filename": f"{tmp_path}/powerbi_cll_mces.json",
+ },
+ },
+ }
+ )
diff --git a/metadata-ingestion/tests/integration/presto-on-hive/test_presto_on_hive.py b/metadata-ingestion/tests/integration/presto-on-hive/test_presto_on_hive.py
index 17e21f3790070..31d801ccf7dee 100644
--- a/metadata-ingestion/tests/integration/presto-on-hive/test_presto_on_hive.py
+++ b/metadata-ingestion/tests/integration/presto-on-hive/test_presto_on_hive.py
@@ -10,6 +10,7 @@
from tests.test_helpers import fs_helpers, mce_helpers
from tests.test_helpers.docker_helpers import wait_for_port
+pytestmark = pytest.mark.integration_batch_1
FROZEN_TIME = "2021-09-23 12:00:00"
data_platform = "presto-on-hive"
@@ -51,7 +52,6 @@ def loaded_presto_on_hive(presto_on_hive_runner):
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
@pytest.mark.parametrize(
"mode,use_catalog_subtype,use_dataset_pascalcase_subtype,include_catalog_name_in_ids,simplify_nested_field_paths,"
"test_suffix",
@@ -137,7 +137,6 @@ def test_presto_on_hive_ingest(
@freeze_time(FROZEN_TIME)
-@pytest.mark.integration_batch_1
def test_presto_on_hive_instance_ingest(
loaded_presto_on_hive, test_resources_dir, pytestconfig, tmp_path, mock_time
):
diff --git a/metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py b/metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py
index 71428a7847953..53b8519a886d3 100644
--- a/metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py
+++ b/metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py
@@ -757,7 +757,7 @@ def test_tableau_no_verify():
@freeze_time(FROZEN_TIME)
-@pytest.mark.slow_unit
+@pytest.mark.slow
def test_tableau_signout_timeout(pytestconfig, tmp_path, mock_datahub_graph):
enable_logging()
output_file_name: str = "tableau_signout_timeout_mces.json"
diff --git a/metadata-ingestion/tests/test_helpers/docker_helpers.py b/metadata-ingestion/tests/test_helpers/docker_helpers.py
index f0db2d91e362c..30157c3a78094 100644
--- a/metadata-ingestion/tests/test_helpers/docker_helpers.py
+++ b/metadata-ingestion/tests/test_helpers/docker_helpers.py
@@ -73,3 +73,26 @@ def run(
yield docker_services
return run
+
+
+def cleanup_image(image_name: str) -> None:
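+    """Remove all local Docker images whose repository reference starts with image_name."""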
+ assert ":" not in image_name, "image_name should not contain a tag"
+
+ images_proc = subprocess.run(
+ f"docker image ls --filter 'reference={image_name}*' -q",
+ shell=True,
+ capture_output=True,
+ text=True,
+ check=True,
+ )
+
+ if not images_proc.stdout:
+ logger.debug(f"No images to cleanup for {image_name}")
+ return
+
+ image_ids = images_proc.stdout.splitlines()
+ subprocess.run(
+ f"docker image rm {' '.join(image_ids)}",
+ shell=True,
+ check=True,
+ )
diff --git a/metadata-ingestion/tests/unit/sql_parsing/goldens/test_create_table_ddl.json b/metadata-ingestion/tests/unit/sql_parsing/goldens/test_create_table_ddl.json
new file mode 100644
index 0000000000000..4773974545bfa
--- /dev/null
+++ b/metadata-ingestion/tests/unit/sql_parsing/goldens/test_create_table_ddl.json
@@ -0,0 +1,8 @@
+{
+ "query_type": "CREATE",
+ "in_tables": [],
+ "out_tables": [
+ "urn:li:dataset:(urn:li:dataPlatform:sqlite,costs,PROD)"
+ ],
+ "column_lineage": null
+}
\ No newline at end of file
diff --git a/metadata-ingestion/tests/unit/sql_parsing/test_sqlglot_lineage.py b/metadata-ingestion/tests/unit/sql_parsing/test_sqlglot_lineage.py
index 483c1ac4cc7f9..2a965a9bb1e61 100644
--- a/metadata-ingestion/tests/unit/sql_parsing/test_sqlglot_lineage.py
+++ b/metadata-ingestion/tests/unit/sql_parsing/test_sqlglot_lineage.py
@@ -274,6 +274,21 @@ def test_expand_select_star_basic():
)
+def test_create_table_ddl():
+ assert_sql_result(
+ """
+CREATE TABLE IF NOT EXISTS costs (
+ id INTEGER PRIMARY KEY,
+ month TEXT NOT NULL,
+ total_cost REAL NOT NULL,
+ area REAL NOT NULL
+)
+""",
+ dialect="sqlite",
+ expected_file=RESOURCE_DIR / "test_create_table_ddl.json",
+ )
+
+
def test_snowflake_column_normalization():
# Technically speaking this is incorrect since the column names are different and both quoted.
diff --git a/metadata-ingestion/tests/unit/test_sql_common.py b/metadata-ingestion/tests/unit/test_sql_common.py
index 95af0e623e991..808b38192411d 100644
--- a/metadata-ingestion/tests/unit/test_sql_common.py
+++ b/metadata-ingestion/tests/unit/test_sql_common.py
@@ -4,12 +4,11 @@
import pytest
from sqlalchemy.engine.reflection import Inspector
-from datahub.ingestion.source.sql.sql_common import (
- PipelineContext,
- SQLAlchemySource,
+from datahub.ingestion.source.sql.sql_common import PipelineContext, SQLAlchemySource
+from datahub.ingestion.source.sql.sql_config import SQLCommonConfig
+from datahub.ingestion.source.sql.sqlalchemy_uri_mapper import (
get_platform_from_sqlalchemy_uri,
)
-from datahub.ingestion.source.sql.sql_config import SQLCommonConfig
class _TestSQLAlchemyConfig(SQLCommonConfig):
diff --git a/metadata-ingestion/tests/unit/test_transform_dataset.py b/metadata-ingestion/tests/unit/test_transform_dataset.py
index 8b2535eea1fe9..bc95451620d22 100644
--- a/metadata-ingestion/tests/unit/test_transform_dataset.py
+++ b/metadata-ingestion/tests/unit/test_transform_dataset.py
@@ -62,6 +62,9 @@
)
from datahub.ingestion.transformer.dataset_transformer import DatasetTransformer
from datahub.ingestion.transformer.extract_dataset_tags import ExtractDatasetTags
+from datahub.ingestion.transformer.extract_ownership_from_tags import (
+ ExtractOwnersFromTagsTransformer,
+)
from datahub.ingestion.transformer.mark_dataset_status import MarkDatasetStatus
from datahub.ingestion.transformer.remove_dataset_ownership import (
SimpleRemoveDatasetOwnership,
@@ -72,6 +75,7 @@
GlobalTagsClass,
MetadataChangeEventClass,
OwnershipClass,
+ OwnershipTypeClass,
StatusClass,
TagAssociationClass,
)
@@ -586,6 +590,91 @@ def test_mark_status_dataset(tmp_path):
)
+def test_extract_owners_from_tags():
+ def _test_owner(
+ tag: str,
+ config: Dict,
+ expected_owner: str,
+ expected_owner_type: Optional[str] = None,
+ ) -> None:
+ dataset = make_generic_dataset(
+ aspects=[
+ models.GlobalTagsClass(
+ tags=[TagAssociationClass(tag=builder.make_tag_urn(tag))]
+ )
+ ]
+ )
+ transformer = ExtractOwnersFromTagsTransformer.create(
+ config,
+ PipelineContext(run_id="test"),
+ )
+ transformed = list(
+ transformer.transform(
+ [
+ RecordEnvelope(dataset, metadata={}),
+ ]
+ )
+ )
+ owners_aspect = transformed[0].record.proposedSnapshot.aspects[0]
+ owners = owners_aspect.owners
+ owner = owners[0]
+ if expected_owner_type is not None:
+ assert owner.type == expected_owner_type
+ assert owner.owner == expected_owner
+
+ _test_owner(
+ tag="owner:foo",
+ config={
+ "tag_prefix": "owner:",
+ },
+ expected_owner="urn:li:corpuser:foo",
+ )
+ _test_owner(
+ tag="abcdef-owner:foo",
+ config={
+ "tag_prefix": ".*owner:",
+ },
+ expected_owner="urn:li:corpuser:foo",
+ )
+ _test_owner(
+ tag="owner:foo",
+ config={
+ "tag_prefix": "owner:",
+ "is_user": False,
+ },
+ expected_owner="urn:li:corpGroup:foo",
+ )
+ _test_owner(
+ tag="owner:foo",
+ config={
+ "tag_prefix": "owner:",
+ "email_domain": "example.com",
+ },
+ expected_owner="urn:li:corpuser:foo@example.com",
+ )
+ _test_owner(
+ tag="owner:foo",
+ config={
+ "tag_prefix": "owner:",
+ "email_domain": "example.com",
+ "owner_type": "TECHNICAL_OWNER",
+ },
+ expected_owner="urn:li:corpuser:foo@example.com",
+ expected_owner_type=OwnershipTypeClass.TECHNICAL_OWNER,
+ )
+ _test_owner(
+ tag="owner:foo",
+ config={
+ "tag_prefix": "owner:",
+ "email_domain": "example.com",
+ "owner_type": "AUTHOR",
+ "owner_type_urn": "urn:li:ownershipType:ad8557d6-dcb9-4d2a-83fc-b7d0d54f3e0f",
+ },
+ expected_owner="urn:li:corpuser:foo@example.com",
+ expected_owner_type=OwnershipTypeClass.CUSTOM,
+ )
+
+
def test_add_dataset_browse_paths():
dataset = make_generic_dataset()
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/entity/AspectDao.java b/metadata-io/src/main/java/com/linkedin/metadata/entity/AspectDao.java
index 2d5c5e23ae528..42dd3f0405a6a 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/entity/AspectDao.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/entity/AspectDao.java
@@ -8,6 +8,7 @@
import io.ebean.PagedList;
import io.ebean.Transaction;
+import java.util.stream.Stream;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.sql.Timestamp;
@@ -103,6 +104,9 @@ Integer countAspect(
@Nonnull
PagedList<EbeanAspectV2> getPagedAspects(final RestoreIndicesArgs args);
+ @Nonnull
+  Stream<EntityAspect> streamAspects(String entityName, String aspectName);
+
int deleteUrn(@Nullable Transaction tx, @Nonnull final String urn);
@Nonnull
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/entity/EntityServiceImpl.java b/metadata-io/src/main/java/com/linkedin/metadata/entity/EntityServiceImpl.java
index 66188473b9d03..57f88e31deea5 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/entity/EntityServiceImpl.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/entity/EntityServiceImpl.java
@@ -3,6 +3,7 @@
import com.codahale.metrics.Timer;
import com.linkedin.data.template.GetMode;
import com.linkedin.data.template.SetMode;
+import com.linkedin.entity.client.SystemEntityClient;
import com.linkedin.metadata.config.PreProcessHooks;
import com.datahub.util.RecordUtils;
import com.datahub.util.exception.ModelConversionException;
@@ -93,6 +94,7 @@
import javax.persistence.EntityNotFoundException;
import io.ebean.Transaction;
+import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import static com.linkedin.metadata.Constants.*;
@@ -144,11 +146,11 @@ public class EntityServiceImpl implements EntityService {
private final Map<String, Set<String>> _entityToValidAspects;
private RetentionService _retentionService;
private final Boolean _alwaysEmitChangeLog;
+ @Getter
private final UpdateIndicesService _updateIndicesService;
private final PreProcessHooks _preProcessHooks;
protected static final int MAX_KEYS_PER_QUERY = 500;
-
private final Integer ebeanMaxTransactionRetry;
public EntityServiceImpl(
@@ -180,6 +182,11 @@ public EntityServiceImpl(
ebeanMaxTransactionRetry = retry != null ? retry : DEFAULT_MAX_TRANSACTION_RETRY;
}
+ @Override
+ public void setSystemEntityClient(SystemEntityClient systemEntityClient) {
+ this._updateIndicesService.setSystemEntityClient(systemEntityClient);
+ }
+
/**
* Retrieves the latest aspects corresponding to a batch of {@link Urn}s based on a provided
* set of aspect names.
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/entity/cassandra/CassandraAspectDao.java b/metadata-io/src/main/java/com/linkedin/metadata/entity/cassandra/CassandraAspectDao.java
index b215dd4a5d1ed..9f4a36efb4501 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/entity/cassandra/CassandraAspectDao.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/entity/cassandra/CassandraAspectDao.java
@@ -41,6 +41,7 @@
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
+import java.util.stream.Stream;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
@@ -445,6 +446,12 @@ public PagedList getPagedAspects(final RestoreIndicesArgs args) {
return null;
}
+ @Nonnull
+ @Override
+  public Stream<EntityAspect> streamAspects(String entityName, String aspectName) {
+ // Not implemented
+ return null;
+ }
@Override
@Nonnull
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java b/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java
index 30886db264994..c16c98b34f3eb 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java
@@ -42,6 +42,7 @@
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
+import java.util.stream.Stream;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
@@ -433,6 +434,18 @@ public PagedList getPagedAspects(final RestoreIndicesArgs args) {
.findPagedList();
}
+ @Override
+ @Nonnull
+  public Stream<EntityAspect> streamAspects(String entityName, String aspectName) {
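+    // Stream the latest version of the given aspect for every entity of the given entity type.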
+    ExpressionList<EbeanAspectV2> exp = _server.find(EbeanAspectV2.class)
+ .select(EbeanAspectV2.ALL_COLUMNS)
+ .where()
+ .eq(EbeanAspectV2.VERSION_COLUMN, ASPECT_LATEST_VERSION)
+ .eq(EbeanAspectV2.ASPECT_COLUMN, aspectName)
+ .like(EbeanAspectV2.URN_COLUMN, "urn:li:" + entityName + ":%");
+ return exp.query().findStream().map(EbeanAspectV2::toEntityAspect);
+ }
+
@Override
@Nonnull
public Iterable<String> listAllUrns(int start, int pageSize);
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/graph/elastic/ElasticSearchGraphService.java b/metadata-io/src/main/java/com/linkedin/metadata/graph/elastic/ElasticSearchGraphService.java
index 02e36af343b07..5fdf4d45ffa3b 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/graph/elastic/ElasticSearchGraphService.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/graph/elastic/ElasticSearchGraphService.java
@@ -318,7 +318,7 @@ public void removeEdgesFromNode(
public void configure() {
log.info("Setting up elastic graph index");
try {
- for (ReindexConfig config : getReindexConfigs()) {
+ for (ReindexConfig config : buildReindexConfigs()) {
_indexBuilder.buildIndex(config);
}
} catch (IOException e) {
@@ -327,7 +327,7 @@ public void configure() {
}
@Override
-  public List<ReindexConfig> getReindexConfigs() throws IOException {
+  public List<ReindexConfig> buildReindexConfigs() throws IOException {
return List.of(_indexBuilder.buildReindexState(_indexConvention.getIndexName(INDEX_NAME),
GraphRelationshipMappingsBuilder.getMappings(), Collections.emptyMap()));
}
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/ElasticSearchService.java b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/ElasticSearchService.java
index bf4dffe9e5fb8..ef5a555e95ba8 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/ElasticSearchService.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/ElasticSearchService.java
@@ -46,8 +46,8 @@ public void configure() {
}
@Override
-  public List<ReindexConfig> getReindexConfigs() {
-    return indexBuilders.getReindexConfigs();
+  public List<ReindexConfig> buildReindexConfigs() {
+    return indexBuilders.buildReindexConfigs();
}
@Override
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java
index 10c2fd725dca9..43431e93622f7 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java
@@ -206,12 +206,7 @@ public void buildIndex(ReindexConfig indexState) throws IOException {
// no need to reindex and only new mappings or dynamic settings
// Just update the additional mappings
- if (indexState.isPureMappingsAddition()) {
- log.info("Updating index {} mappings in place.", indexState.name());
- PutMappingRequest request = new PutMappingRequest(indexState.name()).source(indexState.targetMappings());
- _searchClient.indices().putMapping(request, RequestOptions.DEFAULT);
- log.info("Updated index {} with new mappings", indexState.name());
- }
+ applyMappings(indexState, true);
if (indexState.requiresApplySettings()) {
UpdateSettingsRequest request = new UpdateSettingsRequest(indexState.name());
@@ -234,6 +229,26 @@ public void buildIndex(ReindexConfig indexState) throws IOException {
}
}
+ /**
+   * Apply mapping changes in place when a full reindex is not required.
+   * @param indexState the state of the current and target index settings/mappings
+   * @param suppressError whether to suppress the error log when mappings cannot be applied in place
+   *                      (expected during reindex logic; an error for structured properties)
+   * @throws IOException if communication with Elasticsearch fails
+ */
+ public void applyMappings(ReindexConfig indexState, boolean suppressError) throws IOException {
+ if (indexState.isPureMappingsAddition()) {
+ log.info("Updating index {} mappings in place.", indexState.name());
+ PutMappingRequest request = new PutMappingRequest(indexState.name()).source(indexState.targetMappings());
+ _searchClient.indices().putMapping(request, RequestOptions.DEFAULT);
+ log.info("Updated index {} with new mappings", indexState.name());
+ } else {
+ if (!suppressError) {
+ log.error("Attempted to apply invalid mappings. Current: {} Target: {}", indexState.currentMappings(),
+ indexState.targetMappings());
+ }
+ }
+ }
+
public String reindexInPlaceAsync(String indexAlias, @Nullable QueryBuilder filterQuery, BatchWriteOperationsOptions options, ReindexConfig config)
throws Exception {
GetAliasesResponse aliasesResponse = _searchClient.indices().getAlias(
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/EntityIndexBuilder.java b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/EntityIndexBuilder.java
deleted file mode 100644
index 04c9f1993ff35..0000000000000
--- a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/EntityIndexBuilder.java
+++ /dev/null
@@ -1,35 +0,0 @@
-package com.linkedin.metadata.search.elasticsearch.indexbuilder;
-
-import com.linkedin.metadata.models.EntitySpec;
-import java.io.IOException;
-import java.util.List;
-import java.util.Map;
-
-import com.linkedin.metadata.shared.ElasticSearchIndexed;
-import lombok.RequiredArgsConstructor;
-import lombok.extern.slf4j.Slf4j;
-
-
-@Slf4j
-@RequiredArgsConstructor
-public class EntityIndexBuilder implements ElasticSearchIndexed {
- private final ESIndexBuilder indexBuilder;
- private final EntitySpec entitySpec;
- private final SettingsBuilder settingsBuilder;
- private final String indexName;
-
- @Override
- public void reindexAll() throws IOException {
- log.info("Setting up index: {}", indexName);
- for (ReindexConfig config : getReindexConfigs()) {
- indexBuilder.buildIndex(config);
- }
- }
-
- @Override
-  public List<ReindexConfig> getReindexConfigs() throws IOException {
-    Map<String, Object> mappings = MappingsBuilder.getMappings(entitySpec);
-    Map<String, Object> settings = settingsBuilder.getSettings();
- return List.of(indexBuilder.buildReindexState(indexName, mappings, settings));
- }
-}
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/EntityIndexBuilders.java b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/EntityIndexBuilders.java
index f38418058ca6d..56cb26b09dc33 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/EntityIndexBuilders.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/EntityIndexBuilders.java
@@ -3,8 +3,10 @@
import com.linkedin.metadata.models.registry.EntityRegistry;
import com.linkedin.metadata.shared.ElasticSearchIndexed;
import com.linkedin.metadata.utils.elasticsearch.IndexConvention;
+
import java.io.IOException;
import java.util.List;
+import java.util.Map;
import java.util.stream.Collectors;
import lombok.RequiredArgsConstructor;
@@ -14,32 +16,37 @@
@RequiredArgsConstructor
@Slf4j
public class EntityIndexBuilders implements ElasticSearchIndexed {
- private final ESIndexBuilder indexBuilder;
- private final EntityRegistry entityRegistry;
- private final IndexConvention indexConvention;
- private final SettingsBuilder settingsBuilder;
-
- @Override
- public void reindexAll() {
- for (ReindexConfig config : getReindexConfigs()) {
- try {
- indexBuilder.buildIndex(config);
- } catch (IOException e) {
- throw new RuntimeException(e);
- }
- }
- }
-
- @Override
-  public List<ReindexConfig> getReindexConfigs() {
- return entityRegistry.getEntitySpecs().values().stream().flatMap(entitySpec -> {
- try {
- return new EntityIndexBuilder(indexBuilder, entitySpec, settingsBuilder, indexConvention.getIndexName(entitySpec))
- .getReindexConfigs().stream();
- } catch (IOException e) {
+ private final ESIndexBuilder indexBuilder;
+ private final EntityRegistry entityRegistry;
+ private final IndexConvention indexConvention;
+ private final SettingsBuilder settingsBuilder;
+
+ public ESIndexBuilder getIndexBuilder() {
+ return indexBuilder;
+ }
+
+ @Override
+ public void reindexAll() {
+ for (ReindexConfig config : buildReindexConfigs()) {
+ try {
+ indexBuilder.buildIndex(config);
+ } catch (IOException e) {
+ throw new RuntimeException(e);
+ }
+ }
+ }
+
+ @Override
+  public List<ReindexConfig> buildReindexConfigs() {
+    Map<String, Object> settings = settingsBuilder.getSettings();
+ return entityRegistry.getEntitySpecs().values().stream().map(entitySpec -> {
+ try {
+        Map<String, Object> mappings = MappingsBuilder.getMappings(entitySpec);
+ return indexBuilder.buildReindexState(indexConvention.getIndexName(entitySpec), mappings, settings);
+ } catch (IOException e) {
throw new RuntimeException(e);
- }
}
- ).collect(Collectors.toList());
- }
+ }
+ ).collect(Collectors.toList());
+ }
}
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/MappingsBuilder.java b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/MappingsBuilder.java
index b3e05d966e36b..004b2e0a2adc4 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/MappingsBuilder.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/MappingsBuilder.java
@@ -51,6 +51,8 @@ public static Map getPartialNgramConfigWithOverrides(Map getMappings(@Nonnull final EntitySpec entitySp
mappings.put("urn", getMappingsForUrn());
mappings.put("runId", getMappingsForRunId());
- return ImmutableMap.of("properties", mappings);
+ return ImmutableMap.of(PROPERTIES, mappings);
}
private static Map getMappingsForUrn() {
@@ -98,42 +100,9 @@ private static Map getMappingsForField(@Nonnull final Searchable
Map<String, Object> mappings = new HashMap<>();
Map<String, Object> mappingForField = new HashMap<>();
if (fieldType == FieldType.KEYWORD) {
- mappingForField.put(TYPE, KEYWORD);
- mappingForField.put(NORMALIZER, KEYWORD_NORMALIZER);
- // Add keyword subfield without lowercase filter
- mappingForField.put(FIELDS, ImmutableMap.of(KEYWORD, KEYWORD_TYPE_MAP));
+ mappingForField.putAll(getMappingsForKeyword());
} else if (fieldType == FieldType.TEXT || fieldType == FieldType.TEXT_PARTIAL || fieldType == FieldType.WORD_GRAM) {
- mappingForField.put(TYPE, KEYWORD);
- mappingForField.put(NORMALIZER, KEYWORD_NORMALIZER);
- Map subFields = new HashMap<>();
- if (fieldType == FieldType.TEXT_PARTIAL || fieldType == FieldType.WORD_GRAM) {
- subFields.put(NGRAM, getPartialNgramConfigWithOverrides(
- ImmutableMap.of(
- ANALYZER, PARTIAL_ANALYZER
- )
- ));
- if (fieldType == FieldType.WORD_GRAM) {
- for (Map.Entry entry : Map.of(
- WORD_GRAMS_LENGTH_2, WORD_GRAM_2_ANALYZER,
- WORD_GRAMS_LENGTH_3, WORD_GRAM_3_ANALYZER,
- WORD_GRAMS_LENGTH_4, WORD_GRAM_4_ANALYZER).entrySet()) {
- String fieldName = entry.getKey();
- String analyzerName = entry.getValue();
- subFields.put(fieldName, ImmutableMap.of(
- TYPE, TEXT,
- ANALYZER, analyzerName
- ));
- }
- }
- }
- subFields.put(DELIMITED, ImmutableMap.of(
- TYPE, TEXT,
- ANALYZER, TEXT_ANALYZER,
- SEARCH_ANALYZER, TEXT_SEARCH_ANALYZER,
- SEARCH_QUOTE_ANALYZER, CUSTOM_QUOTE_ANALYZER));
- // Add keyword subfield without lowercase filter
- subFields.put(KEYWORD, KEYWORD_TYPE_MAP);
- mappingForField.put(FIELDS, subFields);
+ mappingForField.putAll(getMappingsForSearchText(fieldType));
} else if (fieldType == FieldType.BROWSE_PATH) {
mappingForField.put(TYPE, TEXT);
mappingForField.put(FIELDS,
@@ -189,6 +158,51 @@ private static Map getMappingsForField(@Nonnull final Searchable
return mappings;
}
+  private static Map<String, Object> getMappingsForKeyword() {
+    Map<String, Object> mappingForField = new HashMap<>();
+ mappingForField.put(TYPE, KEYWORD);
+ mappingForField.put(NORMALIZER, KEYWORD_NORMALIZER);
+ // Add keyword subfield without lowercase filter
+ mappingForField.put(FIELDS, ImmutableMap.of(KEYWORD, KEYWORD_TYPE_MAP));
+ return mappingForField;
+ }
+
+  private static Map<String, Object> getMappingsForSearchText(FieldType fieldType) {
+    Map<String, Object> mappingForField = new HashMap<>();
+ mappingForField.put(TYPE, KEYWORD);
+ mappingForField.put(NORMALIZER, KEYWORD_NORMALIZER);
+    Map<String, Object> subFields = new HashMap<>();
+ if (fieldType == FieldType.TEXT_PARTIAL || fieldType == FieldType.WORD_GRAM) {
+ subFields.put(NGRAM, getPartialNgramConfigWithOverrides(
+ ImmutableMap.of(
+ ANALYZER, PARTIAL_ANALYZER
+ )
+ ));
+ if (fieldType == FieldType.WORD_GRAM) {
+      for (Map.Entry<String, String> entry : Map.of(
+ WORD_GRAMS_LENGTH_2, WORD_GRAM_2_ANALYZER,
+ WORD_GRAMS_LENGTH_3, WORD_GRAM_3_ANALYZER,
+ WORD_GRAMS_LENGTH_4, WORD_GRAM_4_ANALYZER).entrySet()) {
+ String fieldName = entry.getKey();
+ String analyzerName = entry.getValue();
+ subFields.put(fieldName, ImmutableMap.of(
+ TYPE, TEXT,
+ ANALYZER, analyzerName
+ ));
+ }
+ }
+ }
+ subFields.put(DELIMITED, ImmutableMap.of(
+ TYPE, TEXT,
+ ANALYZER, TEXT_ANALYZER,
+ SEARCH_ANALYZER, TEXT_SEARCH_ANALYZER,
+ SEARCH_QUOTE_ANALYZER, CUSTOM_QUOTE_ANALYZER));
+ // Add keyword subfield without lowercase filter
+ subFields.put(KEYWORD, KEYWORD_TYPE_MAP);
+ mappingForField.put(FIELDS, subFields);
+ return mappingForField;
+ }
+
private static Map getMappingsForSearchScoreField(
@Nonnull final SearchScoreFieldSpec searchScoreFieldSpec) {
return ImmutableMap.of(searchScoreFieldSpec.getSearchScoreAnnotation().getFieldName(),
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ReindexConfig.java b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ReindexConfig.java
index 4f5f2926d3da0..8b8a48f5d9cda 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ReindexConfig.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ReindexConfig.java
@@ -121,13 +121,14 @@ public ReindexConfig build() {
if (super.exists) {
/* Consider mapping changes */
MapDifference mappingsDiff = Maps.difference(
- (TreeMap) super.currentMappings.getOrDefault("properties", new TreeMap()),
- (TreeMap) super.targetMappings.getOrDefault("properties", new TreeMap()));
+ getOrDefault(super.currentMappings, List.of("properties")),
+ getOrDefault(super.targetMappings, List.of("properties")));
super.requiresApplyMappings = !mappingsDiff.entriesDiffering().isEmpty()
|| !mappingsDiff.entriesOnlyOnRight().isEmpty();
super.isPureMappingsAddition = super.requiresApplyMappings
&& mappingsDiff.entriesDiffering().isEmpty()
&& !mappingsDiff.entriesOnlyOnRight().isEmpty();
+
if (super.requiresApplyMappings && super.isPureMappingsAddition) {
log.info("Index: {} - New fields have been added to index. Adding: {}",
super.name, mappingsDiff.entriesOnlyOnRight());
@@ -171,8 +172,21 @@ public ReindexConfig build() {
return super.build();
}
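+  // Walks nested maps along the given key path, returning an empty TreeMap when any level is missing.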
+  private static TreeMap<String, Object> getOrDefault(Map<String, Object> map, List<String> path) {
+ if (map == null) {
+ return new TreeMap<>();
+ }
+
+    TreeMap<String, Object> item = (TreeMap<String, Object>) map.getOrDefault(path.get(0), new TreeMap<>());
+ if (path.size() == 1) {
+ return item;
+ } else {
+ return getOrDefault(item, path.subList(1, path.size()));
+ }
+ }
+
private boolean isAnalysisEqual() {
- if (!super.targetSettings.containsKey("index")) {
+ if (super.targetSettings == null || !super.targetSettings.containsKey("index")) {
return true;
}
Map indexSettings = (Map) super.targetSettings.get("index");
@@ -186,7 +200,7 @@ private boolean isAnalysisEqual() {
}
private boolean isSettingsEqual() {
- if (!super.targetSettings.containsKey("index")) {
+ if (super.targetSettings == null || !super.targetSettings.containsKey("index")) {
return true;
}
Map indexSettings = (Map) super.targetSettings.get("index");
@@ -196,7 +210,7 @@ private boolean isSettingsEqual() {
}
private boolean isSettingsReindexRequired() {
- if (!super.targetSettings.containsKey("index")) {
+ if (super.targetSettings == null || !super.targetSettings.containsKey("index")) {
return false;
}
Map indexSettings = (Map) super.targetSettings.get("index");
diff --git a/metadata-io/src/main/java/com/linkedin/metadata/search/transformer/SearchDocumentTransformer.java b/metadata-io/src/main/java/com/linkedin/metadata/search/transformer/SearchDocumentTransformer.java
index 76f4736f2746e..49809cf933936 100644
--- a/metadata-io/src/main/java/com/linkedin/metadata/search/transformer/SearchDocumentTransformer.java
+++ b/metadata-io/src/main/java/com/linkedin/metadata/search/transformer/SearchDocumentTransformer.java
@@ -7,6 +7,7 @@
import com.linkedin.common.urn.Urn;
import com.linkedin.data.schema.DataSchema;
import com.linkedin.data.template.RecordTemplate;
+import com.linkedin.entity.client.SystemEntityClient;
import com.linkedin.metadata.models.AspectSpec;
import com.linkedin.metadata.models.EntitySpec;
import com.linkedin.metadata.models.SearchScoreFieldSpec;
@@ -21,6 +22,7 @@
import java.util.stream.Collectors;
import lombok.RequiredArgsConstructor;
+import lombok.Setter;
import lombok.extern.slf4j.Slf4j;
import javax.annotation.Nonnull;
@@ -30,6 +32,7 @@
* Class that provides a utility function that transforms the snapshot object into a search document
*/
@Slf4j
+@Setter
@RequiredArgsConstructor
public class SearchDocumentTransformer {
@@ -42,6 +45,8 @@ public class SearchDocumentTransformer {
// Maximum customProperties value length
private final int maxValueLength;
+ private SystemEntityClient entityClient;
+
private static final String BROWSE_PATH_V2_DELIMITER = "␟";
public Optional transformSnapshot(final RecordTemplate snapshot, final EntitySpec entitySpec,
@@ -72,14 +77,18 @@ public Optional transformAspect(
FieldExtractor.extractFields(aspect, aspectSpec.getSearchableFieldSpecs(), maxValueLength);
final Map> extractedSearchScoreFields =
FieldExtractor.extractFields(aspect, aspectSpec.getSearchScoreFieldSpecs(), maxValueLength);
- if (extractedSearchableFields.isEmpty() && extractedSearchScoreFields.isEmpty()) {
- return Optional.empty();
+
+    Optional<String> result = Optional.empty();
+
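+    // Build a search document only when at least one searchable or search score field was extracted.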
+ if (!extractedSearchableFields.isEmpty() || !extractedSearchScoreFields.isEmpty()) {
+ final ObjectNode searchDocument = JsonNodeFactory.instance.objectNode();
+ searchDocument.put("urn", urn.toString());
+ extractedSearchableFields.forEach((key, values) -> setSearchableValue(key, values, searchDocument, forDelete));
+ extractedSearchScoreFields.forEach((key, values) -> setSearchScoreValue(key, values, searchDocument, forDelete));
+ result = Optional.of(searchDocument.toString());
}
- final ObjectNode searchDocument = JsonNodeFactory.instance.objectNode();
- searchDocument.put("urn", urn.toString());
- extractedSearchableFields.forEach((key, values) -> setSearchableValue(key, values, searchDocument, forDelete));
- extractedSearchScoreFields.forEach((key, values) -> setSearchScoreValue(key, values, searchDocument, forDelete));
- return Optional.of(searchDocument.toString());
+
+ return result;
}
public void setSearchableValue(final SearchableFieldSpec fieldSpec, final List