Merge branch 'datahub-project:master' into master
anshbansal authored Nov 11, 2024
2 parents 24a1bca + 5094dab commit 5a302e5
Showing 16 changed files with 270 additions and 166 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/gx-plugin.yml
@@ -39,6 +39,8 @@ jobs:
extraPythonRequirement: "great-expectations~=0.16.0 numpy~=1.26.0"
- python-version: "3.11"
extraPythonRequirement: "great-expectations~=0.17.0"
- python-version: "3.11"
extraPythonRequirement: "great-expectations~=0.18.0"
fail-fast: false
steps:
- name: Set up JDK 17
6 changes: 5 additions & 1 deletion datahub-web-react/src/app/search/SearchablePage.tsx
@@ -78,7 +78,11 @@ export const SearchablePage = ({ onSearch, onAutoComplete, children }: Props) =>
const formattedPath = location.pathname
.split('/')
.filter((word) => word !== '')
.map((word) => word.charAt(0).toUpperCase() + word.slice(1))
.map((rawWord) => {
// ie. personal-notifications -> Personal Notifications
const words = rawWord.split('-');
return words.map((word) => word.charAt(0).toUpperCase() + word.slice(1)).join(' ');
})
.join(' | ');

if (formattedPath) {
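The new `.map` callback splits each path segment on hyphens before capitalizing, so `personal-notifications` becomes `Personal Notifications`. The same transformation is sketched below in Python, purely for illustration; `format_path` is a hypothetical name and is not part of this commit.

```python
def format_path(pathname: str) -> str:
    """Turn a URL path into a title, e.g. '/a/b-c' -> 'A | B C'."""
    # Drop the empty segments produced by leading or trailing slashes.
    segments = [segment for segment in pathname.split("/") if segment != ""]
    formatted = []
    for raw_word in segments:
        # Split hyphenated segments and capitalize the first letter of each word.
        words = raw_word.split("-")
        formatted.append(" ".join(word[:1].upper() + word[1:] for word in words))
    return " | ".join(formatted)


print(format_path("/settings/personal-notifications"))
```

An empty result (for example, for the path `/`) corresponds to the case the `if (formattedPath)` guard filters out.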
119 changes: 44 additions & 75 deletions docs/what/relationship.md
@@ -1,107 +1,76 @@
# What is a relationship?

A relationship is a named association between exactly two [entities](entity.md), a source and a destination.
A relationship is a named association between exactly two [entities](entity.md), a source and a destination.


<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-modeling.png"/>
</p>


From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship.
Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`.
This is due to the fact that the actual metadata aspect holding this information is associated with `Group`, rather than User.
Had the direction been reversed, the relationship would have been named `IsMemberOf` instead.
See [Direction of Relationships](#direction-of-relationships) for more discussions on relationship directionality.
A specific instance of a relationship, e.g. `urn:li:corpGroup:group1` has a member `urn:li:corpuser:user1`,
From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship.
Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`.
This is due to the fact that the actual metadata aspect holding this information is associated with `Group`, rather than
User.
Had the direction been reversed, the relationship would have been named `IsMemberOf` instead.
See [Direction of Relationships](#direction-of-relationships) for more discussions on relationship directionality.
A specific instance of a relationship, e.g. `urn:li:corpGroup:group1` has a member `urn:li:corpuser:user1`,
corresponds to an edge in the metadata graph.

Similar to an entity, a relationship can also be associated with optional attributes that are derived from the metadata.
For example, from the `Membership` metadata aspect shown below, we’re able to derive the `HasMember` relationship that links a specific `Group` to a specific `User`. We can also include additional attribute to the relationship, e.g. importance, which corresponds to the position of the specific member in the original membership array. This allows complex graph query that travel only relationships that match certain criteria, e.g. "returns only the top-5 most important members of this group."
Similar to the entity attributes, relationship attributes should only be added based on the expected query patterns to reduce the indexing cost.
Relationships are meant to be "entity-neutral". In other words, one would expect to use the same `OwnedBy` relationship
to link a `Dataset` to a `User` and to link a `Dashboard` to a `User`.
As Pegasus doesn’t allow typing a field using multiple URNs (because they’re all essentially strings), we resort to
using generic URN type for the source and destination.
We also introduce a `@Relationship` [annotation](../modeling/extending-the-metadata-model.md/#relationship) to
limit the allowed source and destination URN types.

```
namespace: com.linkedin.group
import com.linkedin.common.AuditStamp
import com.linkedin.common.CorpuserUrn
/**
* The membership metadata for a group
*/
record Membership {
/** Audit stamp for the last change */
modified: AuditStamp
/** Admin of the group */
admin: CorpuserUrn
/** Members of the group, ordered in descending importance */
members: array[CorpuserUrn]
}
```

Relationships are meant to be "entity-neutral". In other words, one would expect to use the same `OwnedBy` relationship to link a `Dataset` to a `User` and to link a `Dashboard` to a `User`. As Pegasus doesn’t allow typing a field using multiple URNs (because they’re all essentially strings), we resort to using generic URN type for the source and destination.
We also introduce a `@pairings` [annotation](https://linkedin.github.io/rest.li/pdl_migration#shorthand-for-custom-properties) to limit the allowed source and destination URN types.

While it’s possible to model relationships in rest.li as [association resources](https://linkedin.github.io/rest.li/modeling/modeling#association), which often get stored as mapping tables, it is far more common to model them as "foreign keys" field in a metadata aspect. For instance, the `Ownership` aspect is likely to contain an array of owner’s corpuser URNs.
While it’s possible to model relationships in rest.li
as [association resources](https://linkedin.github.io/rest.li/modeling/modeling#association), which often get stored as
mapping tables, it is far more common to model them as "foreign keys" field in a metadata aspect. For instance,
the `Ownership` aspect is likely to contain an array of owner’s corpUser URNs.

Below is an example of how a relationship is modeled in PDL. Note that:
1. As the `source` and `destination` are of generic URN type, we’re able to factor them out to a common `BaseRelationship` model.
2. Each model is expected to have a `@pairings` annotation that is an array of all allowed source-destination URN pairs.
3. Unlike entity attributes, there’s no requirement on making all relationship attributes optional since relationships do not support partial updates.

1. This aspect, `nativeGroupMembership` would be associated with a `corpUser`
2. The `corpUser`'s aspect points to one or more parent entities of type `corpGroup`

```
namespace com.linkedin.metadata.relationship
namespace com.linkedin.identity
import com.linkedin.common.Urn
/**
* Common fields that apply to all relationships
* Carries information about the native CorpGroups a user is in.
*/
record BaseRelationship {
/**
* Urn for the source of the relationship
*/
source: Urn
/**
* Urn for the destination of the relationship
*/
destination: Urn
@Aspect = {
"name": "nativeGroupMembership"
}
```

```
namespace com.linkedin.metadata.relationship
/**
* Data model for a has-member relationship
*/
@pairings = [ {
"destination" : "com.linkedin.common.urn.CorpGroupUrn",
"source" : "com.linkedin.common.urn.CorpUserUrn"
} ]
record HasMembership includes BaseRelationship
{
/**
* The importance of the membership
*/
importance: int
record NativeGroupMembership {
@Relationship = {
"/*": {
"name": "IsMemberOfNativeGroup",
"entityTypes": [ "corpGroup" ]
}
}
nativeGroups: array[Urn]
}
```
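The `entityTypes` list in the `@Relationship` annotation restricts which entity types a destination URN may refer to. A minimal Python sketch of that idea follows, assuming URNs of the form `urn:li:<entityType>:<id>` as in the examples above. `urn_entity_type` and `invalid_destinations` are hypothetical helpers for illustration, not DataHub's actual validation code.

```python
from typing import Iterable, List


def urn_entity_type(urn: str) -> str:
    """Extract the entity type from a URN shaped like urn:li:<entityType>:<id>."""
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "li" or not parts[2]:
        raise ValueError(f"not a recognized URN: {urn!r}")
    return parts[2]


def invalid_destinations(urns: Iterable[str], entity_types: Iterable[str]) -> List[str]:
    """Return the URNs whose entity type is not in the allowed list."""
    allowed = set(entity_types)
    return [urn for urn in urns if urn_entity_type(urn) not in allowed]


# Hypothetical aspect value: one valid corpGroup URN and one stray corpuser URN.
native_groups = ["urn:li:corpGroup:group1", "urn:li:corpuser:user1"]
print(invalid_destinations(native_groups, ["corpGroup"]))
```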

## Direction of Relationships

As relationships are modeled as directed edges between nodes, it’s natural to ask which way should it be pointing,
or should there be edges going both ways? The answer is, "doesn’t really matter." It’s rather an aesthetic choice than technical one.
As relationships are modeled as directed edges between nodes, it’s natural to ask which way should it be pointing,
or should there be edges going both ways? The answer is, "doesn’t really matter." It’s rather an aesthetic choice than
technical one.

For one, the actual direction doesn’t really impact the execution of graph queries. Most graph DBs are fully capable of traversing edges in reverse direction efficiently.
For one, the actual direction doesn’t really impact the execution of graph queries. Most graph DBs are fully capable of
traversing edges in reverse direction efficiently.

That being said, generally there’s a more "natural way" to specify the direction of a relationship, which closely relates to how the metadata is stored. For example, the membership information for an LDAP group is generally stored as a list in the group’s metadata. As a result, it’s more natural to model a `HasMember` relationship that points from a group to a member, instead of an `IsMemberOf` relationship pointing from member to group.
That being said, generally there’s a more "natural way" to specify the direction of a relationship, which closely relates
to how the metadata is stored. For example, the membership information for an LDAP group is generally stored as a list
in the group’s metadata. As a result, it’s more natural to model a `HasMember` relationship that points from a group to a
member, instead of an `IsMemberOf` relationship pointing from member to group.
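
To illustrate why the stored direction matters little for queries, the sketch below records `HasMember` edges in one direction and maintains a reverse index so that membership can be looked up from either end. This is a toy in-memory model with hypothetical names (`EdgeStore`, `members_of`, `groups_of`), not how DataHub or any particular graph database implements traversal.

```python
from collections import defaultdict
from typing import DefaultDict, List


class EdgeStore:
    """Toy store for directed HasMember edges (group -> user)."""

    def __init__(self) -> None:
        self._forward: DefaultDict[str, List[str]] = defaultdict(list)
        self._reverse: DefaultDict[str, List[str]] = defaultdict(list)

    def add_has_member(self, group: str, user: str) -> None:
        # The edge is declared once, from group to user, but indexed both ways.
        self._forward[group].append(user)
        self._reverse[user].append(group)

    def members_of(self, group: str) -> List[str]:
        """Follow HasMember edges in their stored direction."""
        return list(self._forward.get(group, []))

    def groups_of(self, user: str) -> List[str]:
        """Follow the same edges in reverse, answering an IsMemberOf-style query."""
        return list(self._reverse.get(user, []))


store = EdgeStore()
store.add_has_member("urn:li:corpGroup:group1", "urn:li:corpuser:user1")
print(store.groups_of("urn:li:corpuser:user1"))
```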

## High Cardinality Relationships

See [this doc](../advanced/high-cardinality.md) for suggestions on how to best model relationships with high cardinality.
See [this doc](../advanced/high-cardinality.md) for suggestions on how to best model relationships with high
cardinality.
Original file line number Diff line number Diff line change
@@ -1,14 +1,9 @@
package com.datahub.util.validator;

import com.linkedin.common.urn.Urn;
import com.linkedin.data.DataList;
import com.linkedin.data.DataMap;
import com.linkedin.data.schema.RecordDataSchema;
import com.linkedin.data.schema.UnionDataSchema;
import com.linkedin.data.template.RecordTemplate;
import com.linkedin.data.template.UnionTemplate;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.annotation.Nonnull;
@@ -63,8 +58,6 @@ public static void validateRelationshipSchema(@Nonnull RecordDataSchema schema)
"Relationship '%s' contains a field '%s' that makes use of a disallowed type '%s'.",
className, field.getName(), field.getType().getType());
});

validatePairings(schema);
}

/**
@@ -109,59 +102,4 @@ public static void validateRelationshipUnionSchema(
relationshipClassName);
}
}

private static void validatePairings(@Nonnull RecordDataSchema schema) {

final String className = schema.getBindingName();

Map<String, Object> properties = schema.getProperties();
if (!properties.containsKey("pairings")) {
ValidationUtils.invalidSchema(
"Relationship '%s' must contain a 'pairings' property", className);
}

DataList pairings = (DataList) properties.get("pairings");
Set<Pair> registeredPairs = new HashSet<>();
pairings.stream()
.forEach(
obj -> {
DataMap map = (DataMap) obj;
if (!map.containsKey("source") || !map.containsKey("destination")) {
ValidationUtils.invalidSchema(
"Relationship '%s' contains an invalid 'pairings' item. "
+ "Each item must contain a 'source' and 'destination' properties.",
className);
}

String sourceUrn = map.getString("source");
if (!isValidUrnClass(sourceUrn)) {
ValidationUtils.invalidSchema(
"Relationship '%s' contains an invalid item in 'pairings'. %s is not a valid URN class name.",
className, sourceUrn);
}

String destinationUrn = map.getString("destination");
if (!isValidUrnClass(destinationUrn)) {
ValidationUtils.invalidSchema(
"Relationship '%s' contains an invalid item in 'pairings'. %s is not a valid URN class name.",
className, destinationUrn);
}

Pair pair = new Pair(sourceUrn, destinationUrn);
if (registeredPairs.contains(pair)) {
ValidationUtils.invalidSchema(
"Relationship '%s' contains a repeated 'pairings' item (%s, %s)",
className, sourceUrn, destinationUrn);
}
registeredPairs.add(pair);
});
}

private static boolean isValidUrnClass(String className) {
try {
return Urn.class.isAssignableFrom(Class.forName(className));
} catch (ClassNotFoundException e) {
throw new RuntimeException(e);
}
}
}
2 changes: 1 addition & 1 deletion metadata-ingestion-modules/gx-plugin/setup.py
@@ -38,7 +38,7 @@ def get_long_description():
     # GE added handling for higher version of jinja2 in version 0.15.12
     # https://github.com/great-expectations/great_expectations/pull/5382/files
     # TODO: support GX 0.18.0
-    "great-expectations>=0.15.12, <0.18.0",
+    "great-expectations>=0.15.12, <1.0.0",
     # datahub does not depend on traitlets directly but great expectations does.
     # https://github.com/ipython/traitlets/issues/741
     "traitlets<5.2.2",
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@
 from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

 import datahub.emitter.mce_builder as builder
+import packaging.version
 from datahub.cli.env_utils import get_boolean_env_variable
 from datahub.emitter.mcp import MetadataChangeProposalWrapper
 from datahub.emitter.rest_emitter import DatahubRestEmitter
@@ -59,6 +60,16 @@
 from sqlalchemy.engine.base import Connection, Engine
 from sqlalchemy.engine.url import make_url

+# TODO: move this and version check used in tests to some common module
+try:
+    from great_expectations import __version__ as GX_VERSION  # type: ignore
+
+    has_name_positional_arg = packaging.version.parse(
+        GX_VERSION
+    ) >= packaging.version.Version("0.18.0")
+except Exception:
+    has_name_positional_arg = False
+
 if TYPE_CHECKING:
     from great_expectations.data_context.types.resource_identifiers import (
         GXCloudIdentifier,
@@ -78,6 +89,8 @@ class DataHubValidationAction(ValidationAction):
     def __init__(
         self,
         data_context: AbstractDataContext,
+        # this would capture `name` positional arg added in GX 0.18.0
+        *args: Union[str, Any],
         server_url: str,
         env: str = builder.DEFAULT_ENV,
         platform_alias: Optional[str] = None,
@@ -94,7 +107,12 @@ def __init__(
         name: str = "DataHubValidationAction",
     ):

-        super().__init__(data_context)
+        if has_name_positional_arg:
+            if len(args) >= 1 and isinstance(args[0], str):
+                name = args[0]
+            super().__init__(data_context, name)
+        else:
+            super().__init__(data_context)
         self.server_url = server_url
         self.env = env
         self.platform_alias = platform_alias
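
The constructor change above relies on `*args` to absorb an optional positional `name` while keeping every other parameter keyword-only. The sketch below reproduces that pattern with stand-in base classes so it runs without Great Expectations installed. `OldBase`, `NewBase`, and `make_action` are hypothetical names used only for illustration, and the boolean flag replaces the real version check.

```python
from typing import Any


class OldBase:
    """Stand-in for a base class whose constructor takes no `name`."""

    def __init__(self, data_context: Any) -> None:
        self.data_context = data_context


class NewBase:
    """Stand-in for a base class that takes `name` as a positional argument."""

    def __init__(self, data_context: Any, name: str) -> None:
        self.data_context = data_context
        self.name = name


def make_action(base: type, has_name_positional_arg: bool) -> type:
    """Build an action class on top of `base`, mirroring the version gate."""

    class Action(base):  # type: ignore[misc, valid-type]
        def __init__(
            self,
            data_context: Any,
            *args: Any,  # captures an optional positional `name`
            server_url: str,
            name: str = "DataHubValidationAction",
        ) -> None:
            if has_name_positional_arg:
                # Prefer a positional name when one was supplied.
                if len(args) >= 1 and isinstance(args[0], str):
                    name = args[0]
                super().__init__(data_context, name)
            else:
                super().__init__(data_context)
            self.server_url = server_url

    return Action


NewAction = make_action(NewBase, True)
action = NewAction("ctx", "my_action", server_url="http://localhost:8080")
print(action.name)
```

Because `server_url` follows `*args`, it can only be passed by keyword, so an extra positional string is never mistaken for it.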
Original file line number Diff line number Diff line change
@@ -30,7 +30,7 @@
 )
 from great_expectations.core.id_dict import IDDict
 from great_expectations.core.run_identifier import RunIdentifier
-from great_expectations.data_context import DataContext, FileDataContext
+from great_expectations.data_context import FileDataContext
 from great_expectations.data_context.types.resource_identifiers import (
     ExpectationSuiteIdentifier,
     ValidationResultIdentifier,
@@ -52,7 +52,7 @@


 @pytest.fixture(scope="function")
-def ge_data_context(tmp_path: str) -> DataContext:
+def ge_data_context(tmp_path: str) -> FileDataContext:
     return FileDataContext.create(tmp_path)


@@ -233,7 +233,7 @@ def ge_validation_result_suite_id_pandas() -> ValidationResultIdentifier:
 @mock.patch("datahub.emitter.rest_emitter.DatahubRestEmitter.emit_mcp", autospec=True)
 def test_DataHubValidationAction_sqlalchemy(
     mock_emitter: mock.MagicMock,
-    ge_data_context: DataContext,
+    ge_data_context: FileDataContext,
     ge_validator_sqlalchemy: Validator,
     ge_validation_result_suite: ExpectationSuiteValidationResult,
     ge_validation_result_suite_id: ValidationResultIdentifier,
@@ -337,7 +337,7 @@ def test_DataHubValidationAction_sqlalchemy(
 @mock.patch("datahub.emitter.rest_emitter.DatahubRestEmitter.emit_mcp", autospec=True)
 def test_DataHubValidationAction_pandas(
     mock_emitter: mock.MagicMock,
-    ge_data_context: DataContext,
+    ge_data_context: FileDataContext,
     ge_validator_pandas: Validator,
     ge_validation_result_suite_pandas: ExpectationSuiteValidationResult,
     ge_validation_result_suite_id_pandas: ValidationResultIdentifier,
@@ -399,7 +399,7 @@ def test_DataHubValidationAction_pandas(


 def test_DataHubValidationAction_graceful_failure(
-    ge_data_context: DataContext,
+    ge_data_context: FileDataContext,
     ge_validator_sqlalchemy: Validator,
     ge_validation_result_suite: ExpectationSuiteValidationResult,
     ge_validation_result_suite_id: ValidationResultIdentifier,
@@ -418,7 +418,7 @@ def test_DataHubValidationAction_graceful_failure(


 def test_DataHubValidationAction_not_supported(
-    ge_data_context: DataContext,
+    ge_data_context: FileDataContext,
     ge_validator_spark: Validator,
     ge_validation_result_suite: ExpectationSuiteValidationResult,
     ge_validation_result_suite_id: ValidationResultIdentifier,
5 changes: 4 additions & 1 deletion metadata-ingestion/src/datahub/cli/json_file.py
@@ -1,12 +1,15 @@
 import logging

+from datahub.ingestion.api.common import PipelineContext
 from datahub.ingestion.source.file import GenericFileSource

 logger = logging.getLogger(__name__)


 def check_mce_file(filepath: str) -> str:
-    mce_source = GenericFileSource.create({"filename": filepath}, None)
+    mce_source = GenericFileSource.create(
+        {"filename": filepath}, PipelineContext(run_id="json-file")
+    )
     for _ in mce_source.get_workunits():
         pass
     if len(mce_source.get_report().failures):