Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SemanticModel Node Type #7769

Merged
merged 15 commits into from
Jun 8, 2023
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .changes/unreleased/Features-20230606-165351.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
kind: Features
body: Added support for parsing and serializaing semantic models
time: 2023-06-06T16:53:51.117429-04:00
custom:
Author: peterallenwebb
Issue: 7499 7503
1 change: 1 addition & 0 deletions core/dbt/contracts/files.py
Original file line number Diff line number Diff line change
Expand Up @@ -228,6 +228,7 @@ class SchemaSourceFile(BaseSourceFile):
groups: List[str] = field(default_factory=list)
# node patches contain models, seeds, snapshots, analyses
ndp: List[str] = field(default_factory=list)
semantic_models: List[str] = field(default_factory=list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semantic_nodes? (see other comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gshank To clarify, do you think we should change the name in the yml file, or just for our internal property names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for for the internal property names. I think "semantic_models" in the yaml files is fine, it matches the "models" section we already have. Naming is so hard.

# any macro patches in this file by macro unique_id.
mcp: Dict[str, str] = field(default_factory=dict)
# any source patches in this file. The entries are package, name pairs
Expand Down
29 changes: 20 additions & 9 deletions core/dbt/contracts/graph/manifest.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,21 +25,22 @@
from dbt.contracts.publication import ProjectDependencies, PublicationConfig, PublicModel

from dbt.contracts.graph.nodes import (
Macro,
BaseNode,
Documentation,
SourceDefinition,
GenericTestNode,
Exposure,
Metric,
GenericTestNode,
GraphMemberNode,
Group,
UnpatchedSourceDefinition,
Macro,
ManifestNode,
GraphMemberNode,
ResultNode,
BaseNode,
ManifestOrPublicNode,
Metric,
ModelNode,
RelationalNode,
ResultNode,
SemanticModel,
SourceDefinition,
UnpatchedSourceDefinition,
)
from dbt.contracts.graph.unparsed import SourcePatch, NodeVersion, UnparsedVersion
from dbt.contracts.graph.manifest_upgrade import upgrade_manifest_json
Expand Down Expand Up @@ -706,6 +707,7 @@ class Manifest(MacroMethods, DataClassMessagePackMixin, dbtClassMixin):
public_nodes: MutableMapping[str, PublicModel] = field(default_factory=dict)
project_dependencies: Optional[ProjectDependencies] = None
publications: MutableMapping[str, PublicationConfig] = field(default_factory=dict)
semantic_models: MutableMapping[str, SemanticModel] = field(default_factory=dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing pattern is the node objects use "Model" and the dictionaries use "nodes", that is Model nodes are found in "nodes" dictionary, "PublicModel" nodes are found in public_nodes. Could this be the "semantic_nodes" to preserve that pattern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, do you think we just rely on the existing nodes dictionary, and not have a special case for SemanticModel objects? Would it matter that the SemanticModels are not yet linked into the compiled graph?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original "metrics" went in a "metrics" dictionary... This has been a bit ad-hoc and not closely thought through, but it feels like the existing pattern is that SQL things go in nodes (plus seeds...) and yaml-generated things go in their own dictionaries. There are some assumptions in the rest of the code that match, such as the links in the file objects, etc. So I think it still makes sense to put semantic models in their own dictionary. If anything I'd be tempted to separate out some of the existing things that are in the nodes dictionary into their own dictionaries and maybe make some combined "indexes" (for cases where we don't want to loop over "nodes" all the time...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to mention that one of the reasons for going in the direction of more individual dictionaries rather than less, is that currently jsonschema and deserializers can't always correctly guess the classes of the objects in the "nodes" dictionary, leading to other hacky things like the big deserialization if statement in the _deserialize method in ParsedNode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all really good to know. It seems like the right direction to me.


_doc_lookup: Optional[DocLookup] = field(
default=None, metadata={"serialize": lambda x: None, "deserialize": lambda x: None}
Expand Down Expand Up @@ -894,7 +896,7 @@ def build_group_map(self):
group_map[node.group].append(node.unique_id)
self.group_map = group_map

def writable_manifest(self):
def writable_manifest(self) -> "WritableManifest":
self.build_parent_and_child_maps()
self.build_group_map()
return WritableManifest(
Expand All @@ -912,6 +914,7 @@ def writable_manifest(self):
child_map=self.child_map,
parent_map=self.parent_map,
group_map=self.group_map,
semantic_models=self.semantic_models,
)

def write(self, path):
Expand Down Expand Up @@ -1246,6 +1249,11 @@ def add_doc(self, source_file: SourceFile, doc: Documentation):
self.docs[doc.unique_id] = doc
source_file.docs.append(doc.unique_id)

def add_semantic_model(self, source_file: SchemaSourceFile, semantic_model: SemanticModel):
peterallenwebb marked this conversation as resolved.
Show resolved Hide resolved
_check_duplicates(semantic_model, self.semantic_models)
self.semantic_models[semantic_model.unique_id] = semantic_model
source_file.semantic_models.append(semantic_model.unique_id)

# end of methods formerly in ParseResult

# Provide support for copy.deepcopy() - we just need to avoid the lock!
Expand Down Expand Up @@ -1345,6 +1353,9 @@ class WritableManifest(ArtifactMixin):
public_nodes: Mapping[UniqueID, PublicModel] = field(
metadata=dict(description=("The public models used in the dbt project"))
)
semantic_models: Mapping[UniqueID, SemanticModel] = field(
metadata=dict(description=("The semantic models defined in the dbt project"))
)
metadata: ManifestMetadata = field(
metadata=dict(
description="Metadata about the manifest",
Expand Down
2 changes: 2 additions & 0 deletions core/dbt/contracts/graph/manifest_upgrade.py
Original file line number Diff line number Diff line change
Expand Up @@ -127,4 +127,6 @@ def upgrade_manifest_json(manifest: dict) -> dict:
if "root_path" in doc_content:
del doc_content["root_path"]
doc_content["resource_type"] = "doc"
if "semantic_models" not in manifest:
manifest["semantic_models"] = {}
return manifest
66 changes: 50 additions & 16 deletions core/dbt/contracts/graph/nodes.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,29 +6,23 @@
import hashlib

from mashumaro.types import SerializableType
from typing import (
Optional,
Union,
List,
Dict,
Any,
Sequence,
Tuple,
Iterator,
)
from typing import Optional, Union, List, Dict, Any, Sequence, Tuple, Iterator, Protocol

from dbt.dataclass_schema import dbtClassMixin, ExtensibleDbtClassMixin

from dbt.clients.system import write_file
from dbt.contracts.files import FileHash
from dbt.contracts.graph.unparsed import (
Dimension,
Docs,
Entity,
ExposureType,
ExternalTable,
FreshnessThreshold,
HasYamlMetadata,
MacroArgument,
MaturityType,
Measure,
MetricFilter,
MetricTime,
Owner,
Expand Down Expand Up @@ -62,12 +56,6 @@
EmptySnapshotConfig,
SnapshotConfig,
)
import sys

if sys.version_info >= (3, 8):
from typing import Protocol
else:
from typing_extensions import Protocol


# =====================================================================
Expand Down Expand Up @@ -564,6 +552,30 @@ def depends_on_macros(self):
return self.depends_on.macros


@dataclass
class FileSlice(dbtClassMixin, Replaceable):
peterallenwebb marked this conversation as resolved.
Show resolved Hide resolved
"""Provides file slice level context about what something was created from.

Implementation of the dbt-semantic-interfaces `FileSlice` protocol
"""

filename: str
content: str
start_line_number: int
end_line_number: int


@dataclass
class Metadata(dbtClassMixin, Replaceable):
peterallenwebb marked this conversation as resolved.
Show resolved Hide resolved
"""Provides file context about what something was created from.

Implementation of the dbt-semantic-interfaces `Metadata` protocol
"""

repo_file_path: str
file_slice: FileSlice


# ====================================
# CompiledNode subclasses
# ====================================
Expand Down Expand Up @@ -1411,6 +1423,28 @@ class Group(BaseNode):
resource_type: NodeType = field(metadata={"restrict": [NodeType.Group]})


# ====================================
# SemanticModel and related classes
# ====================================


@dataclass
class NodeRelation(dbtClassMixin):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stu added in a StateRelation class, I think we can use that instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That one doesn't actually have the relation_name in it, and I think they want that. Settling on a common relation class is the "larger issue" that I mentioned we didn't need to solve at this particular moment.

alias: str
schema_name: str # TODO: Could this be called simply "schema" so we could reuse StateRelation?
database: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From this comment -- does this still need to include a unified relation_name that respects quoting + include policies?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do -- we'll probably need the adapter.Relation to do something like this during compilation:

adapter = get_adapter(self.config)
relation_cls = adapter.Relation
relation_name = str(relation_cls.create_from(self.config, node))

(from https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/compilation.py#L486-L488)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the SemanticModels actually get compiled, since they're yaml-only. There is a question of whether they need the individual pieces (identifier/schema/database) or just the relation_name...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@QMalcolm I'm not sure of the answer to this, but I suspect the answer may be yes. What makes sense from the perspective of MetricFlow integration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichelleArk Yep! We need to include the unified relation_name, and it's desirable that it respects quoting + include policies.

As for whether we need both the individual pieces as well as the relation_name, for this PR yes, long term we don't have to. MetricFlow is fine with having the parts or the relation_name. It was suggested that we use this node_relation object as it was a pattern that was already used elsewhere, thus the shape of the object currently. Though MetricFlow currently uses the relation_name attribute as seen here, and ignores the individual properties. However, the individual properties as well as the relation_name are currently required by the protocol. So for the scope of this PR, we should include the individual properties as well as the relation_name. Otherwise we'd first need to update and release a new DSI (which would include a number of other schema changes that have been made), and then propagate all those changes in core. Although the node_relation changes might be trivial in core, propagating the other changes with the new DSI could eat a fair bit of time.



@dataclass
class SemanticModel(GraphNode):
description: Optional[str]
model: str
node_relation: Optional[NodeRelation]
entities: Sequence[Entity]
measures: Sequence[Measure]
dimensions: Sequence[Dimension]


# ====================================
# Patches
# ====================================
Expand Down
54 changes: 54 additions & 0 deletions core/dbt/contracts/graph/unparsed.py
Original file line number Diff line number Diff line change
Expand Up @@ -661,6 +661,60 @@ def validate(cls, data):
raise ValidationError("Group owner must have at least one of 'name' or 'email'.")


#
# semantic interfaces unparsed objects
#


@dataclass
class Entity(dbtClassMixin, Replaceable):
peterallenwebb marked this conversation as resolved.
Show resolved Hide resolved
name: str
type: str # actually an enum
description: Optional[str] = None
role: Optional[str] = None
expr: Optional[str] = None


@dataclass
class MeasureAggregationParameters(dbtClassMixin, Replaceable):
percentile: Optional[float] = None
use_discrete_percentile: bool = False
use_approximate_percentile: bool = False


@dataclass
class Measure(dbtClassMixin, Replaceable):
name: str
agg: str # actually an enum
description: Optional[str] = None
create_metric: Optional[bool] = None
expr: Optional[str] = None
agg_params: Optional[MeasureAggregationParameters] = None
non_additive_dimension: Optional[Dict[str, Any]] = None
agg_time_dimension: Optional[str] = None


@dataclass
class Dimension(dbtClassMixin, Replaceable):
name: str
type: str # actually an enum
description: Optional[str] = None
is_partition: Optional[bool] = False
type_params: Optional[Dict[str, Any]] = None
expr: Optional[str] = None
# TODO metadata: Optional[Metadata] (this would actually be the YML for the dimension)


@dataclass
class UnparsedSemanticModel(dbtClassMixin, Replaceable):
name: str
description: Optional[str]
model: str # looks like "ref(...)"
entities: List[Entity] = field(default_factory=list)
measures: List[Measure] = field(default_factory=list)
dimensions: List[Dimension] = field(default_factory=list)


def normalize_date(d: Optional[datetime.date]) -> Optional[datetime.datetime]:
"""Convert date to datetime (at midnight), and add local time zone if naive"""
if d is None:
Expand Down
1 change: 1 addition & 0 deletions core/dbt/node_types.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ class NodeType(StrEnum):
Exposure = "exposure"
Metric = "metric"
Group = "group"
SemanticModel = "semantic model"

@classmethod
def executable(cls) -> List["NodeType"]:
Expand Down
25 changes: 25 additions & 0 deletions core/dbt/parser/manifest.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,7 @@
DeprecatedReference,
UpcomingReferenceDeprecation,
)
from dbt_extractor import py_extract_from_source # type: ignore
from dbt.logger import DbtProcessState
from dbt.node_types import NodeType, AccessType
from dbt.clients.jinja import get_rendered, MacroStack
Expand Down Expand Up @@ -99,6 +100,7 @@
ManifestNode,
ResultNode,
ModelNode,
NodeRelation,
)
from dbt.contracts.graph.unparsed import NodeVersion
from dbt.contracts.util import Writable
Expand Down Expand Up @@ -528,6 +530,7 @@ def load(self):
self.process_refs(self.root_project.project_name)
self.process_docs(self.root_project)
self.process_metrics(self.root_project)
self.process_semantic_models()
self.check_valid_group_config()

# update tracking data
Expand Down Expand Up @@ -1176,6 +1179,28 @@ def process_metrics(self, config: RuntimeConfig):
continue
_process_metrics_for_node(self.manifest, current_project, exposure)

def process_semantic_models(self) -> None:
for semantic_model in self.manifest.semantic_models.values():
if semantic_model.model:
statically_parsed = py_extract_from_source(f"{{{{ {semantic_model.model} }}}}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call can throw an ExtractionError (defined in dbt_extractor) -- would be good to catch it and give a throw user-facing error instead since it is likely to be a typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, and we'll follow up in the current sprint with #7822.

if statically_parsed["refs"]:

ref = statically_parsed["refs"][0]
if len(ref) == 2:
input_package_name, input_model_name = ref
else:
input_package_name, input_model_name = None, ref[0]

refd_node = self.manifest.ref_lookup.find(
input_model_name, input_package_name, None, self.manifest
Copy link
Contributor

@MichelleArk MichelleArk Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a #TODO comment here that links to #7688? refs with the version keyword won't be supported until 7688 is closed. At that point we can pass in the parsed version instead of always passing None to the ref lookup call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, and we'll follow up in the current sprint with #7822.

)
if isinstance(refd_node, ModelNode):
semantic_model.node_relation = NodeRelation(
alias=refd_node.alias,
schema_name=refd_node.schema,
database=refd_node.database,
)
MichelleArk marked this conversation as resolved.
Show resolved Hide resolved

# nodes: node and column descriptions
# sources: source and table descriptions, column descriptions
# macros: macro argument descriptions
Expand Down
52 changes: 50 additions & 2 deletions core/dbt/parser/schema_yaml_readers.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,13 @@
from dbt.parser.schemas import YamlReader, SchemaParser
from dbt.parser.common import YamlBlock
from dbt.node_types import NodeType
from dbt.contracts.graph.unparsed import UnparsedExposure, UnparsedMetric, UnparsedGroup
from dbt.contracts.graph.nodes import Exposure, Metric, Group
from dbt.contracts.graph.unparsed import (
UnparsedExposure,
UnparsedGroup,
UnparsedMetric,
UnparsedSemanticModel,
)
from dbt.contracts.graph.nodes import Exposure, Group, Metric, SemanticModel
from dbt.exceptions import DbtInternalError, YamlParseDictError, JSONValidationError
from dbt.context.providers import generate_parse_exposure, generate_parse_metrics
from dbt.contracts.graph.model_config import MetricConfig, ExposureConfig
Expand Down Expand Up @@ -269,3 +274,46 @@ def parse(self):
raise YamlParseDictError(self.yaml.path, self.key, data, exc)

self.parse_group(unparsed)


class SemanticModelParser(YamlReader):
def __init__(self, schema_parser: SchemaParser, yaml: YamlBlock):
super().__init__(schema_parser, yaml, "semantic_models")
self.schema_parser = schema_parser
self.yaml = yaml

def parse_semantic_model(self, unparsed: UnparsedSemanticModel):
package_name = self.project.project_name
unique_id = f"{NodeType.SemanticModel}.{package_name}.{unparsed.name}"
path = self.yaml.path.relative_path

fqn = self.schema_parser.get_fqn_prefix(path)
fqn.append(unparsed.name)

parsed = SemanticModel(
description=unparsed.description,
fqn=fqn,
model=unparsed.model,
name=unparsed.name,
node_relation=None, # Resolved from the value of "model" after parsing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my own understanding -- we need to resolve the relation later on because not all model nodes are available in the manifest for lookup at this point?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the node relation evaluate lazily, so we can put its value in here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SemanticModels come from yaml only, and the referenced models should always be sql based and so already resolvable when this is parsed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichelleArk Yes, I discussed this with Gerda and she thought we could possibly handle it immediately, since models would likely have been created, but doing it at the end would be most likely to succeed without complications.

@aranke I'm can follow up the idea of lazy evaluation in the current sprint. If you have a specific method of doing it in mind can you add it as a comment to #7822?

original_file_path=self.yaml.path.original_file_path,
package_name=package_name,
path=path,
resource_type=NodeType.SemanticModel,
unique_id=unique_id,
entities=unparsed.entities,
measures=unparsed.measures,
dimensions=unparsed.dimensions,
)

self.manifest.add_semantic_model(self.yaml.file, parsed)

def parse(self):
for data in self.get_key_dicts():
try:
UnparsedSemanticModel.validate(data)
unparsed = UnparsedSemanticModel.from_dict(data)
except (ValidationError, JSONValidationError) as exc:
raise YamlParseDictError(self.yaml.path, self.key, data, exc)

self.parse_semantic_model(unparsed)
Loading