Enable custom granularity inputs for filters #346

Merged: 3 commits, Sep 18, 2024
6 changes: 6 additions & 0 deletions .changes/unreleased/Features-20240908-221757.yaml
@@ -0,0 +1,6 @@
kind: Features
body: Allow filter parameters to accept custom granularities
time: 2024-09-08T22:17:57.865557-07:00
custom:
Author: tlento
Issue: "345"
3 changes: 1 addition & 2 deletions dbt_semantic_interfaces/call_parameter_sets.py
@@ -10,7 +10,6 @@
MetricReference,
TimeDimensionReference,
)
- from dbt_semantic_interfaces.type_enums import TimeGranularity
from dbt_semantic_interfaces.type_enums.date_part import DatePart


@@ -29,7 +28,7 @@ class TimeDimensionCallParameterSet:

entity_path: Tuple[EntityReference, ...]
time_dimension_reference: TimeDimensionReference
- time_granularity: Optional[TimeGranularity] = None
+ time_granularity_name: Optional[str] = None
date_part: Optional[DatePart] = None


6 changes: 5 additions & 1 deletion dbt_semantic_interfaces/naming/dundered.py
@@ -20,7 +20,11 @@ class StructuredDunderedName:
element_name: "ds"
granularity: TimeGranularity.WEEK

- The time granularity is part of legacy query syntax and there are plans to migrate away from this format.
+ The time granularity is part of legacy query syntax and there are plans to migrate away from this format. As such,
+ this will not be updated to allow for custom granularity values. This implies that any query paths that push named
+ parameters through this class will not support a custom grain reference of the form `metric_time__martian_year`,
+ and users wishing to use their martian year grain will have to explicitly reference it via a separate parameter
+ instead of gluing it onto the end of the name.
"""

entity_links: Tuple[EntityReference, ...]
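The mis-parse risk the docstring above describes can be sketched with a simplified splitter. This is an illustrative stand-in, not the actual StructuredDunderedName implementation; the grain set and the `parse_dundered_name` helper are assumptions for demonstration only:

```python
from enum import Enum
from typing import Optional, Tuple


class TimeGranularity(Enum):
    DAY = "day"
    WEEK = "week"
    MONTH = "month"
    YEAR = "year"


STANDARD_GRAINS = {g.value for g in TimeGranularity}


def parse_dundered_name(name: str) -> Tuple[Tuple[str, ...], str, Optional[str]]:
    """Split a dundered name into (entity_links, element_name, granularity).

    Only standard granularities are recognized as a trailing grain. A custom
    grain such as 'martian_year' is not recognized, so the name is mis-split:
    the custom grain ends up as the element name.
    """
    parts = name.split("__")
    if len(parts) > 1 and parts[-1] in STANDARD_GRAINS:
        return tuple(parts[:-2]), parts[-2], parts[-1]
    return tuple(parts[:-1]), parts[-1], None
```

Under this sketch, `listing__ds__week` parses cleanly, while `metric_time__martian_year` silently produces element name `martian_year` with entity link `metric_time`, which is why custom grains must be passed as a separate parameter instead.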
13 changes: 2 additions & 11 deletions dbt_semantic_interfaces/parsing/text_input/ti_description.py
@@ -7,7 +7,6 @@
from dbt_semantic_interfaces.enum_extension import assert_values_exhausted
from dbt_semantic_interfaces.errors import InvalidQuerySyntax
from dbt_semantic_interfaces.naming.dundered import StructuredDunderedName
- from dbt_semantic_interfaces.type_enums import TimeGranularity
from dbt_semantic_interfaces.type_enums.date_part import DatePart


@@ -65,21 +64,13 @@ def __post_init__(self) -> None: # noqa: D105
raise InvalidQuerySyntax("The entity path should not be specified for a metric.")
if len(structured_item_name.entity_links) > 0:
raise InvalidQuerySyntax("The name of the metric should not have entity links.")
- # Check that dimensions / time dimensions have a valid time granularity / date part.
+ # Check that dimensions / time dimensions have a valid date part.
elif item_type is QueryItemType.DIMENSION or item_type is QueryItemType.TIME_DIMENSION:
- if self.time_granularity_name is not None:
- valid_time_granularity_names = set(time_granularity.value for time_granularity in TimeGranularity)
- if self.time_granularity_name.lower() not in valid_time_granularity_names:
- raise InvalidQuerySyntax(
- f"{self.time_granularity_name!r} is not a valid time granularity. Valid values are"
- f" {valid_time_granularity_names}"
- )
Comment on lines -73 to -76
Contributor:

I am not very familiar with custom time granularities. However, it seems like we could still validate that the time granularity used at query time matches a custom one defined by the analytics engineer, right?

Contributor:

I would love to do this too. Is it too complex based on the need to pass around the manifest lookup, @tlento?

Collaborator (Author):

I would also like validation here. I dug into this further and the tl;dr is:

  1. This is sufficiently difficult to do from within DSI that I don't think it's worth it.
  2. I might be wrong, so if either of you have suggestions for an approach that I'm missing please let me know!

Details below.....

This class stack is complicated in a way that makes this look very self-contained when it's actually exposed directly underneath the external protocol interface for the WhereFilter objects themselves. Specifically, this class is initialized in the ObjectBuildingRenderingHelper, which is initialized in the ObjectBuilderTextProcessor, which is initialized in the WhereFilterParser.parse_item_descriptions method. That method is used in the call_parameter_sets property inside of the PydanticWhereFilter, which is the implementation of the WhereFilter protocol for the semantic layer spec. The WhereFilter protocol requires the call_parameter_sets property, and marks it as a read-only property, meaning the signature does not allow for inputs.

So the only way to add a mechanism for doing this check from within DSI is to either:

  1. Alter the WhereFilter protocol to incorporate a custom granularities attribute
  2. Update the interface for extracting call parameter sets to allow for passing a list of granularities into the method so we can thread it through to this point in the call stack

Option 1 won't allow us to do an effective validation check at this level without restructuring our entire parsing process - we'd basically need to do a parsing pass to extract the custom granularity set and then do a second pass where we make use of that list. This is likely incompatible with the parsing implementation in dbt-core, although I admit I have not looked.

Option 2 is a breaking change because we'd need to change the WhereFilter and WhereFilterIntersection protocols. We could remove the call_parameter_sets property and have some external component derive that set of objects. This is awkward since the external component would need to implicitly depend on every WhereFilter's string where_sql_template property having the same formatting. We could also change the call_parameter_sets property to a method that accepts an optional list of custom time granularities. That affects every call site against call_parameter_sets and effectively requires that we thread a list of granularity names through each time we access it. This is theoretically possible but it might become quite ugly.

Is it worth all of the effort to alter the call_parameter_sets interface? Today, we basically parse where filters in two circumstances:

  1. During yaml parse time, where we only really need to use them for direct extraction of element references and validation.
  2. At query time, when we generally have access to MetricFlow (or, at least, to a SemanticManifestLookup) and can do more robust processing around the call_parameter_sets we extract to ensure custom granularities are properly specified. We can do this in terms of both the name reference and compatibility with requested metric_time grains and whatever else, which is a validation we have to do at query time anyway.

This means, effectively, that we can do nearly as good a job of validation in post-processing without changing the interface. The drawback is we cannot accept a where filter reference of the form Dimension('shuttle__launch_time__martian_day') but I don't know if we really allow where filters like Dimension('shuttle_launch_time__day') today anyway. I always thought we'd throw an error somewhere or other in that case, but it may be that it works. 🤷

Assuming I'm not missing anything here I'm inclined to keep this PR as-is - it's just not clear to me that the added validation on what is ultimately an arbitrary string name is worth the weight it'd add to the interface. But then I might be missing something, so please let me know if either of you have got any ideas!
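For illustration, "Option 2" above would look roughly like the following sketch. This is a hypothetical shape for the protocol change being weighed, not the actual WhereFilter protocol; the method signature and return type are assumptions:

```python
from typing import Protocol, Sequence, Tuple


class WhereFilter(Protocol):
    """Hypothetical revision in which call_parameter_sets becomes a method.

    Today call_parameter_sets is a read-only property, so there is no way to
    pass a set of custom granularity names into the parse. Making it a method
    threads that list through, but breaks every existing call site.
    """

    where_sql_template: str

    def call_parameter_sets(
        self, custom_granularity_names: Sequence[str] = ()
    ) -> Tuple[object, ...]:
        """Parse the template, treating the given names as valid custom grains."""
        ...
```

The cost the comment describes falls out directly: every caller that reads the property today would have to become a method call that supplies (or defaults) the granularity list.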

Collaborator (Author):

Oh I guess maybe we do allow Dimension('entity__dimension__granularity') since we rely on the call parameter sets in most places. This is very confusing, but I'm comfortable saying that only works with standard granularities for now, and we should really get rid of all this jinja stuff in the medium to long term.

Contributor:

Ok I'm fine with not validating for now and merging as-is, but will put up a task for me to investigate this later and see if I can come up with a non-breaking solution.


if self.date_part_name is not None:
valid_date_part_names = set(date_part.value for date_part in DatePart)
if self.date_part_name.lower() not in set(date_part.value for date_part in DatePart):
raise InvalidQuerySyntax(
- f"{self.time_granularity_name!r} is not a valid time granularity. Valid values are"
+ f"{self.date_part_name!r} is not a valid date part. Valid values are"
f" {valid_date_part_names}"
)
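The date-part check retained above follows a simple enum-membership pattern. Here is a self-contained sketch of that pattern; the enum values mirror what appears in the diff, and `validate_date_part` is an illustrative helper name rather than a function in the module:

```python
from enum import Enum


class DatePart(Enum):
    YEAR = "year"
    QUARTER = "quarter"
    MONTH = "month"
    DAY = "day"


class InvalidQuerySyntax(Exception):
    pass


def validate_date_part(date_part_name: str) -> None:
    """Raise if the (case-insensitive) name is not a known date part."""
    valid_date_part_names = {date_part.value for date_part in DatePart}
    if date_part_name.lower() not in valid_date_part_names:
        raise InvalidQuerySyntax(
            f"{date_part_name!r} is not a valid date part. Valid values are"
            f" {valid_date_part_names}"
        )
```

The same membership test used to exist for granularities; it was removed above precisely because custom grains make the set of valid names open-ended.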

@@ -16,12 +16,19 @@
MetricReference,
TimeDimensionReference,
)
from dbt_semantic_interfaces.type_enums import TimeGranularity
from dbt_semantic_interfaces.type_enums.date_part import DatePart


class ParameterSetFactory:
- """Creates parameter sets for use in the Jinja sandbox."""
+ """Creates parameter sets for use in the Jinja sandbox.

This is ONLY to be used for parsing where filters, and then only for the purposes of extracting
Contributor:

Is this correct? We also parse the jinja in 1) the JDBC and 2) saved query group by inputs. I believe this is used for at least the saved query group by inputs for some parsing in core, maybe related to node dependencies. The JDBC might use the MF logic, but I'm not positive.

Collaborator (Author):

All of those callsites have to be updated in a custom granularity world because this component asserted that we could always reliably determine whether or not <arbitrary_string> refers to a granularity, and that isn't the case anymore. I'll update the docstring to be less prescriptive and more detailed in terms of where this might go wrong, as I think that's more useful in general.

some basic information about which elements are being accessed in the filter expression in the
small number of contexts where a more complete semantic layer implementation is not available.

In practice, today, this is used by the dbt core parser, which cannot take on a MetricFlow
dependency, in order to provide some DAG annotations around elements referenced in where filters.
"""

@staticmethod
def _exception_message_for_incorrect_format(element_name: str) -> str:
Expand All @@ -37,26 +44,46 @@ def create_time_dimension(
entity_path: Sequence[str] = (),
date_part_name: Optional[str] = None,
) -> TimeDimensionCallParameterSet:
- """Gets called by Jinja when rendering {{ TimeDimension(...) }}."""
+ """Gets called by Jinja when rendering {{ TimeDimension(...) }}.

There is a lot of strangeness around the time granularity specification here. Historically,
we accepted time dimension names of the form `metric_time__week` or `customer__registration_date__month`
in this interface. We have not yet fully deprecated this, and it's unclear if we ever will.

Key points to note:
1. The time dimension name parsing only accepts standard time granularities. This will not change.
2. The time granularity parameter is what we want everybody to use because it's more explicit.
3. The time granularity parameter will support custom granularities, so that's nice.

While this all may seem pretty bad, it's not as terrible as all that - this class is only used
for parsing where filters. When we solve the problems with our current where filter spec, this will
persist as a backwards-compatibility model, but nothing more.
"""
group_by_item_name = DunderedNameFormatter.parse_name(time_dimension_name)
if len(group_by_item_name.entity_links) != 1 and not is_metric_time_name(group_by_item_name.element_name):
raise ParseWhereFilterException(
ParameterSetFactory._exception_message_for_incorrect_format(time_dimension_name)
)
- grain_parsed_from_name = group_by_item_name.time_granularity
- grain_from_param = TimeGranularity(time_granularity_name) if time_granularity_name else None
+ grain_parsed_from_name = (
+     group_by_item_name.time_granularity.value if group_by_item_name.time_granularity else None
+ )
+ grain_from_param = time_granularity_name
Contributor:

Is this line necessary? Why not just use time_granularity_name instead of using a different variable with the same value?

Collaborator (Author):

It is not necessary but I found this more readable due to the assertion comparison -> coalesce -> type refinement flow we have to go through here.

I have an idea for us to get the best of both worlds, I'll update separately.


if grain_parsed_from_name and grain_from_param and grain_from_param != grain_parsed_from_name:
raise ParseWhereFilterException(
f"Received different grains in `time_dimension_name` parameter ('{time_dimension_name}') "
- f"and `time_granularity_name` parameter ('{time_granularity_name}')."
+ f"and `time_granularity_name` parameter ('{time_granularity_name}'). Remove the grain suffix "
+ f"(`{grain_parsed_from_name}`) from the time dimension name and use the `time_granularity_name` "
+ "parameter to specify the intended grain."
)

time_granularity_name = grain_parsed_from_name or grain_from_param

return TimeDimensionCallParameterSet(
time_dimension_reference=TimeDimensionReference(element_name=group_by_item_name.element_name),
entity_path=(
tuple(EntityReference(element_name=arg) for arg in entity_path) + group_by_item_name.entity_links
),
- time_granularity=grain_parsed_from_name or grain_from_param,
+ time_granularity_name=time_granularity_name.lower() if time_granularity_name else None,
date_part=DatePart(date_part_name.lower()) if date_part_name else None,
)
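The grain-resolution flow in create_time_dimension (compare raw values, coalesce, then lowercase) can be distilled into a small sketch. `resolve_grain` is an illustrative helper name, not a function in the module, and the error message is abbreviated:

```python
from typing import Optional


class ParseWhereFilterException(Exception):
    pass


def resolve_grain(
    grain_parsed_from_name: Optional[str], grain_from_param: Optional[str]
) -> Optional[str]:
    """Reconcile a grain parsed from a dundered name with an explicit parameter.

    If both are present they must match exactly; otherwise whichever one is
    set wins, and the result is normalized to lowercase.
    """
    if grain_parsed_from_name and grain_from_param and grain_from_param != grain_parsed_from_name:
        raise ParseWhereFilterException(
            "Received different grains in the time dimension name and the "
            "time_granularity_name parameter."
        )
    resolved = grain_parsed_from_name or grain_from_param
    return resolved.lower() if resolved else None
```

Because only the explicit parameter path accepts custom grains, `resolve_grain(None, "martian_year")` succeeds, while a custom grain glued onto the name never reaches this point.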

12 changes: 6 additions & 6 deletions tests/implementations/where_filter/test_parse_calls.py
@@ -69,7 +69,7 @@ def test_extract_dimension_with_grain_call_parameter_sets() -> None: # noqa: D
TimeDimensionCallParameterSet(
entity_path=(),
time_dimension_reference=TimeDimensionReference(element_name="metric_time"),
- time_granularity=TimeGranularity.WEEK,
+ time_granularity_name=TimeGranularity.WEEK.value,
),
),
entity_call_parameter_sets=(),
@@ -91,7 +91,7 @@ def test_extract_time_dimension_call_parameter_sets() -> None: # noqa: D
EntityReference("listing"),
EntityReference("user"),
),
- time_granularity=TimeGranularity.MONTH,
+ time_granularity_name=TimeGranularity.MONTH.value,
),
)
)
@@ -110,7 +110,7 @@ def test_extract_time_dimension_call_parameter_sets() -> None: # noqa: D
EntityReference("listing"),
EntityReference("user"),
),
- time_granularity=TimeGranularity.MONTH,
+ time_granularity_name=TimeGranularity.MONTH.value,
),
)
)
@@ -126,7 +126,7 @@ def test_extract_metric_time_dimension_call_parameter_sets() -> None: # noqa: D
TimeDimensionCallParameterSet(
time_dimension_reference=TimeDimensionReference(element_name="metric_time"),
entity_path=(),
- time_granularity=TimeGranularity.MONTH,
+ time_granularity_name=TimeGranularity.MONTH.value,
),
)
)
@@ -216,7 +216,7 @@ def test_where_filter_interesection_extract_call_parameter_sets() -> None:
TimeDimensionCallParameterSet(
time_dimension_reference=TimeDimensionReference(element_name="metric_time"),
entity_path=(),
- time_granularity=TimeGranularity.MONTH,
+ time_granularity_name=TimeGranularity.MONTH.value,
),
)
)
@@ -269,7 +269,7 @@ def test_time_dimension_without_granularity() -> None: # noqa: D
TimeDimensionCallParameterSet(
entity_path=(EntityReference("booking"),),
time_dimension_reference=TimeDimensionReference(element_name="created_at"),
- time_granularity=None,
+ time_granularity_name=None,
),
),
entity_call_parameter_sets=(),
4 changes: 2 additions & 2 deletions tests/parsing/test_object_builder_item_description.py
@@ -19,9 +19,11 @@ def test_valid_object_builder_items() -> None: # noqa: D
valid_items = (
"Dimension('listing__created_at', entity_path=['host'])",
"Dimension('listing__created_at', entity_path=['host']).grain('day').date_part('day')",
+ "Dimension('metric_time').grain('martian_year')",
"TimeDimension('listing__created_at', time_granularity_name='day', entity_path=['host'], date_part_name='day')",
"Entity('listing__created_at', entity_path=['host'])",
"Metric('bookings', group_by=['listing__created_at'])",
+ "TimeDimension('metric_time', time_granularity_name='martian_year')",
)
for valid_item in valid_items:
logger.info(f"Checking {valid_item=}")
@@ -32,9 +34,7 @@ def test_invalid_object_builder_items() -> None: # noqa: D
text_processor = ObjectBuilderTextProcessor()

invalid_items = (
- "Dimension('listing__created_at').grain('invalid')",
"Dimension('listing__created_at').date_part('invalid')",
- "TimeDimension('listing__created_at', 'invalid', 'day')",
"TimeDimension('listing__created_at', 'day', date_part_name='invalid')",
"TimeDimension('listing__created_at', 'day', date_part_name='month').grain('month')",
"TimeDimension('listing__created_at', 'day', date_part_name='month').date_part('month')",
4 changes: 2 additions & 2 deletions tests/parsing/test_where_filter_parsing.py
@@ -185,15 +185,15 @@ def test_dimension_date_part() -> None: # noqa
TimeDimensionCallParameterSet(
time_dimension_reference=TimeDimensionReference("metric_time"),
entity_path=(),
- time_granularity=TimeGranularity.WEEK,
+ time_granularity_name=TimeGranularity.WEEK.value,
),
),
(
"{{ TimeDimension('metric_time', time_granularity_name='week') }} > '2023-01-01'",
TimeDimensionCallParameterSet(
time_dimension_reference=TimeDimensionReference("metric_time"),
entity_path=(),
- time_granularity=TimeGranularity.WEEK,
+ time_granularity_name=TimeGranularity.WEEK.value,
),
),
],