Enable custom granularity inputs for filters #346
Conversation
We are enabling support for custom granularity, which means users will need to be able to reference them in where filters and other contexts where their input representations of their manifest elements will need to be parsed. Historically, we have restricted time granularities to an enumerated set of standard supported grains. With the addition of custom granularities we must now admit the possibility of arbitrary string names representing valid time granularities. This change updates the interfaces for parsing time dimensions out of filter expressions to allow for arbitrary string inputs representing custom granularity names. For various reasons - mainly having to do with the filter parser not having access to the semantic manifest - we will NOT be able to support custom granularity references via dundered string names. So something like `metric_time__martian_year` will not work, but the explicit syntax of `Dimension('metric_time').grain('martian_year')` will be accepted and processed.
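The distinction the PR draws can be illustrated with a minimal sketch (all names hypothetical, and the grain set reduced for brevity): without the semantic manifest, a parser can only recognize a trailing dunder segment as a grain if it belongs to the fixed standard set, whereas the explicit `.grain(...)` syntax is unambiguous for any string.

```python
from typing import Optional

# Illustrative subset of the standard grains; the real enum lives in DSI.
STANDARD_GRAINS = {"day", "week", "month", "quarter", "year"}


def grain_from_dundered_name(name: str) -> Optional[str]:
    """Treat the last dunder segment as a grain only if it is a standard one.

    Without manifest access, the parser cannot tell whether an arbitrary
    trailing segment like 'martian_year' is a custom grain or just part of
    the dimension name, so it must fall back to None.
    """
    parts = name.split("__")
    if len(parts) > 1 and parts[-1] in STANDARD_GRAINS:
        return parts[-1]
    return None


class Dimension:
    """Toy stand-in for the object-builder syntax: an explicit grain is unambiguous."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.granularity: Optional[str] = None

    def grain(self, granularity: str) -> "Dimension":
        self.granularity = granularity  # any string is accepted here
        return self
```

So `grain_from_dundered_name("metric_time__martian_year")` yields `None`, while `Dimension("metric_time").grain("martian_year")` carries the custom grain through explicitly.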
```python
raise InvalidQuerySyntax(
    f"{self.time_granularity_name!r} is not a valid time granularity. Valid values are"
    f" {valid_time_granularity_names}"
)
```
I am not very familiar with custom time granularities. However, it seems like we could still validate that the time granularity used at query time matches a custom one defined by the analytics engineer, right?
I would love to do this too, is it too complex based on the need to pass around the manifest lookup @tlento?
I would also like validation here. I dug into this further and the tl;dr is:
- This is sufficiently difficult to do from within DSI that I don't think it's worth it.
- I might be wrong, so if either of you have suggestions for an approach that I'm missing please let me know!
Details below.....
This class stack is complicated in a way that makes this look very self-contained when it's actually exposed directly underneath the external protocol interface for the `WhereFilter` objects themselves. Specifically, this class is initialized in the `ObjectBuildingRenderingHelper`, which is initialized in the `ObjectBuilderTextProcessor`, which is initialized in the `WhereFilterParser.parse_item_descriptions` method. That method is used in the `call_parameter_sets` property inside of the `PydanticWhereFilter`, which is the implementation of the `WhereFilter` protocol for the semantic layer spec. The `WhereFilter` protocol requires the `call_parameter_sets` property and marks it as read-only, meaning the signature does not allow for inputs.
So the only way to add a mechanism for doing this check from within DSI is to either:
- Alter the WhereFilter protocol to incorporate a custom granularities attribute
- Update the interface for extracting call parameter sets to allow for passing a list of granularities into the method so we can thread it through to this point in the call stack
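Option 2 could be sketched roughly as follows (a toy stand-in, not the real DSI classes: the class name, the `grains_in_template` constructor, and the signature are all illustrative assumptions). The key change is that `call_parameter_sets` becomes a method so callers can thread the custom grain names through to the validation point.

```python
from typing import Optional, Sequence, Tuple

# Illustrative subset of the standard grains; the real enum lives in DSI.
STANDARD_GRAINS = ("day", "week", "month", "quarter", "year")


class PydanticWhereFilterSketch:
    """Toy stand-in: validates grain names found in a filter template."""

    def __init__(self, grains_in_template: Sequence[str]) -> None:
        self._grains = tuple(grains_in_template)

    def call_parameter_sets(
        self, custom_granularity_names: Optional[Sequence[str]] = None
    ) -> Tuple[str, ...]:
        # With the custom grains passed in, validation is possible at parse time.
        valid = set(STANDARD_GRAINS) | set(custom_granularity_names or ())
        unknown = [g for g in self._grains if g not in valid]
        if unknown:
            raise ValueError(f"{unknown} are not valid time granularities")
        return self._grains
```

The cost, as noted below, is that every call site against `call_parameter_sets` now has to supply that list.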
Option 1 won't allow us to do an effective validation check at this level without restructuring our entire parsing process - we'd basically need to do a parsing pass to extract the custom granularity set and then do a second pass where we make use of that list. This is likely incompatible with the parsing implementation in dbt-core, although I admit I have not looked.
Option 2 is a breaking change because we'd need to change the `WhereFilter` and `WhereFilterIntersection` protocols. We could remove the `call_parameter_sets` property and have some external component derive that set of objects. This is awkward since the external component would need to implicitly depend on every WhereFilter's string `where_sql_template` property having the same formatting. We could also change the `call_parameter_sets` property to a method that accepts an optional list of custom time granularities. That affects every call site against `call_parameter_sets` and effectively requires that we thread a list of granularity names through each time we access it. This is theoretically possible but it might become quite ugly.
Is it worth all of the effort to alter the call_parameter_sets interface? Today, we basically parse where filters in two circumstances:
- During yaml parse time, where we only really need to use them for direct extraction of element references and validation.
- At query time, when we generally have access to MetricFlow (or, at least, to a SemanticManifestLookup) and can do more robust processing around the call_parameter_sets we extract to ensure custom granularities are properly specified. We can do this in terms of both the name reference and compatibility with requested metric_time grains and whatever else, which is a validation we have to do at query time anyway.
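The query-time post-processing check described in the second bullet can be sketched like this (hypothetical function and parameter names): a component that does have manifest access takes the grain names extracted from the parameter sets and validates them against the custom grains defined in the manifest.

```python
from typing import Iterable, Set

# Illustrative subset of the standard grains; the real enum lives in DSI.
STANDARD_GRAINS = {"day", "week", "month", "quarter", "year"}


def validate_extracted_grains(
    extracted_grains: Iterable[str], custom_grains_from_manifest: Set[str]
) -> None:
    """Reject any grain name that is neither standard nor defined in the manifest.

    This runs after parsing, so it has access to the SemanticManifestLookup
    (or equivalent) that the filter parser itself lacks.
    """
    for grain in extracted_grains:
        if grain not in STANDARD_GRAINS and grain not in custom_grains_from_manifest:
            raise ValueError(
                f"{grain!r} is not a standard or custom time granularity"
            )
```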
This means, effectively, that we can do nearly as good a job of validation in post-processing without changing the interface. The drawback is we cannot accept a where filter reference of the form `Dimension('shuttle__launch_time__martian_day')`, but I don't know if we really allow where filters like `Dimension('shuttle_launch_time__day')` today anyway. I always thought we'd throw an error somewhere or other in that case, but it may be that it works. 🤷
Assuming I'm not missing anything here, I'm inclined to keep this PR as-is - it's just not clear to me that the added validation on what is ultimately an arbitrary string name is worth the weight it'd add to the interface. But then I might be missing something, so please let me know if either of you has any ideas!
Oh, I guess maybe we do allow `Dimension('entity__dimension__granularity')` since we rely on the call parameter sets in most places. This is very confusing, but I'm comfortable saying that only works with standard granularities for now, and we should really get rid of all this Jinja stuff in the medium to long term.
Ok I'm fine with not validating for now and merging as-is, but will put up a task for me to investigate this later and see if I can come up with a non-breaking solution.
```python
grain_parsed_from_name = (
    group_by_item_name.time_granularity.value if group_by_item_name.time_granularity else None
)
grain_from_param = time_granularity_name
```
Is this line necessary? Why not just use `time_granularity_name` instead of using a different variable with the same value?
It is not necessary, but I found this more readable given the assertion comparison -> coalesce -> type refinement flow we have to go through here.
I have an idea for us to get the best of both worlds, I'll update separately.
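That flow might look something like the following sketch (hypothetical function name and structure, just to show why the alias aids readability): the two sources are compared when both are set, then coalesced into a single value.

```python
from typing import Optional


def resolve_grain(
    grain_parsed_from_name: Optional[str], grain_from_param: Optional[str]
) -> Optional[str]:
    """Compare, coalesce, and refine the two possible grain sources."""
    # Assertion comparison: if both sources are set, they must agree.
    if grain_parsed_from_name and grain_from_param:
        assert grain_parsed_from_name == grain_from_param, (
            f"Conflicting grains: {grain_parsed_from_name!r} vs {grain_from_param!r}"
        )
    # Coalesce: take whichever source is set (or None if neither is).
    return grain_parsed_from_name or grain_from_param
```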
```python
grain_parsed_from_name = (
    group_by_item_name.time_granularity.value if group_by_item_name.time_granularity else None
)
grain_from_param = time_granularity_name
```
I might not have been clear about this in discussion, but I think it would be a much better user experience if we actually could pass the granularity name via dundered syntax. We would need to pass the manifest lookup around, which is annoying. But it should be possible since we're enforcing name uniqueness, right?
"""Creates parameter sets for use in the Jinja sandbox.""" | ||
"""Creates parameter sets for use in the Jinja sandbox. | ||
|
||
This is ONLY to be used for parsing where filters, and then only for the purposes of extracting |
Is this correct? We also parse the Jinja in 1) the JDBC API and 2) saved query group-by inputs. I believe this is used for at least the saved query group-by inputs for some parsing in core, maybe related to node dependencies. The JDBC might use the MF logic, but I'm not positive.
All of those callsites have to be updated in a custom granularity world because this component asserted that we could always reliably determine whether or not <arbitrary_string> refers to a granularity, and that isn't the case anymore. I'll update the docstring to be less prescriptive and more detailed in terms of where this might go wrong, as I think that's more useful in general.
LGTM!
@tlento - re-raising my comment here. I would like for us to be able to pass custom grains via dundered syntax, but is that limited by the same issues raised in the other thread on this PR?
Mentioned some follow ups I want to do but not blocking!
It's the same set of issues, and I think "limited" is the correct term - it's not impossible in all cases, but it's very difficult to do in a reasonable way from within this component inside DSI. We can support the dundered syntax for query parameters today - even where filters - we'd just need to make sure whatever parses them does a conversion from dundered input to what the DSI parser supports. The hard part is the manifest - filter parameters stored in the YAML can show up in any metric, and that means we'd need to have the list of custom grains before we try to do any parsing unless we change the interface requirements and remove all invocations of
I'd avoid the manifest lookup. All we need is the set of custom granularity names. I suggest pushing a breaking change to this API and changing `call_parameter_sets` from a property to a function and passing the set of custom granularity constructs in at that point. This only works if none of our YAML parsers use `call_parameter_sets` in its property format, which I suspect is the case but I'm not 100% sure. I'd use the equivalent of the `ExpandedTimeGranularity` struct instead of string names, because adding that is potentially useful to us in other ways.

One issue with the breaking change: dbt-core 1.9.x is going to roll to beta soon. I'm not sure if it's worth the risk to try to jam this in right before those cutoffs. Worst case we can do a 2-stage shift, where we keep the property but have it only work for standard grains, add a new function to the protocol, cut everything calling it over to the function, and then remove the property. That defers the "breaking change" and allows us to add the new functionality in cloud even if we miss the core 1.9 cut. Source-available users of MetricFlow won't be able to use custom calendar with the dunder syntax until 1.10, but they'll still be able to use the feature as long as they always specify the grain explicitly instead of relying on dunder magic.
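The 2-stage shift could be sketched roughly like this (a toy stand-in; the class name, the `grains_in_template` constructor, and the new method name `get_call_parameter_sets` are all illustrative assumptions, not the real protocol):

```python
from typing import Sequence, Tuple

# Illustrative subset of the standard grains; the real enum lives in DSI.
STANDARD_GRAINS = {"day", "week", "month", "quarter", "year"}


class WhereFilterSketch:
    """Toy stand-in showing the staged property-to-method migration."""

    def __init__(self, grains_in_template: Sequence[str]) -> None:
        self._grains = tuple(grains_in_template)

    @property
    def call_parameter_sets(self) -> Tuple[str, ...]:
        # Stage 1: the legacy property stays, but only standard grains resolve.
        return self.get_call_parameter_sets(custom_granularity_names=())

    def get_call_parameter_sets(
        self, custom_granularity_names: Sequence[str]
    ) -> Tuple[str, ...]:
        # Stage 2: callers migrate to this method and pass custom grains in;
        # the property above is removed in a later release.
        valid = STANDARD_GRAINS | set(custom_granularity_names)
        unknown = [g for g in self._grains if g not in valid]
        if unknown:
            raise ValueError(f"{unknown} are not valid time granularities")
        return self._grains
```

Under this staging, existing property callers keep working for standard grains while migrated callers gain custom-grain support, which is what defers the breaking change.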
Resolves #345