Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015
Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015
Changes from 4 commits
f72eda0
91d4797
2166d6c
c3edb12
eed59b4
20133fc
f8975e2
320cd90
d729fdf
966c902
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the behaviour in the following situation:
Rule#1 - Covers the interval 2020-2021
Rule#2 - Covers the interval 2021-2022
Rule#3 - Covers the interval 2020-2022
In this case
validateRules
would not throw an exception, however, from my understanding, Rule#3 would never trigger.If that's the case, we should handle these cases as well (probably by maintaining a set of intervals that the rules seen so far cover and checking if the next rule isn't completely overshadowed by the interval set seen so far)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a nice check to have as well. Let me think about it more and adjust as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LakshSingla, actually, Druid currently does a fully contains interval check for drop rules, so rule #3 is valid with
dropByInterval
. So we can't really guarantee the overlapping condition in a type agnostic way (which apply to load rules, so we would probably need a separate contract)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, I didn't get why that is the case. Rule#1 spans from [2020-2021). rule #2 spans from [2021-2022). Therefore, collectively, rule#1 and rule#2 span from [2020-2022). Is there any instant in the timeline that would cause rule#3 to fire without triggering either rule#1 or rule#2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, rule #3 is valid for the time interval if it existed in isolation. However, when present along with other rules, the precedence of those takes place, causing rule#3 to be effectively useless, which is what this check exists to detect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. Rule #3 would never be applied on any segment irrespective of the rule types.
While evaluating retention rules, as soon as a rule applies to a segment, we break and don't look at the other rules.
Also, in general, I am not entirely sure of the validation logic here.
fully covered
check is done correctly.A [4, 10)
andB [15, 20)
and a new intervalC [8, 17)
comes in. Adding this new interval C should merge all of A, B and C into one interval[4, 17)
.ForeverXyzRule
orIntervalXyzRule
(rule with fixed interval). ForPeriodXyzRule
, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute theeligibleInterval
at both start and end of thecurrInterval
but I need to give it some more thought.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's consider a segment with an interval
2020/2022
and the following drop rules in order:dropByInterval('2020/2021')
dropByInterval('2021/2022')
dropByInterval('2020/2022')
For the segment in question, only rule 3 will match as the complete interval is contained in the drop interval.
On the other hand, if we consider
loadByInterval
rules for the same intervals, you're right that all the rules will apply for the segment as load rules will fire as long as there is at least an overlap.Hope that clarifies why we cannot detect the overlapping case without baking in some type awareness, but let me know if I am missing something!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the incorrect assumption I made in my original example. This digression occurs from the semantics of how intervals are applied in load rules v/s in drop rules. In load rules, we only need for partial overlap for the rule to fire, while in the drop rule, we need a complete overlap for the rule to fire
However when we are doing a
contains
check in this logic, we are inherently assuming something about the structure of the rules.Consider the following case:
The system has 2 types of segments, one ranging from 2020-2021 and the other from 2020-2022. If the user wants to drop the segments ranging from 2020-2021 while preserving the other segments, how does the user craft intuitive rules for the same without causing the newly added check to fail? I only came up with the following, however, that fails the check.
I think this ties back to the common use case that we are trying to prevent the user from doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re the use case above - I'd write the same rules as you've listed in the example. The first rule's interval 2020/2021 does not contain the second rule's interval 2020/2022. So, I'm not sure why the validation check would flag this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kfaraz - please see my responses to your comments:
To check if an interval fully contains another, the proposed algorithm has a nested loop that goes through the sequence of rules to find if a larger interval contains another subsequent interval down in the list; O(n^2) algorithmic runtime, but practical runtime should be trivial though.
I suppose there are alternate approaches like condensing intervals - the additive and subtractive logic you're describing. My first thought is that this could work well for the overlapping case (which isn't the intent of this patch).
Re this:
The boundary check primarily exists to allow the case where a rule will fire at some point. Unlike the forever and interval rules, period rules factor in the reference timestamp thereby yielding distinct intervals at the boundaries. Something like:
PeriodRuleR3.eligibleInterval(now) = IntervalR3
PeriodRuleR3.eligibleInterval(1yAgo) = IntervalR4
The boundaries check is essentially to verify that a legit rule will fire at some point and allow such a rule. For example, consider the rules [
loadByPeriod(P10Y)
,loadByInterval('2020/2025')
]:I realize I need to update the javadocs. :)