Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015

Closed
wants to merge 10 commits into from
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.druid.java.util.common.Intervals;
import org.apache.druid.timeline.DataSegment;
import org.joda.time.DateTime;
import org.joda.time.Interval;
Expand Down Expand Up @@ -56,6 +57,12 @@ public boolean appliesTo(Interval interval, DateTime referenceTimestamp)
return true;
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return Intervals.ETERNITY;
}

@Override
public boolean equals(Object o)
{
Expand All @@ -75,4 +82,10 @@ public int hashCode()
{
return Objects.hash(getType());
}

@Override
public String toString()
{
return "ForeverBroadcastDistributionRule{}";
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
package org.apache.druid.server.coordinator.rules;

import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.druid.java.util.common.Intervals;
import org.apache.druid.timeline.DataSegment;
import org.joda.time.DateTime;
import org.joda.time.Interval;
Expand All @@ -46,4 +47,16 @@ public boolean appliesTo(Interval interval, DateTime referenceTimestamp)
{
return true;
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return Intervals.ETERNITY;
}

@Override
public String toString()
{
return "ForeverDropRule{}";
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.druid.java.util.common.Intervals;
import org.apache.druid.timeline.DataSegment;
import org.joda.time.DateTime;
import org.joda.time.Interval;
Expand Down Expand Up @@ -60,4 +61,18 @@ public boolean appliesTo(Interval interval, DateTime referenceTimestamp)
return true;
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return Intervals.ETERNITY;
}

@Override
public String toString()
{
return "ForeverLoadRule{" +
"tieredReplicants=" + getTieredReplicants() +
", useDefaultTierForNull=" + useDefaultTierForNull() +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,12 @@ public Interval getInterval()
return interval;
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return interval;
}

@Override
public boolean equals(Object o)
{
Expand All @@ -83,4 +89,12 @@ public int hashCode()
{
return Objects.hash(getInterval());
}

@Override
public String toString()
{
return "IntervalBroadcastDistributionRule{" +
"interval=" + interval +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,12 @@ public boolean appliesTo(Interval theInterval, DateTime referenceTimestamp)
return interval.contains(theInterval);
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return interval;
}

@Override
public boolean equals(Object o)
{
Expand All @@ -84,4 +90,12 @@ public int hashCode()
{
return Objects.hash(interval);
}

@Override
public String toString()
{
return "IntervalDropRule{" +
"interval=" + interval +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.druid.java.util.common.logger.Logger;
import org.apache.druid.timeline.DataSegment;
import org.joda.time.DateTime;
import org.joda.time.Interval;
Expand All @@ -34,8 +33,6 @@
*/
public class IntervalLoadRule extends LoadRule
{
private static final Logger log = new Logger(IntervalLoadRule.class);

private final Interval interval;

@JsonCreator
Expand Down Expand Up @@ -74,6 +71,12 @@ public boolean appliesTo(Interval theInterval, DateTime referenceTimestamp)
return Rules.eligibleForLoad(interval, theInterval);
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return interval;
}

@Override
public boolean equals(Object o)
{
Expand All @@ -95,4 +98,14 @@ public int hashCode()
{
return Objects.hash(super.hashCode(), interval);
}

@Override
public String toString()
{
return "IntervalLoadRule{" +
"interval=" + interval +
", tieredReplicants=" + getTieredReplicants() +
", useDefaultTierForNull=" + useDefaultTierForNull() +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,13 @@ public boolean appliesTo(Interval interval, DateTime referenceTimestamp)
return Rules.eligibleForLoad(period, interval, referenceTimestamp, includeFuture);
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return includeFuture ? new Interval(referenceTimestamp.minus(period), referenceTimestamp.plus(period))
abhishekrb19 marked this conversation as resolved.
Show resolved Hide resolved
: new Interval(referenceTimestamp.minus(period), referenceTimestamp);
}

@JsonProperty
public Period getPeriod()
{
Expand Down Expand Up @@ -96,4 +103,13 @@ public int hashCode()
{
return Objects.hash(getPeriod(), isIncludeFuture());
}

@Override
public String toString()
{
return "PeriodBroadcastDistributionRule{" +
"period=" + period +
", includeFuture=" + includeFuture +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.druid.java.util.common.DateTimes;
import org.apache.druid.timeline.DataSegment;
import org.joda.time.DateTime;
import org.joda.time.Interval;
Expand Down Expand Up @@ -63,4 +64,18 @@ public boolean appliesTo(Interval theInterval, DateTime referenceTimestamp)
final DateTime periodAgo = referenceTimestamp.minus(period);
return theInterval.getEndMillis() <= periodAgo.getMillis();
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return new Interval(DateTimes.utc(Long.MIN_VALUE), referenceTimestamp.minus(period));
abhishekrb19 marked this conversation as resolved.
Show resolved Hide resolved
}

@Override
public String toString()
{
return "PeriodDropBeforeRule{" +
"period=" + period +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
import org.joda.time.Period;

/**
*
*/
public class PeriodDropRule extends DropRule
{
Expand Down Expand Up @@ -80,4 +81,20 @@ public boolean appliesTo(Interval theInterval, DateTime referenceTimestamp)
return currInterval.contains(theInterval);
}
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return includeFuture ? new Interval(referenceTimestamp.minus(period), referenceTimestamp.plus(period))
abhishekrb19 marked this conversation as resolved.
Show resolved Hide resolved
: new Interval(referenceTimestamp.minus(period), referenceTimestamp);
}

@Override
public String toString()
{
return "PeriodDropRule{" +
"period=" + period +
", includeFuture=" + includeFuture +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import org.apache.druid.java.util.common.logger.Logger;
import org.apache.druid.timeline.DataSegment;
import org.joda.time.DateTime;
import org.joda.time.Interval;
Expand All @@ -35,7 +34,6 @@
*/
public class PeriodLoadRule extends LoadRule
{
private static final Logger log = new Logger(PeriodLoadRule.class);
static final boolean DEFAULT_INCLUDE_FUTURE = true;

private final Period period;
Expand Down Expand Up @@ -85,6 +83,13 @@ public boolean appliesTo(Interval interval, DateTime referenceTimestamp)
return Rules.eligibleForLoad(period, interval, referenceTimestamp, includeFuture);
}

@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return includeFuture ? new Interval(referenceTimestamp.minus(period), referenceTimestamp.plus(period))
: new Interval(referenceTimestamp.minus(period), referenceTimestamp);
}

@Override
public boolean equals(Object o)
{
Expand All @@ -106,4 +111,15 @@ public int hashCode()
{
return Objects.hash(super.hashCode(), period, includeFuture);
}

@Override
public String toString()
{
return "PeriodLoadRule{" +
"period=" + period +
", includeFuture=" + includeFuture +
", tieredReplicants=" + getTieredReplicants() +
", useDefaultTierForNull=" + useDefaultTierForNull() +
'}';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -50,4 +50,11 @@ public interface Rule
boolean appliesTo(Interval interval, DateTime referenceTimestamp);

void run(DataSegment segment, SegmentActionHandler segmentHandler);

/**
* Return an eligible interval from the reference timestamp. Implementations
* must return a valid interval based on the rule type.
* @param referenceTimestamp base timestamp
*/
abhishekrb19 marked this conversation as resolved.
Show resolved Hide resolved
Interval getEligibleInterval(DateTime referenceTimestamp);
}
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,15 @@

package org.apache.druid.server.coordinator.rules;

import org.apache.druid.error.InvalidInput;
import org.apache.druid.java.util.common.DateTimes;
import org.apache.druid.java.util.common.Intervals;
import org.joda.time.DateTime;
import org.joda.time.Interval;
import org.joda.time.Period;

import java.util.List;

public class Rules
{
public static boolean eligibleForLoad(Interval src, Interval target)
abhishekrb19 marked this conversation as resolved.
Show resolved Hide resolved
Expand All @@ -43,4 +48,40 @@ public static boolean eligibleForLoad(Period period, Interval interval, DateTime
private Rules()
{
}

/**
* Validate rules. This method throws an exception if a rule contain an interval that
* will fully cover the next rules' interval in the list. Rules that will be evaluated at some point
* are considered to be legitimate.
* @param rules Datasource rules.
*/
abhishekrb19 marked this conversation as resolved.
Show resolved Hide resolved
public static void validateRules(final List<Rule> rules)
{
if (rules == null) {
return;
}
final DateTime now = DateTimes.nowUtc();
for (int i = 0; i < rules.size() - 1; i++) {
final Rule currRule = rules.get(i);
final Rule nextRule = rules.get(i + 1);
final Interval currInterval = currRule.getEligibleInterval(now);
final Interval nextInterval = nextRule.getEligibleInterval(now);
if (currInterval.contains(nextInterval)) {
// If the current rule has an eternity interval, it covers everything following it.
// Or if the current rule still covers the next rule at the current interval boundaries, then the
// next rule will never fire at any time, so throw an exception.
if (Intervals.ETERNITY.equals(currInterval) ||
(currRule.getEligibleInterval(currInterval.getStart()).contains(nextRule.getEligibleInterval(currInterval.getStart()))
&& currRule.getEligibleInterval(currInterval.getEnd()).contains(nextRule.getEligibleInterval(currInterval.getEnd())))) {
throw InvalidInput.exception(
"Rule[%s] has an interval that contains interval for rule[%s]. The interval[%s] also covers interval[%s].",
currRule,
nextRule,
currInterval,
nextInterval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behaviour in the following situation:
Rule#1 - Covers the interval 2020-2021
Rule#2 - Covers the interval 2021-2022
Rule#3 - Covers the interval 2020-2022

In this case validateRules would not throw an exception, however, from my understanding, Rule#3 would never trigger.

If that's the case, we should handle these cases as well (probably by maintaining a set of intervals that the rules seen so far cover and checking if the next rule isn't completely overshadowed by the interval set seen so far)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a nice check to have as well. Let me think about it more and adjust as needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LakshSingla, actually, Druid currently does a fully contains interval check for drop rules, so rule #3 is valid with dropByInterval. So we can't really guarantee the overlapping condition in a type agnostic way (which apply to load rules, so we would probably need a separate contract)

Copy link
Contributor

@LakshSingla LakshSingla Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I didn't get why that is the case. Rule#1 spans from [2020-2021). rule #2 spans from [2021-2022). Therefore, collectively, rule#1 and rule#2 span from [2020-2022). Is there any instant in the timeline that would cause rule#3 to fire without triggering either rule#1 or rule#2.

Copy link
Contributor

@LakshSingla LakshSingla Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, rule #3 is valid for the time interval if it existed in isolation. However, when present along with other rules, the precedence of those takes place, causing rule#3 to be effectively useless, which is what this check exists to detect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Rule #3 would never be applied on any segment irrespective of the rule types.
While evaluating retention rules, as soon as a rule applies to a segment, we break and don't look at the other rules.

Also, in general, I am not entirely sure of the validation logic here.

  • For one, the set thing suggested by @LakshSingla makes sense. But we would have to ensure that the fully covered check is done correctly.
    • Additive logic (cleaner to implement): Say your set contains intervals A [4, 10) and B [15, 20) and a new interval C [8, 17) comes in. Adding this new interval C should merge all of A, B and C into one interval [4, 17).
    • Subtractive logic: To check if an interval has any part which is not fully covered by the preceding intervals, remove all parts of it that overlap with intervals already encountered. If you are left with an empty interval, the rule is not valid.
  • Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.
  • There could even be other types of rules in the future, but I guess we need not worry about them now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's consider a segment with an interval 2020/2022 and the following drop rules in order:

  1. dropByInterval('2020/2021')
  2. dropByInterval('2021/2022')
  3. dropByInterval('2020/2022')

For the segment in question, only rule 3 will match as the complete interval is contained in the drop interval.

On the other hand, if we consider loadByInterval rules for the same intervals, you're right that all the rules will apply for the segment as load rules will fire as long as there is at least an overlap.

Hope that clarifies why we cannot detect the overlapping case without baking in some type awareness, but let me know if I am missing something!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the incorrect assumption I made in my original example. This digression occurs from the semantics of how intervals are applied in load rules v/s in drop rules. In load rules, we only need for partial overlap for the rule to fire, while in the drop rule, we need a complete overlap for the rule to fire

However when we are doing a contains check in this logic, we are inherently assuming something about the structure of the rules.

Consider the following case:
The system has 2 types of segments, one ranging from 2020-2021 and the other from 2020-2022. If the user wants to drop the segments ranging from 2020-2021 while preserving the other segments, how does the user craft intuitive rules for the same without causing the newly added check to fail? I only came up with the following, however, that fails the check.

  1. Drop 2020-2021
  2. Load 2020-2022

I think this ties back to the common use case that we are trying to prevent the user from doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re the use case above - I'd write the same rules as you've listed in the example. The first rule's interval 2020/2021 does not contain the second rule's interval 2020/2022. So, I'm not sure why the validation check would flag this?

Copy link
Contributor Author

@abhishekrb19 abhishekrb19 Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kfaraz - please see my responses to your comments:

we would have to ensure that the fully covered check is done correctly.

To check if an interval fully contains another, the proposed algorithm has a nested loop that goes through the sequence of rules to find if a larger interval contains another subsequent interval down in the list; O(n^2) algorithmic runtime, but practical runtime should be trivial though.

I suppose there are alternate approaches like condensing intervals - the additive and subtractive logic you're describing. My first thought is that this could work well for the overlapping case (which isn't the intent of this patch).

Re this:

Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.

The boundary check primarily exists to allow the case where a rule will fire at some point. Unlike the forever and interval rules, period rules factor in the reference timestamp thereby yielding distinct intervals at the boundaries. Something like:

PeriodRuleR3.eligibleInterval(now) = IntervalR3
PeriodRuleR3.eligibleInterval(1yAgo) = IntervalR4

The boundaries check is essentially to verify that a legit rule will fire at some point and allow such a rule. For example, consider the rules [loadByPeriod(P10Y), loadByInterval('2020/2025')]:

  • With the boundary check (before and after 10 years from now), the first rule will not contain the second interval. So the validation check will not flag the rule set. The second rule will indeed fire at some point in the future.
  • Without the boundary check, the first rule will fully cover the second rule's interval, incorrectly flagging the second rule.

I realize I need to update the javadocs. :)

);
}
}
}
}
}
Loading