
Integration tests for concurrent append and replace #16755

Merged (17 commits) on Sep 3, 2024

Conversation

@AmatyaAvadhanula (Contributor) commented Jul 18, 2024

Adds integration tests for concurrent append and replace:

  • Streaming ingestion concurrent with auto-compaction at various granularities
  • Concurrent batch and MSQ tasks

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review July 19, 2024 03:53
@AmatyaAvadhanula AmatyaAvadhanula requested a review from kfaraz July 19, 2024 03:55
@kfaraz (Contributor) commented Jul 19, 2024

Thanks for the changes, @AmatyaAvadhanula !

At first glance, I have a couple of questions:

  • Why do we need the new locks API? Adding an API just for testing seems a little unnecessary. Is there a way to use any of the existing APIs for this purpose instead?
  • Please rename the test to ITConcurrentAppendReplaceTest or ITConcurrentStreamAppendReplaceTest (ConcurrentAutoCompactionTest sounds a little confusing)

},
"ioConfig": {
"%%TOPIC_KEY%%": "%%TOPIC_VALUE%%",
"%%STREAM_PROPERTIES_KEY%%": %%STREAM_PROPERTIES_VALUE%%,
A reviewer (Member) commented with a suggested change:
Suggested change:
- "%%STREAM_PROPERTIES_KEY%%": %%STREAM_PROPERTIES_VALUE%%,
+ "%%STREAM_PROPERTIES_KEY%%": "%%STREAM_PROPERTIES_VALUE%%",

@@ -0,0 +1,60 @@
{
A reviewer (Contributor) commented:
Is it possible to use an existing spec and just update the task context before submitting to the Overlord?

@@ -1383,6 +1383,20 @@ Map<String, NavigableMap<DateTime, SortedMap<Interval, List<TaskLockPosse>>>> ge
return running;
}

Set<TaskLock> getLocksForDatasource(final String datasource)
A reviewer (Contributor) commented:
I guess this is not used anymore since we have already added the /activeLocks API.

AmatyaAvadhanula (author) replied:
Removed

/**
* Retries until the segment count is as expected.
*/
private void ensureSegmentsCount(int numExpectedSegments)
A reviewer (Contributor) commented:
It seems a lot of the private methods were copied over from other ITs.
Is it possible to reuse this code instead of copying it over?

checkAndSetConcurrentLocks();
}

// Verify the state with minute granularity
A reviewer (Contributor) commented:
Should the MINUTE and ALL granularity verifications be separate tests of their own?

@kfaraz (Contributor) left a review comment:

Both the tests added here (MSQ INSERT + concurrent compaction and streaming ingest + concurrent compaction) seem to be inherently flaky. We are triggering the INSERT/streaming ingest 5 times in the hope that we get at least one case where there was an active REPLACE lock that overlapped with an active APPEND lock.

It would be better if we could force this condition somehow, possibly by manipulating the readiness condition of the underlying tasks.

See if the following approach could be viable:

  • Write a small extension (Maybe there is already an extension to be used by ITs only. If there is, you could just use that.)
  • This extension should register a new task ITCompactionTask for type compact.
  • The task can extend CompactionTask and override the isReady method.
  • In the isReady() method, wait on some condition to happen and only then move on to the super.isReady(). The condition can be based on some task context parameter.
  • While posting the compaction config in the new IT, make sure to set the above-mentioned context parameter appropriately.
  • The setup for the new IT will need to have this extension in the load list. (I think this can be done more easily in the new revised IT framework).

Edit: A shorter alternative to writing an extension could be as follows:

  • Modify the isReady() method of AbstractBatchTask or CompactionTask to look for a context parameter readyCondition.
  • The parameter readyCondition can take a value of some Condition class which can be based on time delay for now.
  • The code should look for this parameter only if a certain environment variable is set (say DRUID_IT_IMAGE_NAME).
  • The variable being set indicates that the environment is an IT environment and we should honor the IT state.
  • If the environment variable is not set, the isReady() check behaves as usual.
  • While submitting compaction config, make sure to set the context parameter correctly.

Cons: The only con of this approach is that it pollutes production code with test related logic.

But maybe that's not a bad thing. Maybe in the future we could have a variety of DruidEnvironments and code could behave differently based on the current environment: PROD, TEST, INTEGRATION_TEST, SIMULATION 🤔 .
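
For illustration, a rough standalone sketch of the readiness-gating idea described above. The class name, the readyCondition context key, and the delay handling are assumptions standing in for the real CompactionTask, TaskActionClient, and super.isReady(); this is not code from the Druid codebase.

import java.util.Map;
import java.util.concurrent.TimeUnit;

// Standalone sketch of the "gated readiness" idea: the task refuses to become ready
// until a test-controlled condition holds, forcing an APPEND lock and a REPLACE lock
// to coexist. These types are stand-ins, not the real Druid classes.
class GatedCompactionTaskSketch
{
  private final Map<String, Object> context;

  GatedCompactionTaskSketch(Map<String, Object> context)
  {
    this.context = context;
  }

  // In the real proposal this would override CompactionTask#isReady(TaskActionClient)
  // and fall through to super.isReady(), which is where the REPLACE lock is requested.
  boolean isReady() throws InterruptedException
  {
    Object delayMillis = context.get("readyCondition");
    if (delayMillis instanceof Number) {
      // Simplest possible condition: a fixed delay, giving the concurrent APPEND task
      // time to acquire its lock before the REPLACE lock is taken.
      TimeUnit.MILLISECONDS.sleep(((Number) delayMillis).longValue());
    }
    return true; // stand-in for super.isReady(taskActionClient)
  }
}

A test following the steps above would then submit the compaction config with a context entry such as "readyCondition": 30000, which this sketch interprets as a delay in milliseconds.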

private String fullDatasourceName;

private int currentRowCount = 0;
private boolean concurrentAppendAndReplaceLocksExisted;
A reviewer (Contributor) commented:
Why is this a field of the class? The checkAndSetConcurrentLocks method should return a boolean instead and the value should be verified in the relevant test.

AmatyaAvadhanula (author) replied:

checkAndSetConcurrentLocks is called from several methods that retry until a certain state is reached; this ensures that we check for these locks frequently.
Those methods have return values of their own, which is why I used a class-level variable for this check.
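
To make that pattern concrete, a minimal standalone sketch of what is being described: a retrying helper that also polls for overlapping APPEND/REPLACE locks on each attempt and records the observation in a class-level flag. The names and the polling stub are assumptions for illustration, not the actual IT code.

import java.util.concurrent.Callable;

// Sketch of a retrying helper that piggy-backs the concurrent-lock check on every
// attempt while keeping its own return value. Illustrative names only.
class ConcurrentLockObserverSketch
{
  private volatile boolean concurrentAppendAndReplaceLocksExisted = false;

  boolean retryUntilTrue(Callable<Boolean> condition, int maxRetries) throws Exception
  {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      checkAndSetConcurrentLocks();                 // lock check runs on every retry
      if (Boolean.TRUE.equals(condition.call())) {
        return true;                                // the helper's own result is unchanged
      }
      Thread.sleep(1_000);
    }
    return false;
  }

  private void checkAndSetConcurrentLocks()
  {
    // Stand-in for querying the Overlord's active locks and checking whether an
    // APPEND lock and a REPLACE lock currently overlap for the datasource.
    concurrentAppendAndReplaceLocksExisted |= pollOverlordForOverlappingLocks();
  }

  private boolean pollOverlordForOverlappingLocks()
  {
    return false; // placeholder for the real active-locks API call
  }
}

The test can then assert the flag once at the end (the approach described here), or, as the reviewer suggests, have the check return a boolean that each call site verifies directly.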

Comment on lines 284 to 291
submitAndVerifyCompactionConfig(datasource, null);

for (int i = 0; i < 5; i++) {
// Submit the task and wait for the datasource to get loaded
msqHelper.submitMsqTaskSuccesfully(queryLocal, ImmutableMap.of(Tasks.USE_CONCURRENT_LOCKS, true));

ensureRowCount(datasource, (i + 2) * 3, function);
}
A reviewer (Contributor) commented:
Don't we need to forceTriggerCompaction for this test?

AmatyaAvadhanula (author) replied:

ensureRowCount is one of the methods that checks for concurrent locks and force-triggers compaction on every retry; ensureSegmentsCount is another.
Please let me know if this should be done differently.

The reviewer (Contributor) replied:
Yeah, ensureRowCount sounds like a verification method. It shouldn't be responsible for triggering compaction too.
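
A minimal sketch of the separation being suggested, with hypothetical names rather than the actual test code: the test triggers compaction explicitly, and the row-count helper only verifies.

import java.util.function.IntSupplier;

// Sketch of the suggested split: the action (force-triggering compaction) happens in
// the test body, and the verification helper retries with no side effects.
class RowCountVerificationSketch
{
  void runIteration(Runnable forceTriggerCompaction, IntSupplier rowCount, int expectedRows)
      throws InterruptedException
  {
    forceTriggerCompaction.run();           // explicit action, visible in the test body
    verifyRowCount(rowCount, expectedRows); // pure verification
  }

  private void verifyRowCount(IntSupplier rowCount, int expectedRows) throws InterruptedException
  {
    for (int attempt = 0; attempt < 30; attempt++) {
      if (rowCount.getAsInt() == expectedRows) {
        return;
      }
      Thread.sleep(1_000);                  // retry without re-triggering compaction
    }
    throw new AssertionError("Row count never reached " + expectedRows);
  }
}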

if (CollectionUtils.isNullOrEmpty(locks)) {
return;
}
LOG.info(locks.toString());
A reviewer (Contributor) commented:
Please add more info in this log line.

AmatyaAvadhanula (author) replied:
This was added for debugging and has been removed. Thanks

@AmatyaAvadhanula (author) commented:

@kfaraz, compaction is being forcefully triggered immediately after the submission of the MSQ and kafka tasks and I believe this should guarantee the existence of concurrent locks.
The reason we repeat it several times is to try to have multiple permutations of (replace lock, append lock, replace commit, append commit) within the same test.

@kfaraz (Contributor) commented Aug 6, 2024

@kfaraz, compaction is being forcefully triggered immediately after the submission of the MSQ and kafka tasks and I believe this should guarantee the existence of concurrent locks.

Okay, that works for me. If the test is not flaky, then we can proceed.
Are we asserting the presence of concurrent locks after every iteration or only at the end? Let's do it after every iteration to be sure.

The reason we repeat it several times is to try to have multiple permutations of (replace lock, append lock, replace commit, append commit) within the same test.

But these permutations are still not guaranteed, right? We might end up with different orders in each run of the IT.

@AmatyaAvadhanula (author) replied:

We are asserting the concurrent locks quite frequently when we verify the segment count or row count after each iteration.

Yes, the idea isn't to ensure that all 6 valid permutations happen in a single IT. It's just to ensure that the IT is more robust. The permutations themselves are exhaustively tested as Unit tests.

Thanks for your feedback. I'll address it ASAP

@kfaraz (Contributor) commented Aug 6, 2024

Yes, the idea isn't to ensure that all 6 valid permutations happen in a single IT. It's just to ensure that the IT is more robust. The permutations themselves are exhaustively tested as Unit tests.

Okay, we can proceed for now.

For completeness though, we need clarity on some points and would like to make these future improvements in this test: a test is robust if it is highly deterministic; a test that may end up trying out different things in each run is not. With the multiple runs, what you are ensuring is that the code being tested is more robust, but not in a very deterministic manner.

If unit tests have already covered most of the cases, then it is okay if we test only a few scenarios in the IT.
But whatever we decide to test, the test state must be exactly reproducible.

@kfaraz (Contributor) left a review comment:

+1 after CI passes

@@ -444,6 +444,12 @@
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
</dependency>
<dependency>
<groupId>junit</groupId>
A reviewer (Contributor) commented:
Sorry for the confusion, @AmatyaAvadhanula! I didn't realize that these tests were using TestNG.

You can leave out the Assume condition that I had suggested, as that is the only thing that requires JUnit. There is no point in having dependencies on both JUnit and TestNG.

@kfaraz (Contributor) commented Sep 3, 2024

Merging this PR.

@kfaraz kfaraz merged commit 70bad94 into apache:master Sep 3, 2024
90 checks passed
@kfaraz kfaraz deleted the its_concurrent_append_replace branch September 3, 2024 09:28
AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Sep 4, 2024
kfaraz pushed a commit that referenced this pull request Sep 4, 2024
edgar2020 pushed a commit to edgar2020/druid that referenced this pull request Sep 5, 2024
IT for streaming tasks with concurrent compaction
edgar2020 pushed a commit to edgar2020/druid that referenced this pull request Sep 5, 2024
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024