Change priority for scheduling reroute during timeout #16445

imRishN · 2024-10-23T05:07:51Z

Description

This PR changes the priority of the reroute that is scheduled after an allocator times out, from HIGH to NORMAL. Repeated HIGH priority reroutes can starve NORMAL priority tasks, and NORMAL is sufficient for clusters in a healthy state. For clusters in a degraded state, where NORMAL priority tasks are themselves being starved, this PR adds a new dynamic cluster setting that raises the priority of the reroute task so that shards can still be allocated.
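The starvation concern described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the `Priority` enum, the `Task` class, and the task names are hypothetical stand-ins for a prioritized task queue, not OpenSearch code, and each reroute is assumed to time out and schedule a follow-up reroute at the same priority.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a prioritized task queue; not OpenSearch code.
class RerouteStarvationSketch {

    // Declared from most to least urgent so that enum order gives queue order.
    enum Priority { HIGH, NORMAL }

    static final class Task implements Comparable<Task> {
        private static final AtomicLong SEQUENCE = new AtomicLong();

        final String name;
        final Priority priority;
        // Insertion counter used as a FIFO tie-breaker within one priority.
        final long seq = SEQUENCE.getAndIncrement();

        Task(String name, Priority priority) {
            this.name = name;
            this.priority = priority;
        }

        @Override
        public int compareTo(Task other) {
            int byPriority = priority.compareTo(other.priority);
            return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
        }
    }

    /**
     * Executes up to {@code rounds} tasks. Every executed reroute is assumed to
     * time out and schedules a follow-up reroute at {@code reroutePriority}.
     * Returns the names of the executed tasks in execution order.
     */
    static List<String> simulate(Priority reroutePriority, int rounds) {
        PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>();
        queue.add(new Task("reroute", reroutePriority));
        queue.add(new Task("create-index", Priority.NORMAL));

        List<String> executed = new ArrayList<>();
        for (int i = 0; i < rounds && !queue.isEmpty(); i++) {
            Task task = queue.poll();
            executed.add(task.name);
            if (task.name.equals("reroute")) {
                queue.add(new Task("reroute", reroutePriority));
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println("HIGH:   " + simulate(Priority.HIGH, 4));
        System.out.println("NORMAL: " + simulate(Priority.NORMAL, 4));
    }
}
```

In this model, a follow-up reroute at HIGH is always polled ahead of the waiting NORMAL task, so `create-index` never runs within the simulated window. At NORMAL, the follow-up reroute queues behind it and `create-index` runs second.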

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

~~[ ] Functionality includes testing.~~
~~[ ] API changes companion pull request created, if applicable.~~
~~[ ] Public documentation issue/PR created, if applicable.~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishab Nahata <[email protected]>

github-actions · 2024-10-23T05:22:27Z

❌ Gradle check result for 5e83a92: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Bukhtawar · 2024-10-23T06:03:30Z

server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java

-                            "reroute after existing shards allocator timed out",
-                            Priority.HIGH,
+                            "reroute after existing shards allocator [R] timed out",
+                            Priority.NORMAL,


Should we have a separate priority for primary vs replica?

NORMAL also seems right for PSA. But during genuine cluster issues, which can be identified with appropriate monitoring, we might need to raise it to HIGH. I will update the PR to add a setting for ESA, similar to the one for BSA, to raise the reroute priority. Wdyt?

Bukhtawar

Let's update the PR description

imRishN · 2024-10-23T06:49:50Z

Let's update the PR description

Updated

Signed-off-by: Rishab Nahata <[email protected]>

github-actions · 2024-10-23T15:16:11Z

❌ Gradle check result for 6a448d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <[email protected]>

github-actions · 2024-10-23T16:50:21Z

❌ Gradle check result for 825a983: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <[email protected]>

github-actions · 2024-10-23T20:09:53Z

❌ Gradle check result for 5368e7f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <[email protected]>

github-actions · 2024-10-24T03:58:35Z

❌ Gradle check result for 2ba604d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-10-25T06:53:02Z

✅ Gradle check result for 2ba604d: SUCCESS

codecov · 2024-10-25T06:53:29Z

Codecov Report

Attention: Patch coverage is 88.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.09%. Comparing base (72559bf) to head (7329867).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../allocation/allocator/BalancedShardsAllocator.java | 78.57% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #16445      +/-   ##
============================================
- Coverage     72.11%   72.09%   -0.03%     
- Complexity    65071    65091      +20     
============================================
  Files          5313     5313              
  Lines        303413   303437      +24     
  Branches      43906    43908       +2     
============================================
- Hits         218816   218769      -47     
- Misses        66639    66785     +146     
+ Partials      17958    17883      -75


Signed-off-by: Rishab Nahata <[email protected]>

github-actions · 2024-10-28T16:39:43Z

✅ Gradle check result for 7329867: SUCCESS

Signed-off-by: Rishab Nahata <[email protected]>

github-actions · 2024-10-28T19:54:40Z

✅ Gradle check result for 33ffefb: SUCCESS

Bukhtawar · 2024-10-28T23:08:58Z

server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java

+        Setting.Property.NodeScope,
+        Setting.Property.Dynamic
+    );
+


This logic seems redundant

Do you mean to parse reroute priority?
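For readers following this thread, here is a minimal sketch of what parsing and validating a reroute priority value could look like. Everything in it is an illustrative assumption: the local `Priority` enum, the set of allowed values, and the method name `parseReroutePriority` are hypothetical and are not taken from the PR's code.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical parser sketch; names and allowed values are assumptions.
class ReroutePriorityParserSketch {

    // Local stand-in enum so the sketch is self-contained.
    enum Priority { URGENT, HIGH, NORMAL, LOW }

    // Assumption: only these values are sensible for a follow-up reroute.
    private static final Set<Priority> ALLOWED = EnumSet.of(Priority.NORMAL, Priority.HIGH);

    /** Parses a case-insensitive priority name and rejects unsupported values. */
    static Priority parseReroutePriority(String value) {
        Priority parsed;
        try {
            parsed = Priority.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown reroute priority [" + value + "]", e);
        }
        if (!ALLOWED.contains(parsed)) {
            throw new IllegalArgumentException(
                "priority [" + value + "] is not supported for reroute; allowed: " + ALLOWED
            );
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseReroutePriority("high"));
    }
}
```

Restricting the accepted values at parse time means a misconfigured priority would be rejected when the setting is applied, rather than surfacing later when the reroute is scheduled.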

linuxpi · 2024-11-11T05:01:48Z

...c/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

+     */
+    public static final Setting<Priority> FOLLOW_UP_REROUTE_PRIORITY_SETTING = new Setting<>(
+        "cluster.routing.allocation.balanced_shards_allocator.schedule_reroute.priority",
+        Priority.NORMAL.toString(),


We should add a changelog entry, since we are changing the default priority from HIGH to NORMAL.
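Since the default moves to NORMAL, an operator who wants the earlier behaviour would have to raise the priority through the cluster settings API. The sketch below builds such a request without sending it. It assumes the setting is merged with the key shown in the diff above and that it accepts the value `HIGH`; the base URL `http://localhost:9200` and the class and method names are illustrative.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; builds, but does not send, a cluster settings update.
class RaiseReroutePrioritySketch {

    // Setting key copied from the diff under review.
    static final String SETTING_KEY =
        "cluster.routing.allocation.balanced_shards_allocator.schedule_reroute.priority";

    /** JSON body for a persistent cluster settings update. */
    static String settingsBody(String priority) {
        return "{\"persistent\":{\"" + SETTING_KEY + "\":\"" + priority + "\"}}";
    }

    /** PUT request against the _cluster/settings endpoint of the given base URL. */
    static HttpRequest buildRequest(String baseUrl, String priority) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(priority)))
            .build();
    }

    public static void main(String[] args) {
        // Assumed local endpoint; adjust for a real cluster.
        HttpRequest request = buildRequest("http://localhost:9200", "HIGH");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody("HIGH"));
        // To send: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Because the setting is declared `Dynamic`, an update like this would be expected to take effect without a node restart.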

opensearch-trigger-bot · 2024-12-11T15:22:42Z

This PR is stalled because it has been open for 30 days with no activity.

Change priority for scheduling reroute in timeout

5e83a92

Signed-off-by: Rishab Nahata <[email protected]>

Bukhtawar reviewed Oct 23, 2024


imRishN added the skip-changelog label Oct 23, 2024

Add setting for ESA

6a448d0

Signed-off-by: Rishab Nahata <[email protected]>

imRishN marked this pull request as ready for review October 23, 2024 14:39

imRishN requested review from anasalkouz, andrross, ashking94, CEHENKLE, dblock, dbwiddis, gbbafna, jainankitk, kotwanikunal, linuxpi, mch2, msfroh, nknize, owaiskazi19, reta, Rishikesh1159, sachinpkale, saratvemulapalli, shwetathareja, sohami and VachaShah as code owners October 23, 2024 14:39

Fix tests

825a983

Signed-off-by: Rishab Nahata <[email protected]>

imRishN changed the title ~~Change priority for scheduling reroute in timeout~~ Change priority for scheduling reroute during timeout Oct 23, 2024

Bukhtawar approved these changes Oct 23, 2024


opensearch-ci-bot mentioned this pull request Oct 23, 2024

[AUTOCUT] Gradle Check Flaky Test Report for RemotePrimaryLocalRecoveryIT #14314

Open

Trigger Build

5368e7f

Signed-off-by: Rishab Nahata <[email protected]>

Trigger Build

2ba604d

Signed-off-by: Rishab Nahata <[email protected]>

This was referenced Oct 24, 2024

[AUTOCUT] Gradle Check Flaky Test Report for RemoteFsTimestampAwareTranslogTests #15818

Open

[AUTOCUT] Gradle Check Flaky Test Report for SearchRestCancellationIT #14311

Open

imRishN added 2 commits October 28, 2024 20:54

Add test

6e3b4d0

Signed-off-by: Rishab Nahata <[email protected]>

Merge branch 'main' into followup-rerroute-priority

7329867

Signed-off-by: Rishab Nahata <[email protected]>

Trigger Build

33ffefb

Signed-off-by: Rishab Nahata <[email protected]>

Bukhtawar reviewed Oct 28, 2024


linuxpi reviewed Nov 11, 2024


opensearch-trigger-bot bot added the stalled Issues that have stalled label Dec 11, 2024

