Change priority for scheduling reroute during timeout #16445

Open
wants to merge 8 commits into
base: main
from

Conversation

imRishN
Member

@imRishN imRishN commented Oct 23, 2024

Description

This PR lowers the priority of the follow-up reroute that is scheduled after an allocator times out, from HIGH to NORMAL. Repeated HIGH-priority reroutes can starve NORMAL-priority tasks, and NORMAL is the appropriate level for healthy clusters. For clusters in a degraded state, where NORMAL-priority tasks are themselves starved, this PR adds a new dynamic cluster setting that raises the priority of the reroute task so that shards can still be allocated.
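
To illustrate the starvation concern described above, here is a minimal sketch of a prioritized task queue. It is a simplified model, not OpenSearch code: `RerouteStarvationSketch`, `Task`, `simulate`, and the task names are hypothetical, and the only assumption is that a timed-out reroute immediately schedules another reroute at the configured priority.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a prioritized cluster-manager task queue; not OpenSearch code.
class RerouteStarvationSketch {

    // Declared from most to least urgent so that ordinal() orders the queue.
    enum Priority { HIGH, NORMAL }

    static final class Task implements Comparable<Task> {
        private static final AtomicLong SEQUENCE = new AtomicLong();

        final String name;
        final Priority priority;
        final long sequence = SEQUENCE.getAndIncrement(); // FIFO tie-break within one priority

        Task(String name, Priority priority) {
            this.name = name;
            this.priority = priority;
        }

        @Override
        public int compareTo(Task other) {
            int byPriority = Integer.compare(priority.ordinal(), other.priority.ordinal());
            return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
        }
    }

    /**
     * Executes at most maxSteps tasks. Every executed "reroute" is assumed to time out
     * and to schedule a follow-up reroute at reroutePriority.
     * Returns the names of the executed tasks in execution order.
     */
    static List<String> simulate(Priority reroutePriority, int maxSteps) {
        PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>();
        queue.add(new Task("reroute", reroutePriority));
        queue.add(new Task("user-task", Priority.NORMAL));

        List<String> executed = new ArrayList<>();
        for (int step = 0; step < maxSteps && !queue.isEmpty(); step++) {
            Task task = queue.poll();
            executed.add(task.name);
            if (task.name.equals("reroute")) {
                queue.add(new Task("reroute", reroutePriority)); // follow-up after timeout
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println("HIGH:   " + simulate(Priority.HIGH, 5));
        System.out.println("NORMAL: " + simulate(Priority.NORMAL, 5));
    }
}
```

In this model, HIGH follow-ups always sort ahead of the waiting NORMAL task, so it never runs within the step budget. With NORMAL follow-ups, the FIFO tie-break lets the waiting task run before the next reroute.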

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for 5e83a92: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines 346 to 347
- "reroute after existing shards allocator timed out",
- Priority.HIGH,
+ "reroute after existing shards allocator [R] timed out",
+ Priority.NORMAL,
Collaborator

Should we have a separate priority for primary vs replica?

Member Author

NORMAL also seems right for PSA. But during genuine cluster issues, which can be identified with appropriate monitoring, we might need to raise it to HIGH. I will update the PR with a setting for ESA, similar to the one for BSA, to raise the reroute priority. What do you think?
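
As a rough sketch of the idea in this reply (a per-allocator reroute priority that defaults to NORMAL and can be raised at runtime), the following uses hypothetical names throughout. `FollowUpReroutePriorities`, `Allocator`, and `update` are illustrative and are not the OpenSearch `Setting` API or code from this PR. It also assumes ESA and BSA refer to the existing shards allocator and the balanced shards allocator.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical holder for per-allocator follow-up reroute priorities; not OpenSearch code.
class FollowUpReroutePriorities {

    enum Priority { HIGH, NORMAL }

    // Assumed expansions of the ESA and BSA abbreviations used in the thread.
    enum Allocator { EXISTING_SHARDS, BALANCED_SHARDS }

    private final Map<Allocator, Priority> current = new EnumMap<>(Allocator.class);

    FollowUpReroutePriorities() {
        // Every allocator starts at NORMAL, matching the new default proposed in the PR.
        for (Allocator allocator : Allocator.values()) {
            current.put(allocator, Priority.NORMAL);
        }
    }

    /**
     * Applies a runtime update from a string value such as "high" or "NORMAL".
     * Priority.valueOf throws IllegalArgumentException for unknown names,
     * in which case the previous value is left unchanged.
     */
    synchronized void update(Allocator allocator, String value) {
        Priority parsed = Priority.valueOf(value.trim().toUpperCase(Locale.ROOT));
        current.put(allocator, parsed);
    }

    /** Priority to use when scheduling the follow-up reroute for the given allocator. */
    synchronized Priority priorityFor(Allocator allocator) {
        return current.get(allocator);
    }

    public static void main(String[] args) {
        FollowUpReroutePriorities priorities = new FollowUpReroutePriorities();
        priorities.update(Allocator.EXISTING_SHARDS, "high");
        System.out.println("ESA: " + priorities.priorityFor(Allocator.EXISTING_SHARDS));
        System.out.println("BSA: " + priorities.priorityFor(Allocator.BALANCED_SHARDS));
    }
}
```

Keeping one value per allocator means an operator could raise the priority for one timeout path without affecting the other.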

Collaborator

@Bukhtawar Bukhtawar left a comment

Let's update the PR description

Member Author

imRishN commented Oct 23, 2024

Let's update the PR description

Updated

Signed-off-by: Rishab Nahata <[email protected]>
Contributor

❌ Gradle check result for 6a448d0: FAILURE


Signed-off-by: Rishab Nahata <[email protected]>
@imRishN changed the title from "Change priority for scheduling reroute in timeout" to "Change priority for scheduling reroute during timeout" on Oct 23, 2024
Contributor

❌ Gradle check result for 825a983: FAILURE


Signed-off-by: Rishab Nahata <[email protected]>
Contributor

❌ Gradle check result for 5368e7f: FAILURE


Signed-off-by: Rishab Nahata <[email protected]>
Contributor

❌ Gradle check result for 2ba604d: FAILURE


Contributor

✅ Gradle check result for 2ba604d: SUCCESS


codecov bot commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 88.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.09%. Comparing base (72559bf) to head (7329867).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../allocation/allocator/BalancedShardsAllocator.java | 78.57% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16445      +/-   ##
============================================
- Coverage     72.11%   72.09%   -0.03%     
- Complexity    65071    65091      +20     
============================================
  Files          5313     5313              
  Lines        303413   303437      +24     
  Branches      43906    43908       +2     
============================================
- Hits         218816   218769      -47     
- Misses        66639    66785     +146     
+ Partials      17958    17883      -75     


Contributor

✅ Gradle check result for 7329867: SUCCESS

Signed-off-by: Rishab Nahata <[email protected]>
Contributor

✅ Gradle check result for 33ffefb: SUCCESS

Setting.Property.NodeScope,
Setting.Property.Dynamic
);

Collaborator

This logic seems redundant

Member Author

Do you mean to parse reroute priority?

Collaborator

yes

*/
public static final Setting<Priority> FOLLOW_UP_REROUTE_PRIORITY_SETTING = new Setting<>(
"cluster.routing.allocation.balanced_shards_allocator.schedule_reroute.priority",
Priority.NORMAL.toString(),
Collaborator

We should add a changelog entry, since we are changing the default priority from HIGH to NORMAL.

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot added the "stalled" label (Issues that have stalled) on Dec 11, 2024
Labels
skip-changelog stalled Issues that have stalled
3 participants