Preempt ordering issue #3962
Comments
I will check this later.
Thank you for describing the scenario in such detail.
@lowang-bh @Monokaix @JesseStutler
I don't see why your PR fixes the problem you describe. In that scenario it makes no difference whether the Job is sorted by UUID or by Name: if MinAvailable is 1 and the gang plugin is turned on, neither job-1 nor job-3 can be preempted below 1 replica.
preempt-dev-job3-low-nginx-1 and preempt-dev-job1-low-priority-nginx-1 each have and need 4 GPUs, spread across 2 separate nodes. preempt-dev-job3-low-nginx-1 -> node 1 (2 GPUs), node 2 (2 GPUs)
I believe we have MinAvailable 0 (or the default). What we found is that when the victim list (tasks) on a node was sorted by UUID rather than by name, the sort order wasn't consistent across all nodes. Does that make sense?
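To make the consistency claim concrete, here is a minimal Go sketch, not Volcano code. The `Task` type, the `sortVictims` and `jobOrder` helpers, and the job names are all hypothetical; only the pod names come from this thread. It sorts each node's victim list by job name, then task name, and shows that both nodes then list the jobs in the same relative order.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical, simplified stand-in for a scheduler task (pod).
type Task struct {
	Name    string // pod name
	JobName string // name of the owning job (assumed for illustration)
}

// sortVictims orders victims by job name, then by task name.
// Because the key depends only on names, every node applies the same total order.
func sortVictims(victims []Task) {
	sort.SliceStable(victims, func(i, j int) bool {
		if victims[i].JobName != victims[j].JobName {
			return victims[i].JobName < victims[j].JobName
		}
		return victims[i].Name < victims[j].Name
	})
}

// jobOrder returns the distinct job names in order of first appearance.
func jobOrder(victims []Task) []string {
	seen := map[string]bool{}
	var order []string
	for _, v := range victims {
		if !seen[v.JobName] {
			seen[v.JobName] = true
			order = append(order, v.JobName)
		}
	}
	return order
}

func main() {
	// Two per-node victim lists that arrive in different orders.
	node1 := []Task{
		{Name: "preempt-dev-job3-low-nginx-1", JobName: "preempt-dev-job3-low"},
		{Name: "preempt-dev-job1-low-priority-nginx-1", JobName: "preempt-dev-job1-low-priority"},
	}
	node2 := []Task{
		{Name: "preempt-dev-job1-low-priority-nginx-0", JobName: "preempt-dev-job1-low-priority"},
		{Name: "preempt-dev-job3-low-nginx-0", JobName: "preempt-dev-job3-low"},
	}
	sortVictims(node1)
	sortVictims(node2)
	fmt.Println(jobOrder(node1))
	fmt.Println(jobOrder(node2))
}
```

Any key that is unique and identical on every node would give the same guarantee; the sketch only illustrates why a name-based key is stable.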
First of all, I am a little confused by the description in your issue. You said that the 4 pods of job-2 occupy the 4 GPUs of Node-1, and that job2 and job3 each have two replicas occupying node-2, so the layout should be preempt-dev-job1-low-priority-nginx-1 (node 2, 1 GPU), preempt-dev-job1-low-priority-nginx-0 (node 2, 1 GPU), preempt-dev-job3-low-nginx-1 (node 2, 1 GPU), preempt-dev-job3-low-nginx-0 (node 2, 1 GPU):
Besides, the logic of the code is:
volcano/pkg/scheduler/actions/preempt/preempt.go, lines 261 to 276 in 7584551
This is not related to node order. Your minAvailable is 0 and job2 and job3 have the same priority, so there is no difference between sorting by UUID and sorting by Name: either job3's pods or job2's pods will be preempted first.
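As a rough illustration of the point about equal priorities, the following Go sketch uses a hypothetical comparator, not the code at the referenced lines. The `Job` type, `victimLess`, `orderJobs`, and the UID values are invented for the example. Lower-priority jobs sort first, and a tie is broken by a pluggable key, so with equal priorities the choice of key only changes which job comes first.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical, simplified job record.
type Job struct {
	Name     string
	UID      string
	Priority int
}

// victimLess puts lower-priority jobs first (preempted earlier).
// When priorities are equal, the order is decided by tieKey alone.
func victimLess(l, r Job, tieKey func(Job) string) bool {
	if l.Priority != r.Priority {
		return l.Priority < r.Priority
	}
	return tieKey(l) < tieKey(r)
}

// orderJobs returns job names in victim order without mutating the input.
func orderJobs(jobs []Job, tieKey func(Job) string) []string {
	sorted := append([]Job(nil), jobs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return victimLess(sorted[i], sorted[j], tieKey)
	})
	names := make([]string, len(sorted))
	for i, job := range sorted {
		names[i] = job.Name
	}
	return names
}

func main() {
	// Same priority, made-up UIDs chosen so the two keys disagree.
	jobs := []Job{
		{Name: "job-3", UID: "uid-a", Priority: 1},
		{Name: "job-1", UID: "uid-b", Priority: 1},
	}
	fmt.Println(orderJobs(jobs, func(j Job) string { return j.Name }))
	fmt.Println(orderJobs(jobs, func(j Job) string { return j.UID }))
}
```

Under this sketch both keys give a deterministic order as long as the key is unique per job; if two jobs ever shared a UID, as reported later in the thread, the UID tie-break would no longer separate them.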
Thanks for the feedback. The last explanation differed from the original one in the issue (I'm actually relaying information from one of my teammates). We do have a fork where we changed the sorting from job UID to job Name, and it's working for us (we were seeing 2 different jobs with the same UID). Do you know where the UID comes from? What guarantees that it's always unique? Also, happy to chat on the CNCF Slack if you are there. Thanks!
Description
We have 2 nodes with 4 GPUs each and we have the following jobs deployed
preempt-dev-job3-low-nginx-1, which is part of a gang job.
preempt-dev-job3-low-nginx-1 and preempt-dev-job1-low-priority-nginx-1 always show as second on the victims list. (Those are the ones picked as victims.)
Steps to reproduce the issue
Describe the results you received and expected
We expect the ordering of preemption victims to always be consistent.
What version of Volcano are you using?
1.10
Any other relevant information
No response