
Investigate / review / setup node selection logic #1970

Open
Tracked by #1966
vexingly opened this issue Sep 18, 2024 · 9 comments

vexingly commented Sep 18, 2024

The nodepools will be updated in #1967, but we will require some logic for node selection.

  1. Keep the existing logic that launches notebooks requesting more than 14 CPU on the "usercpuXXxx" nodes.
  2. Add a taint/toleration so that OpenM++ jobs are scheduled onto these larger CPU-based nodes (see the sketch below).
  3. Investigate other methods for scheduling different scenarios. The UAT will provide more details on how users schedule their jobs, which could help work out a longer-term strategy.

Existing logic for notebooks is here: https://github.com/StatCan/aaw-toleration-injector/blob/main/mutate.go
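
For item 2, here is a minimal sketch of what the taint on the larger-CPU nodes and the matching toleration on an OpenM++ pod could look like. The taint key/value (`dedicated=openm`), the label, and the image are placeholders for illustration, not the values currently used by the injector:

```yaml
# Sketch only: the taint key/value (dedicated=openm), the label, and the image
# are assumptions, not the cluster's actual configuration.
#
# The larger-CPU nodes would carry a taint such as:
#   kubectl taint nodes <node-name> dedicated=openm:NoSchedule
#
# An OpenM++ job pod would then need the matching toleration:
apiVersion: v1
kind: Pod
metadata:
  name: openm-run-example
  labels:
    app.kubernetes.io/part-of: openm   # hypothetical label a webhook could key on
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: openm
      effect: NoSchedule
  containers:
    - name: model
      image: registry.example.com/openm/model-runner:latest   # placeholder image
      resources:
        requests:
          cpu: "16"      # above the 14-CPU notebook threshold mentioned above
          memory: 64Gi
```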

@jacek-dudek

Started looking at the code in the toleration injector controller. I will study it some more, post follow-up questions here, and request comments from Pat and others.

jacek-dudek commented Oct 16, 2024

Studied some alternative methods of node selection. There is one based on the nodeSelector field and node labels; another based on the affinity field, which allows more expressive conditions on node labels and also distinguishes between required and preferred conditions; and finally one based on node taints and corresponding tolerations.
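
To make the comparison concrete, here is a minimal sketch showing all three mechanisms in a single pod spec (in practice you would normally pick one). The label keys/values are hypothetical, not labels that exist on our nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-selection-demo
spec:
  # 1. nodeSelector: simple equality match on node labels
  nodeSelector:
    example.com/node-purpose: user-cpu        # hypothetical label
  # 2. affinity: richer expressions, with required vs. preferred rules
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: example.com/node-purpose
                operator: In
                values: ["user-cpu", "big-cpu"]
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: example.com/node-size
                operator: In
                values: ["d64"]
  # 3. taints/tolerations: the pod is allowed onto nodes carrying this taint
  tolerations:
    - key: example.com/dedicated
      operator: Equal
      value: big-cpu
      effect: NoSchedule
  containers:
    - name: main
      image: registry.example.com/openm/model-runner:latest   # placeholder image
```

One note on the taint-based option: tolerations only allow a pod onto a tainted pool, they don't attract it there, so a taint-based approach would typically be paired with a label/selector as well.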

@jacek-dudek

Pat, could you comment on what types of Kubernetes workloads are expected to be OpenM++ workloads? How will they be distinguished from other workloads? Do we have a set of labels in mind that will be applied to the pod manifests?

And do you prefer a particular node selection method over the others (i.e. toleration injection versus a nodeSelector or affinity specified in the pod spec)?

@vexingly
Author

Hi @jacek-dudek, I think there are two workloads to consider:

  1. Users creating notebook servers in Kubeflow and running their workload directly in the notebook:
    I think if we are moving to 16-CPU default nodes, then the current logic can probably stay as it is, i.e. users are restricted to 14-CPU notebooks, and perhaps we don't allow them to use more than that with this type of workload.

  2. Users who want to submit a Kubernetes Job or MPIJob using a specific manifest (either manually or via the OpenM++ UI and a template). For these I would prefer to keep using labels, like the big-cpu label.

I think we would need a new OpenM++/microsimulation-specific label to target a d64 node pool; is that what you were thinking, @Souheil-Yazji?
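
A rough sketch of what such a label-targeted manifest could look like; the workload label, the node-pool label (agentpool: d64pool), and the image are assumptions for illustration only:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: openm-microsim-run
  labels:
    workload-type: microsimulation           # hypothetical OpenM++/microsimulation label
spec:
  template:
    metadata:
      labels:
        workload-type: microsimulation
    spec:
      restartPolicy: Never
      nodeSelector:
        agentpool: d64pool                   # assumed label on a d64-based node pool
      containers:
        - name: model
          image: registry.example.com/openm/model-runner:latest   # placeholder image
          resources:
            requests:
              cpu: "32"
              memory: 128Gi
```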

Souheil-Yazji commented Oct 17, 2024

@jacek-dudek @vexingly

Just at an initial glance, it seems the best approach is to always have users submit their OpenM++ jobs as a separate workload. This will allow us to build the foundation for MPI jobs in the future, if that ever becomes functional.

This would also limit the cost of users scaling up larger notebooks to run jobs and then leaving the resources idle afterwards.
If users run their OpenM++ jobs in isolated pods that terminate once complete, this is ideal because:

  • it helps with costing, since resources scale down after the job completes
  • it helps with monitoring/logging, thanks to container-level isolation; we can then optimize resource provisioning based on the monitoring results
  • all of these workloads can be pushed to a different nodepool to prevent resource contention, and that nodepool can fully scale down once users are no longer working (though this does introduce an annoying ~5 min latency for the first job)
  • users can omit the node selector if they just want to run the workload container on the native nodepool (which will probably be much smaller than the CPU-optimized nodes)
  • the work done to make the UI submit MPIJobs can be re-used to submit regular pod specs with OpenM++ jobs instead
  • in the case of AAW, the nodes which run the notebooks are tainted, so small jobs running on them will need those tolerations as well

Whether we use a node selector label or a taint/toleration isn't a major concern.
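
A minimal sketch of that "isolated pod that terminates once complete" pattern, as a Job that cleans itself up after finishing; the notebook-node taint key and all names are placeholders, not AAW's actual values:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: openm-small-run
spec:
  ttlSecondsAfterFinished: 300     # Job and its pod are garbage-collected after completion
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      # No nodeSelector: the pod lands on the native/default nodepool. If it is
      # scheduled onto the tainted notebook nodes instead, it needs the matching
      # toleration; the key below is an assumed placeholder, not the real AAW taint.
      tolerations:
        - key: example.com/notebook-only
          operator: Exists
          effect: NoSchedule
      containers:
        - name: model
          image: registry.example.com/openm/model-runner:latest   # placeholder image
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
```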

@vexingly
Author

The two scenarios that I can see for users not submitting the jobs as a separate workload are:

  1. Users not familiar with this workflow will find it more complex; unless we can make it transparent, they will have some issues adjusting and will need some time to work up to a separate job workflow.

  2. When doing very small runs while building scripts, it would be easier / less complex to run locally, but I don't expect users to need many resources for this type of work.

Souheil-Yazji commented Oct 30, 2024

@vexingly
Next steps for this:

  • Advise users on best practices:
    1. Run small jobs on their own notebook, to avoid large costs from scaling up expensive infra.
    2. Run big jobs using custom OpenM++ Job manifests.
    3. Define what "small" and "big" mean, or at least provide a suggestion.
  • Create a custom OpenM++ manifest template that end users can submit, which includes the appropriate labels/tolerations to schedule jobs to the big-cpu nodepool (currently only at 1 node per pool); a rough template sketch follows below. Then either:
    1. Add a nodepool with a different VMSS type, or
    2. Increase the nodepool limit for big-cpu.
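
A rough sketch of what such a submit-able template could look like, using shell-style ${...} placeholders; the templating mechanism, the big-cpu label/taint key, and all names are assumptions, not settled decisions:

```yaml
# openm-job-template.yaml -- hypothetical template; users (or the OpenM++ UI)
# would fill in ${MODEL_NAME}, ${MODEL_IMAGE}, ${CPU} and ${MEMORY}, e.g. with envsubst.
apiVersion: batch/v1
kind: Job
metadata:
  name: openm-${MODEL_NAME}
  labels:
    app: openm
spec:
  ttlSecondsAfterFinished: 600
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        example.com/node-purpose: big-cpu    # assumed label on the big-cpu nodepool
      tolerations:
        - key: example.com/dedicated         # assumed taint on the big-cpu nodepool
          operator: Equal
          value: big-cpu
          effect: NoSchedule
      containers:
        - name: model
          image: ${MODEL_IMAGE}
          resources:
            requests:
              cpu: "${CPU}"
              memory: "${MEMORY}"
```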

@vexingly
Author

I think notebooks should target intermittent workloads of ~4 CPU, and we should over-provision / expect some slowness with multiple users; these are non-production runs, more development, testing, and configuration.

When you say big-cpu, do you mean the 72-core machines? Is that what we will use for the time being? They may not have enough memory for some users' workloads, although the CPUs are sufficient. We will need more nodes for sure; I think each of the 4 projects has a quota of 200 CPUs currently.

@Souheil-Yazji
Contributor

We'll need concrete resource scales to understand whether the f72 machines are appropriate.
@vexingly, let me know if we have those numbers.
