
[PROPOSAL] Apply GPU acceleration to ML Node in k8s operator #906

Open
YeonghyeonKO opened this issue Nov 18, 2024 · 7 comments
Labels: enhancement (New feature or request), question (User questions. Neither a bug nor feature request.)

Comments


YeonghyeonKO commented Nov 18, 2024

What are you proposing?

Since @stevapple has already suggested in #832 how to override the CUDA image in nodePools[] of the OpenSearch k8s operator, this issue is an extended discussion of that concept.

  1. Provision GPU-dedicated worker node(s) in the existing K8s cluster.
  2. Prepare a CUDA-based image for OpenSearch.
  3. Deploy OpenSearch ML node(s) using nodeSelector in the OpenSearch k8s operator.
    • The CR already offers us the nodeSelector property, as below:
  nodePools:
    - component: ml
      replicas: 3
      diskSize: "10Gi"
      nodeSelector: # <- assign ML nodes (i.e. Pods) to the GPU worker node
        gpu: "true" # example label on the GPU worker node
      roles:
        - "ml"
      resources:
        requests: {}
        limits: {}
  4. Test whether the ML node actually uses the GPU resources or not.

But that is all for now, because a GPU-enabled image can only be set cluster-wide via opensearchCluster.general.image in the YAML; it cannot be set per node pool.
Is there any further progress on this idea?
To implement GPU acceleration (GPU acceleration - OpenSearch Documentation), we need guidelines and an exact standard for it.
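
For reference, the only workaround available today is to set a CUDA-enabled image cluster-wide. A minimal sketch, assuming a self-built image (the image name and version below are purely illustrative, since no official CUDA build is published yet):

  apiVersion: opensearch.opster.io/v1
  kind: OpenSearchCluster
  metadata:
    name: my-cluster
  spec:
    general:
      serviceName: my-cluster
      version: 2.17.0
      # hypothetical, self-built CUDA-enabled image; applies to every node pool
      image: my-registry.example.com/opensearch-cuda:2.17.0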

@prudhvigodithi (Member)

[Triage]
Hey @YeonghyeonKO, coming from my previous comment #832 (comment): today, CUDA built-in OpenSearch images are not officially released by the project (related issue: opensearch-project/opensearch-build#4743). Is there already an open-source, CUDA-supported OpenSearch image?

Coming from the operator side, yes, it is a limitation: a custom image at the nodePool level (spec.nodePools[0].image) is not supported in the NodePool struct. Is there a way you can PR this change?
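
For illustration, the change being discussed would add an optional image field to the nodePool spec, roughly like the sketch below (this field does not exist today, and the image name is hypothetical):

  nodePools:
    - component: ml
      replicas: 3
      diskSize: "10Gi"
      roles:
        - "ml"
      # proposed per-node-pool override, not yet supported by the NodePool struct
      image: my-registry.example.com/opensearch-cuda:2.17.0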

Regarding GPU acceleration - OpenSearch Documentation, adding @dblock to provide some guidelines.

Thank you
@stevapple @getsaurabh02 @peterzhuamazon @gaiksaya @bshien @swoehrl-mw

@prudhvigodithi added the enhancement (New feature or request) and question (User questions. Neither a bug nor feature request.) labels and removed the untriaged (Issues that have not yet been triaged) label on Nov 21, 2024
@peterzhuamazon (Member)

@prudhvigodithi (Member)

Adding @vamshin to provide some thoughts on CUDA OpenSearch development. More than the operator, I feel this issue should be part of the ml-commons repo to continue the discussion related to GPU acceleration - OpenSearch Documentation.
Thank you

vamshin (Member) commented Nov 22, 2024

@YeonghyeonKO could you please help us with use cases for GPU with ML nodes?

@YeonghyeonKO (Author)

@prudhvigodithi Thanks for continuing the discussion. I totally agree with you that building the CUDA image takes priority over the nodePools[] change.

@YeonghyeonKO (Author)

@vamshin Sure, I would happily help as a tester, deploying the CUDA image to hosts (k8s worker nodes) where the NVIDIA toolkit has been installed.

YeonghyeonKO (Author) commented Nov 26, 2024

@vamshin
To host a CUDA-based image for OpenSearch, I've added a V100-type GPU node to the k8s cluster (see the spec below):

Capacity:
  cpu:                80
  ephemeral-storage:  204700Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527862092Ki
  nvidia.com/gpu:     8
  pods:               110

ML nodes will be deployed on this GPU worker node. I have already tested deploying them via the opensearch-k8s-operator using the existing OpenSearch Docker image instead of a CUDA-based one. To avoid interference from nodes with other roles, the nodeSelector and tolerations properties are used (see the snippet below, followed by a sketch of the GPU resource request):

  nodePools:
    - component: ml
      replicas: 3
      nodeSelector:
        gpu: "true"
      tolerations:
      - key: node.kubernetes.io/unschedulable
        operator: "Exists"
        effect: "NoSchedule"
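
Beyond scheduling the pods onto the GPU node, the ML pods would also need to request the GPU explicitly so the NVIDIA device plugin (which advertises the nvidia.com/gpu capacity shown above) can expose it to the container. A minimal sketch, assuming the operator passes the node pool's resources through to the pod spec (values are illustrative):

  nodePools:
    - component: ml
      replicas: 3
      nodeSelector:
        gpu: "true"
      resources:
        requests:
          nvidia.com/gpu: 1   # extended resource advertised by the NVIDIA device plugin
        limits:
          nvidia.com/gpu: 1   # for extended resources, requests must equal limits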
