
[PROPOSAL] Apply GPU acceleration to ML Node in k8s operator #906

Open
YeonghyeonKO opened this issue Nov 18, 2024 · 7 comments
Labels: enhancement (New feature or request), question (User questions. Neither a bug nor feature request.)

Comments


YeonghyeonKO commented Nov 18, 2024

What are you proposing?

Since @stevapple has already suggested in #832 how to override the CUDA image in nodePools[] of the OpenSearch k8s operator, this issue is an extended discussion of that concept.

  1. Provision GPU-dedicated worker node(s) in the existing K8s cluster.
  2. Prepare a CUDA-based image for OpenSearch.
  3. Deploy OpenSearch ML node(s) using nodeSelector in the OpenSearch k8s operator.
    • The CR already offers us the nodeSelector property, as below:
  nodePools:
    - component: ml
      replicas: 3
      diskSize: "10Gi"
      nodeSelector: # <- assign ML nodes (i.e. Pods) to the GPU worker node
        gpu: "true" # example label on the GPU worker node
      roles:
        - "ml"
      resources:
        requests: {}
        limits: {}
  4. Test whether the ML node actually uses the GPU resources or not.

But that is all for now, because a GPU-enabled image can only be set cluster-wide via opensearchCluster.general.image in the YAML; it cannot be set per node pool.
Is there any further progress on this idea?
To implement GPU acceleration (GPU acceleration - OpenSearch Documentation), we need guidelines and an exact standard for it.
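
For reference, the only workaround available today is to set a CUDA-enabled image cluster-wide. A minimal sketch, assuming a self-built image (the image name and version below are purely illustrative, since no official CUDA build is published yet):

  apiVersion: opensearch.opster.io/v1
  kind: OpenSearchCluster
  metadata:
    name: my-cluster
  spec:
    general:
      serviceName: my-cluster
      version: 2.17.0
      # hypothetical, self-built CUDA-enabled image; applies to every node pool
      image: my-registry.example.com/opensearch-cuda:2.17.0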

@prudhvigodithi (Member)

[Triage]
Hey @YeonghyeonKO, coming from my previous comment #832 (comment): today, CUDA built-in OpenSearch images are not officially released by the project (related issue: opensearch-project/opensearch-build#4743). Is there already an open-source, CUDA-supported OpenSearch image?

Coming from the operator side, yes, it is a limitation: a custom image at the nodePool level (spec.nodePools[0].image) is not supported in the NodePool struct. Is there a way you can PR this change?
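
For illustration, the change being discussed would add an optional image field to the nodePool spec, roughly like the sketch below (this field does not exist today, and the image name is hypothetical):

  nodePools:
    - component: ml
      replicas: 3
      diskSize: "10Gi"
      roles:
        - "ml"
      # proposed per-node-pool override, not yet supported by the NodePool struct
      image: my-registry.example.com/opensearch-cuda:2.17.0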

Regarding GPU acceleration - OpenSearch Documentation, adding @dblock to provide some guidelines.

Thank you
@stevapple @getsaurabh02 @peterzhuamazon @gaiksaya @bshien @swoehrl-mw

@prudhvigodithi added the enhancement (New feature or request) and question (User questions. Neither a bug nor feature request.) labels and removed the untriaged (Issues that have not yet been triaged) label on Nov 21, 2024
@peterzhuamazon (Member)

@prudhvigodithi (Member)

Adding @vamshin to provide some thoughts on CUDA OpenSearch development. More than the operator, I feel this issue should be part of the ml-commons repo to continue the discussion related to GPU acceleration - OpenSearch Documentation.
Thank you

vamshin (Member) commented Nov 22, 2024

@YeonghyeonKO could you please help us with use cases for GPU with ML nodes?

@YeonghyeonKO (Author)

@prudhvigodithi Thanks for continuing the discussion. I totally agree with you that building the CUDA image takes priority over the nodePools[] change.

@YeonghyeonKO (Author)

@vamshin Sure, I would happily help as a tester, deploying the CUDA image to hosts (k8s worker nodes) where the NVIDIA toolkit has been installed.

YeonghyeonKO (Author) commented Nov 26, 2024

@vamshin
To host a CUDA-based image for OpenSearch, I've added a V100-type GPU node to the k8s cluster (see the spec below):

Capacity:
  cpu:                80
  ephemeral-storage:  204700Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527862092Ki
  nvidia.com/gpu:     8
  pods:               110

ML nodes will be deployed on this GPU worker node. I have already tested deploying them via the opensearch-k8s-operator using the existing OpenSearch Docker image instead of a CUDA-based one. To avoid interference from nodes with other roles, the nodeSelector and tolerations properties are used (see the snippet below, followed by a sketch of the GPU resource request):

  nodePools:
    - component: ml
      replicas: 3
      nodeSelector:
        gpu: "true"
      tolerations:
      - key: node.kubernetes.io/unschedulable
        operator: "Exists"
        effect: "NoSchedule"
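
Beyond scheduling the pods onto the GPU node, the ML pods would also need to request the GPU explicitly so the NVIDIA device plugin (which advertises the nvidia.com/gpu capacity shown above) can expose it to the container. A minimal sketch, assuming the operator passes the node pool's resources through to the pod spec (values are illustrative):

  nodePools:
    - component: ml
      replicas: 3
      nodeSelector:
        gpu: "true"
      resources:
        requests:
          nvidia.com/gpu: 1   # extended resource advertised by the NVIDIA device plugin
        limits:
          nvidia.com/gpu: 1   # for extended resources, requests must equal limits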
