Inference on EKS - Trouble Deploying Ray Serve Cluster on EKS #446

Open
AditModi opened this issue Feb 23, 2024 · 5 comments

Comments

@AditModi

  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

Please describe your question here

I am encountering an issue while deploying Ray Serve Cluster on an EKS cluster following the documentation provided here.

The EKS cluster is created successfully using Terraform, but the Ray Serve Cluster remains in a pending state. When I check the describe output, the root error is as follows:

Normal   Nominated          20m                 karpenter           Pod should schedule on: nodeclaim/default-x9bvr
Normal   Nominated          4m50s               karpenter           Pod should schedule on: nodeclaim/default-shq8s
Warning  FailedScheduling   20s (x5 over 20m)   default-scheduler   0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Normal   NotTriggerScaleUp  6s (x121 over 20m)  cluster-autoscaler  pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector


#### Provide a link to the example/module related to the question

Just followed the documentation provided [here](https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/Llama2). 

#### Additional context

I have successfully created the EKS cluster using Terraform, but the Ray Serve Cluster deployment remains stuck in the pending state. The error seems related to node affinity/selector issues, and the cluster autoscaler is not triggering a scale-up.
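For reference, a minimal sketch of commands to compare what the pending pod is asking for against the labels on the existing nodes (the namespace and pod name below are placeholders; the blueprint deploys into the llama2 namespace):

```bash
# Placeholders: substitute the real namespace and the name of the pending pod
NS=llama2
POD=<pending-pod-name>

# What the pending pod is asking for (nodeSelector and affinity)
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeSelector}'; echo
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.affinity}'; echo

# What labels the existing nodes actually carry
kubectl get nodes --show-labels

# Full event history for the pending pod
kubectl describe pod "$POD" -n "$NS"
```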
@askulkarni2
Collaborator

@sanjeevrg89 can you please take a look? My guess is Karpenter is not able to provision the inf2 instance.
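For anyone debugging this, a minimal sketch of checking whether Karpenter is attempting to launch the inf2 capacity (assumes a Karpenter version that uses NodePools/NodeClaims; resource names and namespaces may differ per blueprint version):

```bash
# NodeClaims Karpenter has created, and details for the one nominated in the events
kubectl get nodeclaims
kubectl describe nodeclaim default-x9bvr   # nodeclaim name taken from the Nominated event above

# Check whether any NodePool allows inf2 instance types/families
kubectl get nodepools -o yaml | grep -i -B 2 -A 10 'instance-family\|instance-type'

# Karpenter controller logs often surface capacity or quota errors for inf2
# (namespace may be karpenter or kube-system depending on the blueprint)
kubectl logs -n karpenter deploy/karpenter | grep -iE 'inf2|insufficient|error' | tail -n 50
```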

@vara-bonthu
Collaborator

@AditModi It's not reproducible on the latest blueprint. Please try again and let us know.

@vara-bonthu vara-bonthu added the bug Something isn't working label Feb 26, 2024
@vara-bonthu vara-bonthu added author-action-required and removed bug Something isn't working labels Apr 14, 2024

This issue has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this issue will be closed in 10 days

@github-actions github-actions bot added the stale label May 15, 2024
@Gall-oDrone

Gall-oDrone commented Jun 14, 2024

I have the same error. I'm following the Gen AI examples, but I'm unable to run even a single Llama2 model. I'm trying to run the models on inf2.8xl (two instances) and inf2.24xl (one instance).
After running `kubectl get all -n llama2`, I get the following:
```
NAME                                            READY   STATUS    RESTARTS   AGE
pod/llama2-raycluster-fcmtr-head-bf58d          0/1     Pending   0          67m
pod/llama2-raycluster-fcmtr-worker-inf2-lgnb2   0/1     Pending   0          5m30s

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
service/llama2             ClusterIP   172.20.118.243   <none>        10001/TCP,8000/TCP,8080/TCP,6379/TCP,8265/TCP   67m
service/llama2-head-svc    ClusterIP   172.20.168.94    <none>        8080/TCP,6379/TCP,8265/TCP,10001/TCP,8000/TCP   57m
service/llama2-serve-svc   ClusterIP   172.20.61.167    <none>        8000/TCP                                        57m

NAME                                        DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY        GPUS   STATUS   AGE
raycluster.ray.io/llama2-raycluster-fcmtr   1                                     92     704565270Ki   0      ready    67m

NAME                       SERVICE STATUS   NUM SERVE ENDPOINTS
rayservice.ray.io/llama2   Restarting       2
```

Observations:

  • I'm using the latest Blueprint version
  • Set both the worker node minimum size and desired size: 2 for inf2.8xl and 1 for inf2.24xl, respectively.
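Since the RayCluster above reports GPUS 0 and the worker pod stays Pending, here is a small sketch of checking whether any inf2 nodes that did come up actually advertise Neuron devices (the `aws.amazon.com/neuron` resource name assumes the Neuron device plugin installed by the blueprint):

```bash
# Which nodes exist and their instance types
kubectl get nodes -L node.kubernetes.io/instance-type

# Allocatable resources per node; inf2 nodes should list aws.amazon.com/neuron here
kubectl describe nodes | grep -A 10 'Allocatable'

# Is the Neuron device plugin running? (daemonset name/namespace may differ per blueprint)
kubectl get pods -A | grep -i neuron
```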

@Gall-oDrone

Gall-oDrone commented Jul 16, 2024

> @AditModi It's not reproducible on the latest blueprint. Please try again and let us know.

Hi, I'm encountering the same issue on the latest blueprint, with each of the following models:

  1. Llama2 on Inferentia
  2. Llama3 on Inferentia
  3. Mistral-7B on Inferentia
  4. RayServe with vLLM: For this one, I created a similar issue: Error while running RayServe with vLLM blueprint template  #589 (comment)

P.S.: Is it possible to run these models on smaller inf2.8xlarge EC2 instances instead of inf2.48xlarge?
AWS tech support is limiting our service quota of vCPUs for inf2 instances.
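For what it's worth, a minimal sketch of inspecting the current inf2-related EC2 quotas with the AWS CLI, matching quota names by substring rather than assuming a specific quota code:

```bash
# List EC2 service quotas whose name mentions Inf instances, with current values
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'Inf')].{Name:QuotaName,Code:QuotaCode,Value:Value}" \
  --output table
```

An increase can then be requested with `aws service-quotas request-service-quota-increase --service-code ec2 --quota-code <code> --desired-value <vCPUs>`.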

@github-actions github-actions bot removed the stale label Aug 22, 2024