Inference on EKS - Trouble Deploying Ray Serve Cluster on EKS #446

Open
AditModi opened this issue Feb 23, 2024 · 5 comments

Comments

@AditModi

  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

Please describe your question here

I am encountering an issue while deploying Ray Serve Cluster on an EKS cluster following the documentation provided here.

The EKS cluster is created successfully using Terraform, but the Ray Serve Cluster remains in a pending state. When I check the describe output, the root error is as follows:

Normal   Nominated          20m                 karpenter           Pod should schedule on: nodeclaim/default-x9bvr
Normal   Nominated          4m50s               karpenter           Pod should schedule on: nodeclaim/default-shq8s
Warning  FailedScheduling   20s (x5 over 20m)   default-scheduler   0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Normal   NotTriggerScaleUp  6s (x121 over 20m)  cluster-autoscaler  pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector


#### Provide a link to the example/module related to the question

Just followed the documentation provided [here](https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/Llama2). 

#### Additional context

I have successfully created the EKS cluster using Terraform, but the Ray Serve Cluster deployment remains stuck in the pending state. The error seems related to node affinity/selector issues, and the cluster autoscaler is not triggering a scale-up.
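For reference, a minimal sketch of commands to compare what the pending pod is asking for against the labels on the existing nodes (the namespace and pod name below are placeholders; the blueprint deploys into the llama2 namespace):

```bash
# Placeholders: substitute the real namespace and the name of the pending pod
NS=llama2
POD=<pending-pod-name>

# What the pending pod is asking for (nodeSelector and affinity)
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeSelector}'; echo
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.affinity}'; echo

# What labels the existing nodes actually carry
kubectl get nodes --show-labels

# Full event history for the pending pod
kubectl describe pod "$POD" -n "$NS"
```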
@askulkarni2
Collaborator

@sanjeevrg89 can you please take a look? My guess is Karpenter is not able to provision the inf2 instance.
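For anyone debugging this, a minimal sketch of checking whether Karpenter is attempting to launch the inf2 capacity (assumes a Karpenter version that uses NodePools/NodeClaims; resource names and namespaces may differ per blueprint version):

```bash
# NodeClaims Karpenter has created, and details for the one nominated in the events
kubectl get nodeclaims
kubectl describe nodeclaim default-x9bvr   # nodeclaim name taken from the Nominated event above

# Check whether any NodePool allows inf2 instance types/families
kubectl get nodepools -o yaml | grep -i -B 2 -A 10 'instance-family\|instance-type'

# Karpenter controller logs often surface capacity or quota errors for inf2
# (namespace may be karpenter or kube-system depending on the blueprint)
kubectl logs -n karpenter deploy/karpenter | grep -iE 'inf2|insufficient|error' | tail -n 50
```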

@vara-bonthu
Collaborator

@AditModi It's not reproducible on the latest blueprint. Please try again and let us know.

@vara-bonthu vara-bonthu added the bug Something isn't working label Feb 26, 2024
@vara-bonthu vara-bonthu added author-action-required and removed bug Something isn't working labels Apr 14, 2024

This issue has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this issue will be closed in 10 days

@github-actions github-actions bot added the stale label May 15, 2024
@Gall-oDrone

Gall-oDrone commented Jun 14, 2024

I have the same error. I'm following the Gen AI examples, but I'm unable to run even a single Llama2 model. I'm trying to run the models on inf2.8xl (two instances) and inf2.24xl (one instance).
After running `kubectl get all -n llama2`, I get the following:
```
NAME                                            READY   STATUS    RESTARTS   AGE
pod/llama2-raycluster-fcmtr-head-bf58d          0/1     Pending   0          67m
pod/llama2-raycluster-fcmtr-worker-inf2-lgnb2   0/1     Pending   0          5m30s

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
service/llama2             ClusterIP   172.20.118.243   <none>        10001/TCP,8000/TCP,8080/TCP,6379/TCP,8265/TCP   67m
service/llama2-head-svc    ClusterIP   172.20.168.94    <none>        8080/TCP,6379/TCP,8265/TCP,10001/TCP,8000/TCP   57m
service/llama2-serve-svc   ClusterIP   172.20.61.167    <none>        8000/TCP                                        57m

NAME                                        DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY        GPUS   STATUS   AGE
raycluster.ray.io/llama2-raycluster-fcmtr   1                                     92     704565270Ki   0      ready    67m

NAME                       SERVICE STATUS   NUM SERVE ENDPOINTS
rayservice.ray.io/llama2   Restarting       2
```

Observations:

  • I'm using the latest Blueprint version
  • Set both the worker node minimum size and desired size: 2 for inf2.8xl and 1 for inf2.24xl, respectively.
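Since the RayCluster above reports GPUS 0 and the worker pod stays Pending, here is a small sketch of checking whether any inf2 nodes that did come up actually advertise Neuron devices (the `aws.amazon.com/neuron` resource name assumes the Neuron device plugin installed by the blueprint):

```bash
# Which nodes exist and their instance types
kubectl get nodes -L node.kubernetes.io/instance-type

# Allocatable resources per node; inf2 nodes should list aws.amazon.com/neuron here
kubectl describe nodes | grep -A 10 'Allocatable'

# Is the Neuron device plugin running? (daemonset name/namespace may differ per blueprint)
kubectl get pods -A | grep -i neuron
```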

@Gall-oDrone

Gall-oDrone commented Jul 16, 2024

> @AditModi It's not reproducible on the latest blueprint. Please try again and let us know.

Hi, I'm encountering the same issue on the latest blueprint, with each of the following models:

  1. Llama2 on Inferentia
  2. Llama3 on Inferentia
  3. Mistral-7B on Inferentia
  4. RayServe with vLLM: For this one, I created a similar issue: Error while running RayServe with vLLM blueprint template  #589 (comment)

P.S.: Is it possible to run these models on smaller inf2.8xlarge EC2 instances instead of inf2.48xlarge?
AWS tech support is limiting our service quota of vCPUs for inf2 instances.
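For what it's worth, a minimal sketch of inspecting the current inf2-related EC2 quotas with the AWS CLI, matching quota names by substring rather than assuming a specific quota code:

```bash
# List EC2 service quotas whose name mentions Inf instances, with current values
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'Inf')].{Name:QuotaName,Code:QuotaCode,Value:Value}" \
  --output table
```

An increase can then be requested with `aws service-quotas request-service-quota-increase --service-code ec2 --quota-code <code> --desired-value <vCPUs>`.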

@github-actions github-actions bot removed the stale label Aug 22, 2024