Ray on EKS - failed to schedule XGBoost benchmark sample job #433

Open
prasanthponnoth opened this issue Feb 14, 2024 · 0 comments
Labels
bug Something isn't working

Description


  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

⚠️ Note

Before you submit an issue, please perform the following for Terraform examples:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the practice you are following!): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

  • Module version [Required]:

  • Terraform version:

  • Provider version(s):

Reproduction Code [Required]

I followed the steps outlined in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray to deploy a Ray cluster for the XGBoost benchmark sample job. The RayCluster head node pods fail to schedule, and the error message is below.

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  2m26s (x5 over 22m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient ephemeral-storage, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
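
The event above packs several facts into one line: no node out of three is schedulable, and every node is short on CPU, ephemeral storage, and memory at the same time. As a minimal sketch (the function name and regular expressions are illustrative assumptions, not part of Kubernetes or the blueprint), the message can be broken down like this:

```python
import re

# Message text copied from the FailedScheduling event above.
MESSAGE = (
    "0/3 nodes are available: 3 Insufficient cpu, "
    "3 Insufficient ephemeral-storage, 3 Insufficient memory. "
    "preemption: 0/3 nodes are available: "
    "3 No preemption victims found for incoming pod."
)


def summarize_scheduling_failure(message: str) -> dict:
    """Extract node counts and per-resource shortfalls from a scheduler message.

    Hypothetical helper for illustration; it only understands the
    "N Insufficient <resource>" fragments shown in the event above.
    """
    head = re.match(r"(\d+)/(\d+) nodes are available", message)
    if head is None:
        raise ValueError("not a recognizable FailedScheduling message")
    # Resource names may contain dots or slashes (extended resources),
    # but a trailing sentence period must not be captured.
    pairs = re.findall(r"(\d+) Insufficient ([\w-]+(?:[./][\w-]+)*)", message)
    return {
        "available": int(head.group(1)),
        "total": int(head.group(2)),
        "insufficient": {resource: int(count) for count, resource in pairs},
    }


if __name__ == "__main__":
    print(summarize_scheduling_failure(MESSAGE))
```

When every existing node reports all three shortfalls, the pod can only be placed on a newly provisioned, larger node, which is why the provisioner is the next thing to inspect.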

It seems that either Karpenter is unable to provision EC2 instances for XGBoost (as Spot Instances?) or the job requires different instance types, as recommended on the Ray page "Ray Train XGBoostTrainer on Kubernetes" (i.e. 10 nodes with 16 CPU and 64 Gi memory per node, m5.4xlarge on AWS).
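
To tell these two causes apart, it helps to compare the pod's resource requests with what a candidate node can actually offer. The following is a rough sketch only: the helper names and all request and allocatable values are hypothetical placeholders (the real ones come from the RayCluster spec and from the nodes themselves), and it ignores resources already claimed by other pods on the node.

```python
# Subset of the suffixes Kubernetes accepts for memory quantities.
_MEM_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '64Gi' or '512M' to bytes."""
    # Two-letter binary suffixes are checked before one-letter decimal ones.
    for suffix in ("Ki", "Mi", "Gi", "Ti", "k", "M", "G", "T"):
        if quantity.endswith(suffix):
            return round(float(quantity[: -len(suffix)]) * _MEM_UNITS[suffix])
    return int(quantity)


def unmet_resources(requests: dict, allocatable: dict) -> list:
    """Return the resources for which the request exceeds the node's allocatable."""
    unmet = []
    if cpu_to_millicores(requests["cpu"]) > cpu_to_millicores(allocatable["cpu"]):
        unmet.append("cpu")
    if memory_to_bytes(requests["memory"]) > memory_to_bytes(allocatable["memory"]):
        unmet.append("memory")
    return unmet


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    pod_requests = {"cpu": "14", "memory": "54Gi"}
    small_node = {"cpu": "3920m", "memory": "14Gi"}
    large_node = {"cpu": "15890m", "memory": "57Gi"}
    print(unmet_resources(pod_requests, small_node))
    print(unmet_resources(pod_requests, large_node))
```

If the requests exceed what any instance type permitted by the provisioner can offer, no amount of scaling will help and the instance types need to change; if they fit, the provisioner's logs are the place to look for why no node was launched.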

In addition, the documentation has one minor typo:

./install .sh contains an extra space that needs to be removed (it should read ./install.sh).

Steps to reproduce the behavior:

Follow the instructions outlined in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray. The EKS cluster is provisioned, but the pods fail to schedule after running terraform apply for the XGBoost example.

Expected behavior

As documented in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray

Actual behavior

Terminal Output Screenshot(s)

Additional context

@askulkarni2 askulkarni2 self-assigned this Feb 14, 2024
@askulkarni2 askulkarni2 added the bug Something isn't working label Feb 14, 2024