Ray on EKS - failed to schedule XGBoost benchmark sample job #433

prasanthponnoth · 2024-02-14T16:47:09Z

Description

Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration.

If your request is for a new feature, please use the Feature request template.

[x] ✋ I have searched the open/closed issues and my issue is not listed.

⚠️ Note

Before you submit an issue, please perform the following for Terraform examples:

Remove the local .terraform directory (! ONLY if state is stored remotely, which hopefully you are following that best practice!): rm -rf .terraform/
Re-initialize the project root to pull down modules: terraform init
Re-attempt your terraform plan or apply and check if the issue still persists

Versions

Module version [Required]:
Terraform version:

Provider version(s):

Reproduction Code [Required]

I followed the steps outlined in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray to deploy a Ray cluster for the XGBoost benchmark sample job. The RayCluster head node pods fail to schedule; the error message is below.

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  2m26s (x5 over 22m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient ephemeral-storage, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

It seems that either Karpenter is unable to provision EC2 instances for the XGBoost job (as Spot Instances?), or the job requires different instance types, as recommended in the Ray page "Ray Train XGBoostTrainer on Kubernetes" (i.e. 10 nodes with 6 CPUs and 64 Gi of memory per node, m5.4xlarge on AWS).
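To make the scheduler event easier to read, here is a minimal Python sketch that pulls the per-resource shortfalls out of the FailedScheduling message. The message string is copied from the event above; the function name `insufficient_resources` and the regular expression are my own illustration, not part of Kubernetes, Ray, or the blueprint.

```python
import re

# FailedScheduling message copied verbatim from the pod events above.
MESSAGE = (
    "0/3 nodes are available: 3 Insufficient cpu, "
    "3 Insufficient ephemeral-storage, 3 Insufficient memory. "
    "preemption: 0/3 nodes are available: "
    "3 No preemption victims found for incoming pod."
)


def insufficient_resources(message: str) -> dict[str, int]:
    """Map each resource in 'N Insufficient <resource>' to the node count N.

    Hypothetical helper for illustration; the pattern assumes the message
    format shown in this issue.
    """
    pattern = re.compile(r"(\d+) Insufficient ([A-Za-z0-9./-]+)")
    # Strip a trailing sentence period that the character class may capture.
    return {name.rstrip("."): int(count) for count, name in pattern.findall(message)}


shortfalls = insufficient_resources(MESSAGE)
for resource, nodes in shortfalls.items():
    print(f"{resource}: insufficient on {nodes} node(s)")
```

All three existing nodes are short on cpu, memory, and ephemeral-storage at once, which suggests the head pod's requests exceed what any current node offers and that no new, larger node was added for it.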

In addition, the documentation has a minor typo in one place:

(./install .sh contains an extra space that needs to be removed)

Steps to reproduce the behavior:

Follow the instructions outlined in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray. terraform apply provisions the EKS cluster, but the pods for the XGBoost job then fail to schedule.

Expected behavior

As documented in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray

Actual behavior

Terminal Output Screenshot(s)

Additional context


askulkarni2 self-assigned this Feb 14, 2024

askulkarni2 added the bug Something isn't working label Feb 14, 2024
