Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m26s (x5 over 22m) default-scheduler 0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient ephemeral-storage, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
Description
Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration.
If your request is for a new feature, please use the Feature request template.
Before you submit an issue, please perform the following for Terraform examples:
Remove the .terraform directory (! ONLY if state is stored remotely, which hopefully you are following that best practice!): rm -rf .terraform/
terraform init
Versions
Module version [Required]:
Terraform version:
Reproduction Code [Required]
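I followed the steps outlined in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray to deploy a Ray cluster for the XGBoost benchmark sample job. The RayCluster head node pods fail to schedule; the pod events report a FailedScheduling warning (insufficient cpu, ephemeral-storage, and memory on all 3 nodes, with no preemption victims found).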
It seems that either Karpenter is unable to provision EC2 instances for the XGBoost job (as Spot Instances?), or the job requires different instance types, as recommended on the Ray page "Ray Train XGBoostTrainer on Kubernetes" (10 nodes with 16 CPUs and 64 Gi memory per node, e.g. m5.4xlarge on AWS).
In addition, the documentation has a minor typo in one place: the command is written as "./install .sh", with an extra space that needs to be removed (it should be ./install.sh).
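For anyone triaging similar reports, the scheduler message above can be broken down programmatically to see which resources are short on how many nodes. The following is only a minimal sketch: the function name parse_failed_scheduling is hypothetical, and the regular expressions assume the message wording seen in the event above, which is not a stable API and may differ across Kubernetes versions.

```python
import re


def parse_failed_scheduling(message: str) -> dict:
    """Summarize a FailedScheduling message (format assumed from the event above)."""
    head = re.search(r"(\d+)/(\d+) nodes are available", message)
    if head is None:
        raise ValueError("not a FailedScheduling message")
    # Ignore the "preemption:" suffix so only the primary reasons are counted.
    primary = message.split("preemption:")[0]
    # Resource names must end with an alphanumeric character, which drops
    # trailing punctuation such as the period after "memory".
    pairs = re.findall(r"(\d+) Insufficient ([a-z0-9\-/.]*[a-z0-9])", primary)
    return {
        "available": int(head.group(1)),
        "total": int(head.group(2)),
        "insufficient": {resource: int(count) for count, resource in pairs},
    }


message = (
    "0/3 nodes are available: 3 Insufficient cpu, "
    "3 Insufficient ephemeral-storage, 3 Insufficient memory. "
    "preemption: 0/3 nodes are available: "
    "3 No preemption victims found for incoming pod."
)
summary = parse_failed_scheduling(message)
print(summary)
```

For the message in this report, every one of the 3 existing nodes is short on all three resources, which is consistent with no new, larger node having been added for the head pod.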
Steps to reproduce the behavior:
Follow the instructions outlined in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray. The EKS cluster is provisioned, but after terraform apply the pods for the XGBoost example fail to schedule.
Expected behavior
As documented in https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/ray
Actual behavior
Terminal Output Screenshot(s)
Additional context