Description
When deploying the Stable Diffusion XL Base Model on Inferentia with Ray Serve, the client receives a 504. The Ray dashboard shows a health check failure for the replicas. The dead actor logs show a kernel crash dump preceded by a memory allocation error, as shown in the attached crash dump.
dump.txt
Steps to reproduce the behavior:
Deploy https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/StableDiffusion-inf2
Expected behavior
Works as documented
Actual behavior
Ray replicas crash with a health check failure.
Additional context
This seems to be an issue with amazon-eks-gpu-node-1.29-v20240729. The deployment was working with amazon-eks-gpu-node-1.29-v20240703. Because we don't pin the AMI in Karpenter, it picked up the new AMI and we started seeing the error. We should go through all our blueprints and pin everything.
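For reference, a minimal sketch of what pinning could look like, assuming the blueprint uses Karpenter's v1beta1 EC2NodeClass API and its amiSelectorTerms field. The node class name, IAM role, and discovery tag below are hypothetical placeholders, not values taken from the blueprint, and the helper functions are illustrative only.

```python
import json

# AMI name taken from the report above: the last build observed to work.
KNOWN_GOOD_AMI = "amazon-eks-gpu-node-1.29-v20240703"


def pinned_node_class(name: str, ami_name: str, role: str, cluster_tag: str) -> dict:
    """Build an EC2NodeClass manifest whose AMI selector names one exact AMI.

    Field names follow the Karpenter v1beta1 EC2NodeClass schema (assumption:
    the cluster runs a Karpenter release that serves this API version).
    """
    discovery = {"tags": {"karpenter.sh/discovery": cluster_tag}}
    return {
        "apiVersion": "karpenter.k8s.aws/v1beta1",
        "kind": "EC2NodeClass",
        "metadata": {"name": name},
        "spec": {
            "amiFamily": "AL2",
            "role": role,
            # An exact name (no wildcard) keeps Karpenter from moving to a newer AMI.
            "amiSelectorTerms": [{"name": ami_name}],
            "subnetSelectorTerms": [discovery],
            "securityGroupSelectorTerms": [discovery],
        },
    }


def is_pinned(manifest: dict) -> bool:
    """Return True if every AMI selector term is an id or a wildcard-free name."""
    terms = manifest.get("spec", {}).get("amiSelectorTerms", [])
    return bool(terms) and all(
        "id" in term or ("name" in term and "*" not in term["name"])
        for term in terms
    )


if __name__ == "__main__":
    # All three arguments after the AMI name are hypothetical placeholders.
    manifest = pinned_node_class(
        name="example-nodeclass",
        ami_name=KNOWN_GOOD_AMI,
        role="example-karpenter-node-role",
        cluster_tag="example-cluster",
    )
    assert is_pinned(manifest)
    # kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

A check like is_pinned could be run over every blueprint's node class to find the ones that still float to the latest AMI.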