[feat] Add P series example nodegroup to jark stack example #465
LGTM, just need clarification around how we should deal with the commented section
@@ -147,5 +147,66 @@ module "eks" {
Name = "gpu-node-grp"
})
}

# # This nodegroup can be used for P4/P5 instances with, or without, a Capacity Reservation.
# #
@alanty is this comment required? We need to make sure we have instructions on when to uncomment this section. We might want to add them to the JARK stack documentation in the website folder.
@lusoal, the JARK stack is currently missing documentation on the DoEKS website. I agree that adding a section to explain how to enable and deploy the P5 nodegroup with EFA and CBR would be beneficial. We can do that as part of the JARK deployment website doc.
LGTM
What does this PR do?
Adds a nodegroup definition for a P5.48xlarge group, with EFA devices, a Capacity Reservation, and RAID0 over the NVMe devices. That nodegroup can be used for a P4 instance with the minor changes noted in the comments.
Closes #370
This also adds the EFA plugin from the data addons. It should be usable on any instance with EFA devices, not just the P series.
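As a rough illustration of the kind of nodegroup described above, the fragment below sketches an entry that would sit inside the `module "eks"` block. It is not the code from this PR. The group name, sizes, the `var.capacity_reservation_id` and `var.capacity_reservation_subnet_id` variables, and the single EFA interface are all assumptions made for the example, and the attribute names assume the `eks_managed_node_groups` input of the terraform-aws-modules EKS module.

```hcl
# Hypothetical sketch: placeholder names and values, not the PR's actual nodegroup.
eks_managed_node_groups = {
  p5 = {
    name           = "p5-node-grp" # assumed name
    ami_type       = "AL2_x86_64_GPU"
    instance_types = ["p5.48xlarge"] # for P4, swap in the matching P4 instance type

    # Assumed sizes; match these to the reserved capacity.
    min_size     = 1
    max_size     = 1
    desired_size = 1

    # The subnet must be in the same Availability Zone as the Capacity Reservation.
    # var.capacity_reservation_subnet_id is a hypothetical variable.
    subnet_ids = [var.capacity_reservation_subnet_id]

    # Target a specific Capacity Reservation; drop this block to run without one.
    # var.capacity_reservation_id is a hypothetical variable.
    capacity_reservation_specification = {
      capacity_reservation_target = {
        capacity_reservation_id = var.capacity_reservation_id
      }
    }

    # One EFA interface shown for brevity. P-series instances expose several
    # network cards; further entries would use their own network_card_index.
    network_interfaces = [
      {
        description                 = "EFA interface 0"
        device_index                = 0
        network_card_index          = 0
        interface_type              = "efa"
        delete_on_termination       = true
        associate_public_ip_address = false
      }
    ]

    # The RAID0 setup over the local NVMe devices would go in
    # pre_bootstrap_user_data; the script itself is omitted from this sketch.
  }
}
```

Keeping the Capacity Reservation in its own block means the same group can be deployed with or without a reservation by removing only that block and loosening the subnet restriction.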
Motivation
Add an example/reference for the P series with the plugins and config for use in the jark stack.
More
website/docs or website/blog section for this feature
pre-commit run -a with this PR. Link for installing pre-commit locally
For Moderators
Additional Notes
The example in this repo looks to be based on CUDA 11 and the NVIDIA 470 drivers. These look to have issues with the P5, CUDA 12 and the NVIDIA 535 drivers. I've got a rough update in progress; we'll work on a different PR for that change.