[feat] Add P series example nodegroup to jark stack example #465

Merged on Mar 16, 2024 (9 commits)

@alanty (Contributor) commented Mar 12, 2024

What does this PR do?

Adds a nodegroup definition for a P5.48xlarge group with EFA devices, a Capacity Reservation, and RAID0 across the NVMe devices. The same nodegroup can be used for a P4 instance with the minor changes noted in the comments.
Closes #370

This also adds the EFA plugin from the data addons. It should be usable on any instance with EFA devices, not just the P series.
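
For readers without the diff open, a nodegroup of the kind described above might look roughly like the fragment below. This is a minimal sketch, not the definition merged in this PR. It assumes the `eks_managed_node_groups` input of the terraform-aws-modules/eks module; the group key and name, the sizes, and the `var.capacity_reservation_id` and `var.reserved_az_subnet_id` variables are hypothetical placeholders. The RAID0 setup over the NVMe devices is left out.

```hcl
# Fragment intended for an existing `module "eks"` block; not a complete configuration.
# All names and variables below are hypothetical placeholders.
eks_managed_node_groups = {
  p5_example = {
    name           = "p5-node-grp"
    ami_type       = "AL2_x86_64_GPU"
    instance_types = ["p5.48xlarge"] # for a P4 group, e.g. ["p4d.24xlarge"]

    min_size     = 1
    max_size     = 1
    desired_size = 1

    # A Capacity Reservation is tied to a single Availability Zone,
    # so the group is pinned to a subnet in that zone (assumed variable).
    subnet_ids = [var.reserved_az_subnet_id]

    # Target a specific Capacity Reservation (assumed variable).
    capacity_reservation_specification = {
      capacity_reservation_target = {
        capacity_reservation_id = var.capacity_reservation_id
      }
    }

    # EFA interfaces, one entry per network card. Only the first two are shown;
    # further entries would follow the same pattern for the remaining cards.
    network_interfaces = [
      {
        description           = "EFA interface 0"
        delete_on_termination = true
        device_index          = 0
        network_card_index    = 0
        interface_type        = "efa"
      },
      {
        description           = "EFA interface 1"
        delete_on_termination = true
        device_index          = 1
        network_card_index    = 1
        interface_type        = "efa"
      },
    ]
  }
}
```

For a P4 instance, the instance type and the number of EFA entries would differ; the comments in the actual nodegroup definition are the authoritative guide for those changes.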

Motivation

Add an example/reference for the P series with the plugins and config for use in the jark stack.

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

The example in this repo appears to be based on CUDA 11 and the NVIDIA 470 drivers. These appear to have issues with the P5, CUDA 12, and the NVIDIA 535 drivers. I have a rough update in progress; we'll handle that change in a separate PR.

@alanty changed the title from "Add P series example nodegroup to jark stack example" to "[feat] Add P series example nodegroup to jark stack example" on Mar 12, 2024

@lusoal (Contributor) left a comment

LGTM. I just need clarification on how we should handle the commented-out section.

@@ -147,5 +147,66 @@ module "eks" {
Name = "gpu-node-grp"
})
}

# # This nodegroup can be used for P4/P5 instances with, or without, a Capacity Reservation.
# #

@alanty is this comment required? We need to make sure we have instructions on when to uncomment this section. We might want to add them to the JARK stack documentation in the website folder.

@lusoal, the JARK stack is currently missing documentation on the DoEKS website. I agree that adding a section to explain how to enable and deploy the P5 nodegroup with EFA and CBR would be beneficial. We can do that as part of the JARK deployment website doc.

@vara-bonthu requested a review from askulkarni2 on March 13, 2024 22:59

@vara-bonthu (Collaborator) left a comment

LGTM

@vara-bonthu merged commit 7173cd9 into awslabs:main on Mar 16, 2024
52 of 54 checks passed
@alanty deleted the jark-p5 branch on July 29, 2024 19:49
Development

Successfully merging this pull request may close these issues.

GPU node group (P4d/P5) configuration with EFA, CBR and RAID0 for SSD
3 participants