-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: JARK stack update with Argo Workflows, Karpenter and KubeRay #487
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the updates and think i follow the example files, we could really use docs on this stack.
ai-ml/jark-stack/terraform/addons.tf
Outdated
enable_kuberay_operator = true | ||
enable_nvidia_device_plugin = true | ||
nvidia_device_plugin_helm_config = { | ||
version = "v0.15.0-rc.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there features we need in the RC release?
imo if we're using the release candidate we should ensure we update this again when the "full" 0.15.x release is out(or better, bump the addon default).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. We will need to keep this up to date on an ongoing basis based on the number of issues we keep encountering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I had to change to the latest version to verify the NVIDIA Device plugin issue that we were facing. It turns out to be an EKS AMI issue but not the NVIDIA device plugin.
I will change this to stable version v0.14.5
for now
it's always better for us to lock the blueprints addons to a specific version to avoid failures with newer releases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use 0.14.5 major release for now instead of release candidate
nvidia_device_plugin_helm_config = { | ||
version = "v0.15.0-rc.2" | ||
name = "nvidia-device-plugin" | ||
values = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor structure thing.
this PR moves the Nvidia plugin values inline at ~188, I added a similar toleration block for the efa device plugin in aws-efa-k8s-device-plugin-values.yaml. Is there a preference or style to follow?
I like having the smaller values blocks inline as it makes things a bit easier to read without having to jump between files. Though the values in a separate file is a bit easier syntax and more like helm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't necessarily have a preference one way or the other. In general the guideline is small changes can go inline, large changes in a separate file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Apoorva. If it's a small config easy to embed into the code otherwise, create a dedicated yaml file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the awesome PR! LGTM!
ai-ml/jark-stack/terraform/addons.tf
Outdated
enable_kuberay_operator = true | ||
enable_nvidia_device_plugin = true | ||
nvidia_device_plugin_helm_config = { | ||
version = "v0.15.0-rc.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. We will need to keep this up to date on an ongoing basis based on the number of issues we keep encountering.
nvidia_device_plugin_helm_config = { | ||
version = "v0.15.0-rc.2" | ||
name = "nvidia-device-plugin" | ||
values = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't necessarily have a preference one way or the other. In general the guideline is small changes can go inline, large changes in a separate file.
securityGroupSelectorTerms: | ||
tags: | ||
Name: ${module.eks.cluster_name}-node | ||
instanceStorePolicy: RAID0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
@@ -13,7 +13,7 @@ variable "region" { | |||
|
|||
variable "eks_cluster_version" { | |||
description = "EKS Cluster version" | |||
default = "1.27" | |||
default = "1.29" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Overall, LGTM as we. Thank you @vara-bonthu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, approving the request
ai-ml/jark-stack/terraform/addons.tf
Outdated
enable_kuberay_operator = true | ||
enable_nvidia_device_plugin = true | ||
nvidia_device_plugin_helm_config = { | ||
version = "v0.15.0-rc.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use 0.14.5 major release for now instead of release candidate
What does this PR do?
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.
NOTE: Website Doc for this deployment will be raised as a Part2 PR
Motivation
More
website/docs
orwebsite/blog
section for this featurepre-commit run -a
with this PR. Link for installing pre-commit locallyFor Moderators
Additional Notes