Commit

Merge branch 'main' into spark-updates-v4
alanty authored Oct 30, 2024
2 parents 7de1b22 + 0b5bb56 commit 07b1e42
Showing 15 changed files with 137 additions and 112 deletions.
2 changes: 1 addition & 1 deletion ai-ml/emr-spark-rapids/README.md
@@ -61,7 +61,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="input_enable_nvidia_gpu_operator"></a> [enable\_nvidia\_gpu\_operator](#input\_enable\_nvidia\_gpu\_operator) | Enable NVIDIA GPU Operator | `bool` | `false` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-spark-rapids"` | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br/> "100.64.0.0/16"<br/>]</pre> | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no |

4 changes: 2 additions & 2 deletions ai-ml/nvidia-triton-server/README.md
@@ -79,9 +79,9 @@
| <a name="input_huggingface_token"></a> [huggingface\_token](#input\_huggingface\_token) | Hugging Face Secret Token | `string` | `"DUMMY_TOKEN_REPLACE_ME"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"nvidia-triton-server"` | no |
| <a name="input_ngc_api_key"></a> [ngc\_api\_key](#input\_ngc\_api\_key) | NGC API Key | `string` | `"DUMMY_NGC_API_KEY_REPLACE_ME"` | no |
| <a name="input_nim_models"></a> [nim\_models](#input\_nim\_models) | NVIDIA NIM Models | <pre>list(object({<br> name = string<br> id = string<br> enable = bool<br> num_gpu = string<br> }))</pre> | <pre>[<br> {<br> "enable": false,<br> "id": "nvcr.io/nim/meta/llama-3.1-8b-instruct",<br> "name": "llama-3-1-8b-instruct",<br> "num_gpu": "4"<br> },<br> {<br> "enable": true,<br> "id": "nvcr.io/nim/meta/llama3-8b-instruct",<br> "name": "llama3-8b-instruct",<br> "num_gpu": "1"<br> }<br>]</pre> | no |
| <a name="input_nim_models"></a> [nim\_models](#input\_nim\_models) | NVIDIA NIM Models | <pre>list(object({<br/> name = string<br/> id = string<br/> enable = bool<br/> num_gpu = string<br/> }))</pre> | <pre>[<br/> {<br/> "enable": false,<br/> "id": "nvcr.io/nim/meta/llama-3.1-8b-instruct",<br/> "name": "llama-3-1-8b-instruct",<br/> "num_gpu": "4"<br/> },<br/> {<br/> "enable": true,<br/> "id": "nvcr.io/nim/meta/llama3-8b-instruct",<br/> "name": "llama3-8b-instruct",<br/> "num_gpu": "1"<br/> }<br/>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br/> "100.64.0.0/16"<br/>]</pre> | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no |

## Outputs
117 changes: 39 additions & 78 deletions ai-ml/trainium-inferentia/addons.tf
@@ -106,34 +106,6 @@ module "eks_blueprints_addons" {
values = [templatefile("${path.module}/helm-values/cluster-autoscaler-values.yaml", {})]
}

#---------------------------------------
# Karpenter Autoscaler for EKS Cluster
#---------------------------------------
# NOTE: Karpenter Upgrade
# This Helm Chart addon will only install the CRD during the first installation of the helm chart.
# Subsequent Helm Chart chart upgrades will not add or remove CRDs, even if the CRDs have changed.
# If you need to upgrade the CRDs, you will need to manually run the following commands and ensure that the CRDs are updated before upgrading the Helm Chart.
# READ the guide before applying the CRDs: https://karpenter.sh/preview/upgrade-guide/
# kubectl apply -f https://raw.githubusercontent.com/aws/karpenter/main/pkg/apis/crds/karpenter.sh_provisioners.yaml
# kubectl apply -f https://raw.githubusercontent.com/aws/karpenter/main/pkg/apis/crds/karpenter.sh_machines.yaml
# kubectl apply -f https://raw.githubusercontent.com/aws/karpenter/main/pkg/apis/crds/karpenter.k8s.aws_awsnodetemplates.yaml
#---------------------------------------
#---------------------------------------
# Karpenter Autoscaler for EKS Cluster
#---------------------------------------
enable_karpenter = true
karpenter_enable_spot_termination = true
karpenter_node = {
iam_role_additional_policies = {
AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
}
karpenter = {
chart_version = "0.37.0"
repository_username = data.aws_ecrpublic_authorization_token.token.user_name
repository_password = data.aws_ecrpublic_authorization_token.token.password
}

#---------------------------------------
# Enable FSx for Lustre CSI Driver
#---------------------------------------
@@ -218,23 +190,39 @@ module "eks_blueprints_addons" {
tags = local.tags
}

resource "aws_eks_access_entry" "this" {
cluster_name = module.eks.cluster_name
principal_arn = module.eks_blueprints_addons.karpenter.node_iam_role_arn
type = "EC2_LINUX"
}

#---------------------------------------------------------------
# Data on EKS Kubernetes Addons
#---------------------------------------------------------------
module "eks_data_addons" {
source = "aws-ia/eks-data-addons/aws"
version = "1.33.0" # ensure to update this to the latest/desired version
version = "1.35.0" # ensure to update this to the latest/desired version

oidc_provider_arn = module.eks.oidc_provider_arn

enable_aws_neuron_device_plugin = true

aws_neuron_device_plugin_helm_config = {
# Enable default scheduler
values = [
<<-EOT
devicePlugin:
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- key: aws.amazon.com/neuron
operator: Exists
effect: NoSchedule
- key: hub.jupyter.org/dedicated
operator: Exists
effect: NoSchedule
scheduler:
enabled: true
npd:
enabled: false
EOT
]
}

enable_aws_efa_k8s_device_plugin = true

aws_efa_k8s_device_plugin_helm_config = {
@@ -287,7 +275,7 @@ module "eks_data_addons" {
name: trainium-trn1
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
karpenterRole: ${module.karpenter.node_iam_role_name}
subnetSelectorTerms:
id: ${module.vpc.private_subnets[2]}
securityGroupSelectorTerms:
@@ -300,6 +288,8 @@ module "eks_data_addons" {
volumeType: gp3
encrypted: true
deleteOnTermination: true
amiSelectorTerms:
- alias: al2023@v20241024
nodePool:
labels:
- instanceType: trainium-trn1
@@ -339,7 +329,7 @@ module "eks_data_addons" {
name: inferentia-inf2
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
karpenterRole: ${module.karpenter.node_iam_role_name}
subnetSelectorTerms:
id: ${module.vpc.private_subnets[2]}
securityGroupSelectorTerms:
@@ -352,6 +342,8 @@ module "eks_data_addons" {
volumeType: gp3
encrypted: true
deleteOnTermination: true
amiSelectorTerms:
- alias: al2023@v20241024
nodePool:
labels:
- instanceType: inferentia-inf2
@@ -374,7 +366,7 @@ module "eks_data_addons" {
values: ["amd64"]
- key: "karpenter.sh/capacity-type"
operator: In
values: [ "spot", "on-demand"]
values: [ "on-demand"]
limits:
cpu: 1000
disruption:
@@ -390,19 +382,21 @@ module "eks_data_addons" {
<<-EOT
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
karpenterRole: ${module.karpenter.node_iam_role_name}
subnetSelectorTerms:
id: ${module.vpc.private_subnets[2]}
securityGroupSelectorTerms:
id: ${module.eks.node_security_group_id}
tags:
Name: ${module.eks.cluster_name}-node
blockDevice:
deviceName: /dev/xvda
volumeSize: 200Gi
volumeType: gp3
encrypted: true
deleteOnTermination: true
blockDevice:
deviceName: /dev/xvda
volumeSize: 200Gi
volumeType: gp3
encrypted: true
deleteOnTermination: true
amiSelectorTerms:
- alias: al2023@v20241024
nodePool:
labels:
- instanceType: mixed-x86
@@ -537,36 +531,3 @@ resource "kubectl_manifest" "mpi_operator" {
yaml_body = each.value
depends_on = [module.eks.eks_cluster_id]
}

#---------------------------------------------------------------
# Neuron Scheduler deployment
# The YAML manifest contents for Neuron Scheduler will be replaced in future by Neuron Helm Chart
#---------------------------------------------------------------

data "http" "neuron_scheduler" {
url = "https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_downloads/e739253083129abeaf6f6ad1db7ccb21/my-scheduler.yml"
}

data "kubectl_file_documents" "neuron_scheduler" {
content = data.http.neuron_scheduler.response_body
}

resource "kubectl_manifest" "neuron_scheduler" {
for_each = data.kubectl_file_documents.neuron_scheduler.manifests
yaml_body = each.value
depends_on = [module.eks.eks_cluster_id]
}

data "http" "k8s_neuron_scheduler_eks" {
url = "https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_downloads/e518187532701b6660dcd70ea28c2562/k8s-neuron-scheduler-eks.yml"
}

data "kubectl_file_documents" "k8s_neuron_scheduler_eks" {
content = data.http.k8s_neuron_scheduler_eks.response_body
}

resource "kubectl_manifest" "k8s_neuron_scheduler_eks" {
for_each = data.kubectl_file_documents.k8s_neuron_scheduler_eks.manifests
yaml_body = each.value
depends_on = [module.eks.eks_cluster_id]
}
64 changes: 64 additions & 0 deletions ai-ml/trainium-inferentia/eks.tf
@@ -343,4 +343,68 @@ module "eks" {
})
}
}

tags = merge(local.tags, {
# NOTE - if creating multiple security groups with this module, only tag the
# security group that Karpenter should utilize with the following tag
# (i.e. - at most, only one security group should have this tag in your account)
"karpenter.sh/discovery" = local.name
})
}


################################################################################
# Karpenter Controller & Node IAM roles, SQS Queue, Eventbridge Rules
################################################################################

module "karpenter" {
source = "terraform-aws-modules/eks/aws//modules/karpenter"
version = "~> 20.24"

cluster_name = module.eks.cluster_name
enable_v1_permissions = true

# Use Pod Identity
enable_pod_identity = true
create_pod_identity_association = true

# Used to attach additional IAM policies to the Karpenter node IAM role
node_iam_role_additional_policies = {
AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

tags = local.tags
}

################################################################################
# Karpenter Helm chart
################################################################################

resource "helm_release" "karpenter" {
name = "karpenter"
namespace = "kube-system"
create_namespace = true
repository = "oci://public.ecr.aws/karpenter"
repository_username = data.aws_ecrpublic_authorization_token.token.user_name
repository_password = data.aws_ecrpublic_authorization_token.token.password
chart = "karpenter"
version = "1.0.6"
wait = true

values = [
<<-EOT
settings:
clusterName: ${module.eks.cluster_name}
clusterEndpoint: ${module.eks.cluster_endpoint}
interruptionQueue: ${module.karpenter.queue_name}
serviceAccount:
name: ${module.karpenter.service_account}
EOT
]

lifecycle {
ignore_changes = [
repository_password
]
}
}
4 changes: 2 additions & 2 deletions analytics/terraform/datahub-on-eks/README.md
@@ -46,8 +46,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="input_enable_vpc_endpoints"></a> [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"datahub-on-eks"` | no |
| <a name="input_private_subnet_ids"></a> [private\_subnet\_ids](#input\_private\_subnet\_ids) | Ids for existing private subnets - needed when create\_vpc set to false | `list(string)` | `[]` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.0.0/17",<br> "10.1.128.0/18"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.255.128/26",<br> "10.1.255.192/26"<br>]</pre> | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.0.0/17",<br/> "10.1.128.0/18"<br/>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.255.128/26",<br/> "10.1.255.192/26"<br/>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR - must change to match the cidr of the existing VPC if create\_vpc set to false | `string` | `"10.1.0.0/16"` | no |
4 changes: 2 additions & 2 deletions analytics/terraform/emr-eks-ack/README.md
@@ -54,8 +54,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
|------|-------------|------|---------|:--------:|
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.27"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-ack"` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.0.0/17",<br> "10.1.128.0/18"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.255.128/26",<br> "10.1.255.192/26"<br>]</pre> | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.0.0/17",<br/> "10.1.128.0/18"<br/>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.255.128/26",<br/> "10.1.255.192/26"<br/>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no |
4 changes: 2 additions & 2 deletions analytics/terraform/emr-eks-fargate/README.md
@@ -49,8 +49,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
|------|-------------|------|---------|:--------:|
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.27"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-fargate"` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.0.0/17",<br> "10.1.128.0/18"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.255.128/26",<br> "10.1.255.192/26"<br>]</pre> | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.0.0/17",<br/> "10.1.128.0/18"<br/>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.255.128/26",<br/> "10.1.255.192/26"<br/>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no |
2 changes: 1 addition & 1 deletion analytics/terraform/emr-eks-karpenter/README.md
@@ -89,7 +89,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="input_enable_yunikorn"></a> [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `false` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-karpenter"` | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br/> "100.64.0.0/16"<br/>]</pre> | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no |
