From 3ca25f429adebb70da6cdf7e3d1a00dd8b22ff93 Mon Sep 17 00:00:00 2001 From: Doruk Ozturk Date: Wed, 13 Mar 2024 18:21:36 -0400 Subject: [PATCH] feat: Bionemo on Eks (#457) Co-authored-by: Vara Bonthu --- ai-ml/bionemo/README.md | 63 ++++ ai-ml/bionemo/addons.tf | 54 +++ ai-ml/bionemo/cleanup.sh | 45 +++ ai-ml/bionemo/eks.tf | 146 ++++++++ .../training/esm1nv_pretrain-job.yaml | 76 ++++ .../examples/training/uniref50-job.yaml | 36 ++ ai-ml/bionemo/fsx-for-lustre.tf | 136 +++++++ .../fsx-for-lustre/fsxlustre-static-pv.yaml | 21 ++ .../fsx-for-lustre/fsxlustre-static-pvc.yaml | 12 + .../fsxlustre-storage-class.yaml | 9 + .../aws-cloudwatch-metrics-values.yaml | 11 + ai-ml/bionemo/install.sh | 34 ++ ai-ml/bionemo/main.tf | 53 +++ ai-ml/bionemo/outputs.tf | 9 + ai-ml/bionemo/variables.tf | 32 ++ ai-ml/bionemo/versions.tf | 37 ++ ai-ml/bionemo/vpc.tf | 57 +++ website/docs/gen-ai/training/Llama2.md | 2 +- website/docs/gen-ai/training/bionemo.md | 344 ++++++++++++++++++ 19 files changed, 1176 insertions(+), 1 deletion(-) create mode 100644 ai-ml/bionemo/README.md create mode 100644 ai-ml/bionemo/addons.tf create mode 100644 ai-ml/bionemo/cleanup.sh create mode 100644 ai-ml/bionemo/eks.tf create mode 100644 ai-ml/bionemo/examples/training/esm1nv_pretrain-job.yaml create mode 100644 ai-ml/bionemo/examples/training/uniref50-job.yaml create mode 100644 ai-ml/bionemo/fsx-for-lustre.tf create mode 100644 ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pv.yaml create mode 100644 ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pvc.yaml create mode 100644 ai-ml/bionemo/fsx-for-lustre/fsxlustre-storage-class.yaml create mode 100755 ai-ml/bionemo/helm-values/aws-cloudwatch-metrics-values.yaml create mode 100644 ai-ml/bionemo/install.sh create mode 100644 ai-ml/bionemo/main.tf create mode 100644 ai-ml/bionemo/outputs.tf create mode 100644 ai-ml/bionemo/variables.tf create mode 100644 ai-ml/bionemo/versions.tf create mode 100644 ai-ml/bionemo/vpc.tf create mode 100644 website/docs/gen-ai/training/bionemo.md diff --git a/ai-ml/bionemo/README.md b/ai-ml/bionemo/README.md new file mode 100644 index 000000000..94583e4eb --- /dev/null +++ b/ai-ml/bionemo/README.md @@ -0,0 +1,63 @@ +## Requirements + +| Name | Version | +|------|---------| +| [terraform](#requirement\_terraform) | >= 1.0.0 | +| [aws](#requirement\_aws) | >= 3.72 | +| [helm](#requirement\_helm) | >= 2.4.1 | +| [http](#requirement\_http) | >= 3.3 | +| [kubectl](#requirement\_kubectl) | >= 1.14 | +| [kubernetes](#requirement\_kubernetes) | >= 2.10 | +| [random](#requirement\_random) | 3.3.2 | + +## Providers + +| Name | Version | +|------|---------| +| [aws](#provider\_aws) | 5.38.0 | +| [http](#provider\_http) | 3.4.1 | +| [kubectl](#provider\_kubectl) | 1.14.0 | + +## Modules + +| Name | Source | Version | +|------|--------|---------| +| [ebs\_csi\_driver\_irsa](#module\_ebs\_csi\_driver\_irsa) | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks | ~> 5.20 | +| [eks](#module\_eks) | terraform-aws-modules/eks/aws | ~> 19.15 | +| [eks\_blueprints\_addons](#module\_eks\_blueprints\_addons) | aws-ia/eks-blueprints-addons/aws | ~> 1.3 | +| [eks\_data\_addons](#module\_eks\_data\_addons) | aws-ia/eks-data-addons/aws | ~> 1.2.3 | +| [fsx\_s3\_bucket](#module\_fsx\_s3\_bucket) | terraform-aws-modules/s3-bucket/aws | ~> 3.0 | +| [vpc](#module\_vpc) | terraform-aws-modules/vpc/aws | ~> 5.0 | + +## Resources + +| Name | Type | +|------|------| +| 
[aws_fsx_data_repository_association.this](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/fsx_data_repository_association) | resource | +| [aws_fsx_lustre_file_system.this](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/fsx_lustre_file_system) | resource | +| [aws_security_group.fsx](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource | +| [kubectl_manifest.mpi_operator](https://registry.terraform.io/providers/gavinbunney/kubectl/latest/docs/resources/manifest) | resource | +| [kubectl_manifest.static_pv](https://registry.terraform.io/providers/gavinbunney/kubectl/latest/docs/resources/manifest) | resource | +| [kubectl_manifest.static_pvc](https://registry.terraform.io/providers/gavinbunney/kubectl/latest/docs/resources/manifest) | resource | +| [kubectl_manifest.storage_class](https://registry.terraform.io/providers/gavinbunney/kubectl/latest/docs/resources/manifest) | resource | +| [aws_availability_zones.available](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/availability_zones) | data source | +| [aws_eks_cluster_auth.this](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/eks_cluster_auth) | data source | +| [http_http.mpi_operator_yaml](https://registry.terraform.io/providers/hashicorp/http/latest/docs/data-sources/http) | data source | +| [kubectl_file_documents.mpi_operator_yaml](https://registry.terraform.io/providers/gavinbunney/kubectl/latest/docs/data-sources/file_documents) | data source | + +## Inputs + +| Name | Description | Type | Default | Required | +|------|-------------|------|---------|:--------:| +| [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.29"` | no | +| [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"bionemo-on-eks"` | no | +| [region](#input\_region) | Region | `string` | `"us-west-2"` | no | +| [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` |
`["100.64.0.0/16"]`
| no | +| [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/21"` | no | + +## Outputs + +| Name | Description | +|------|-------------| +| [configure\_kubectl](#output\_configure\_kubectl) | Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig | +| [eks\_api\_server\_url](#output\_eks\_api\_server\_url) | Your eks API server endpoint | diff --git a/ai-ml/bionemo/addons.tf b/ai-ml/bionemo/addons.tf new file mode 100644 index 000000000..74c152dc4 --- /dev/null +++ b/ai-ml/bionemo/addons.tf @@ -0,0 +1,54 @@ +#--------------------------------------------------------------- +# EKS Blueprints Kubernetes Addons +#--------------------------------------------------------------- +module "eks_blueprints_addons" { + source = "aws-ia/eks-blueprints-addons/aws" + version = "~> 1.3" + + cluster_name = module.eks.cluster_name + cluster_endpoint = module.eks.cluster_endpoint + cluster_version = module.eks.cluster_version + oidc_provider_arn = module.eks.oidc_provider_arn + + #--------------------------------------- + # Amazon EKS Managed Add-ons + #--------------------------------------- + eks_addons = { + coredns = { + preserve = true + } + vpc-cni = { + preserve = true + } + kube-proxy = { + preserve = true + } + } + #--------------------------------------- + # CloudWatch metrics for EKS + #--------------------------------------- + enable_aws_cloudwatch_metrics = true + aws_cloudwatch_metrics = { + values = [templatefile("${path.module}/helm-values/aws-cloudwatch-metrics-values.yaml", {})] + } + + #--------------------------------------- + # Enable FSx for Lustre CSI Driver + #--------------------------------------- + enable_aws_fsx_csi_driver = true + + tags = local.tags + +} + +#--------------------------------------------------------------- +# Data on EKS Kubernetes Addons +#--------------------------------------------------------------- +module "eks_data_addons" { + source = "aws-ia/eks-data-addons/aws" + version = "~> 1.30" # ensure to update this to the latest/desired version + + oidc_provider_arn = module.eks.oidc_provider_arn + enable_nvidia_device_plugin = true + +} diff --git a/ai-ml/bionemo/cleanup.sh b/ai-ml/bionemo/cleanup.sh new file mode 100644 index 000000000..da1fb7c16 --- /dev/null +++ b/ai-ml/bionemo/cleanup.sh @@ -0,0 +1,45 @@ +#!/bin/bash +set -o errexit +set -o pipefail + +targets=( + "module.eks" + "module.vpc" +) + +#------------------------------------------- +# Helpful to delete the stuck in "Terminating" namespaces +# Rerun the cleanup.sh script to detect and delete the stuck resources +#------------------------------------------- +terminating_namespaces=$(kubectl get namespaces --field-selector status.phase=Terminating -o json | jq -r '.items[].metadata.name') + +# If there are no terminating namespaces, exit the script +if [[ -z $terminating_namespaces ]]; then + echo "No terminating namespaces found" +fi + +for ns in $terminating_namespaces; do + echo "Terminating namespace: $ns" + kubectl get namespace $ns -o json | sed 's/"kubernetes"//' | kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f - +done + +for target in "${targets[@]}" +do + terraform destroy -target="$target" -auto-approve + destroy_output=$(terraform destroy -target="$target" -auto-approve 2>&1) + if [[ $? 
-eq 0 && $destroy_output == *"Destroy complete!"* ]]; then + echo "SUCCESS: Terraform destroy of $target completed successfully" + else + echo "FAILED: Terraform destroy of $target failed" + exit 1 + fi +done + +terraform destroy -auto-approve +destroy_output=$(terraform destroy -auto-approve 2>&1) +if [[ $? -eq 0 && $destroy_output == *"Destroy complete!"* ]]; then + echo "SUCCESS: Terraform destroy of all targets completed successfully" +else + echo "FAILED: Terraform destroy of all targets failed" + exit 1 +fi diff --git a/ai-ml/bionemo/eks.tf b/ai-ml/bionemo/eks.tf new file mode 100644 index 000000000..2f2c5d2f1 --- /dev/null +++ b/ai-ml/bionemo/eks.tf @@ -0,0 +1,146 @@ +#--------------------------------------------------------------- +# EKS Cluster +#--------------------------------------------------------------- +module "eks" { + source = "terraform-aws-modules/eks/aws" + version = "~> 19.15" + + cluster_name = local.name + cluster_version = var.eks_cluster_version + cluster_endpoint_public_access = true # if true, Your cluster API server is accessible from the internet. You can, optionally, limit the CIDR blocks that can access the public endpoint. + vpc_id = module.vpc.vpc_id + subnet_ids = module.vpc.private_subnets + manage_aws_auth_configmap = true + + #--------------------------------------- + # Note: This can further restricted to specific required for each Add-on and your application + #--------------------------------------- + # Extend cluster security group rules + cluster_security_group_additional_rules = { + ingress_nodes_ephemeral_ports_tcp = { + description = "Nodes on ephemeral ports" + protocol = "tcp" + from_port = 1025 + to_port = 65535 + type = "ingress" + source_node_security_group = true + } + } + + # Extend node-to-node security group rules + node_security_group_additional_rules = { + ingress_self_all = { + description = "Node to node all ports/protocols" + protocol = "-1" + from_port = 0 + to_port = 0 + type = "ingress" + self = true + } + # Allows Control Plane Nodes to talk to Worker nodes on all ports. Added this to simplify the example and further avoid issues with Add-ons communication with Control plane. + # This can be restricted further to specific port based on the requirement for each Add-on e.g., metrics-server 4443, spark-operator 8080, karpenter 8443 etc. + # Change this according to your security requirements if needed + ingress_cluster_to_node_all_traffic = { + description = "Cluster API to Nodegroup all traffic" + protocol = "-1" + from_port = 0 + to_port = 0 + type = "ingress" + source_cluster_security_group = true + } + } + + eks_managed_node_group_defaults = { + iam_role_additional_policies = { + # Not required, but used in the example to access the nodes to inspect mounted volumes + AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" + } + } + + eks_managed_node_groups = { + # We recommend to have a MNG to place your critical workloads and add-ons + # Then rely on Karpenter to scale your workloads + # You can also make uses on nodeSelector and Taints/tolerations to spread workloads on MNG or Karpenter provisioners + + core_node_group = { + name = "core-node-group" + description = "EKS Core node group for hosting critical add-ons" + # Filtering only Secondary CIDR private subnets starting with "100.". 
+ # Subnet IDs where the nodes/node groups will be provisioned + subnet_ids = compact([for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) : + substr(cidr_block, 0, 4) == "100." ? subnet_id : null] + ) + + min_size = 3 + max_size = 9 + desired_size = 3 + + instance_types = ["m5.xlarge"] + + ebs_optimized = true + block_device_mappings = { + xvda = { + device_name = "/dev/xvda" + ebs = { + volume_size = 100 + volume_type = "gp3" + } + } + } + + labels = { + WorkerType = "ON_DEMAND" + NodeGroupType = "core" + } + + tags = merge(local.tags, { + Name = "core-node-grp", + "karpenter.sh/discovery" = local.name + }) + } + + gpu1 = { + name = "gpu-node-grp" + description = "EKS Node Group to run GPU workloads" + # Filtering only Secondary CIDR private subnets starting with "100.". + # Subnet IDs where the nodes/node groups will be provisioned + subnet_ids = compact([for subnet_id, cidr_block in zipmap(module.vpc.private_subnets, module.vpc.private_subnets_cidr_blocks) : + substr(cidr_block, 0, 4) == "100." ? subnet_id : null] + ) + + ami_type = "AL2_x86_64_GPU" + ami_release_version = "1.29.0-20240213" + min_size = 2 + max_size = 3 + desired_size = 2 + + instance_types = ["p3.16xlarge"] + ebs_optimized = true + block_device_mappings = { + xvda = { + device_name = "/dev/xvda" + ebs = { + volume_size = 200 + volume_type = "gp3" + } + } + } + taints = { + gpu = { + key = "nvidia.com/gpu" + effect = "NO_SCHEDULE" + operator = "EXISTS" + } + } + labels = { + WorkerType = "ON_DEMAND" + eks-node = "gpu" + } + + tags = merge(local.tags, { + Name = "gpu-node-grp", + "karpenter.sh/discovery" = local.name + }) + } + } +} diff --git a/ai-ml/bionemo/examples/training/esm1nv_pretrain-job.yaml b/ai-ml/bionemo/examples/training/esm1nv_pretrain-job.yaml new file mode 100644 index 000000000..36479b99e --- /dev/null +++ b/ai-ml/bionemo/examples/training/esm1nv_pretrain-job.yaml @@ -0,0 +1,76 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: esm1nv-pretraining +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 16 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 80 + nprocPerNode: "8" + pytorchReplicaSpecs: + Worker: + replicas: 16 + template: + metadata: + annotations: + sidecar.istio.io/inject: "false" + spec: + tolerations: + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + volumes: + - name: fsx-pv-storage + persistentVolumeClaim: + claimName: fsx-static-pvc + containers: + - name: pytorch + image: nvcr.io/nvidia/clara/bionemo-framework:1.2 + resources: + limits: + nvidia.com/gpu: 1 + env: + - name: NCCL_DEBUG + value: "INFO" + - name: DATA_PATH + value: "/fsx" + - name: HYDRA_FULL_ERROR + value: "1" + volumeMounts: + - mountPath: "/fsx" + name: fsx-pv-storage + imagePullPolicy: Always + command: + - "python3" + - "-m" + - "torch.distributed.run" + - "/workspace/bionemo/examples/protein/esm1nv/pretrain.py" + - "--config-path=/workspace/bionemo/examples/protein/esm1nv/conf" + - "--config-name=pretrain_small" + - "exp_manager.exp_dir=/fsx/esm1nv-train/esm1nv_pretraining/esm1nv_batch256_gradacc1_nodes2-small/results" + - "exp_manager.create_wandb_logger=False" + - "exp_manager.wandb_logger_kwargs.name=esm1nv_batch256_gradacc1_nodes2-small" + - "exp_manager.wandb_logger_kwargs.project=esm1nv_pretraining" + - "++exp_manager.wandb_logger_kwargs.offline=False" + - "trainer.num_nodes=2" + - "trainer.devices=8" + - 
"trainer.max_steps=1000000" + - "trainer.accumulate_grad_batches=1" + - "trainer.val_check_interval=500" + - "model.micro_batch_size=8" + - "model.tensor_model_parallel_size=1" + - "model.data.dataset_path=/fsx/processed" + - "model.data.dataset.train='x_OP_000..049_CL_'" + - "model.data.dataset.val='x_OP_000..049_CL_'" + - "model.data.dataset.test='x_OP_000..049_CL_'" + - "model.data.index_mapping_dir=/fsx/processed" + - "++model.dwnstr_task_validation.enabled=False" diff --git a/ai-ml/bionemo/examples/training/uniref50-job.yaml b/ai-ml/bionemo/examples/training/uniref50-job.yaml new file mode 100644 index 000000000..babdec124 --- /dev/null +++ b/ai-ml/bionemo/examples/training/uniref50-job.yaml @@ -0,0 +1,36 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: uniref50-download +spec: + ttlSecondsAfterFinished: 100 + template: + spec: + volumes: + - name: fsx-pv-storage + persistentVolumeClaim: + claimName: fsx-static-pvc + containers: + - name: bionemo + image: nvcr.io/nvidia/clara/bionemo-framework:1.2 + resources: + limits: + cpu: 2000m + memory: 4Gi + requests: + cpu: 1000m + memory: 2Gi + env: + - name: DATA_PATH + value: "/fsx" + command: ["/bin/sh", "-c"] + args: + - | + echo "from bionemo.data import UniRef50Preprocess" > /tmp/prepare_uniref50.py + echo "data = UniRef50Preprocess(root_directory='/fsx')" >> /tmp/prepare_uniref50.py + echo "data.prepare_dataset(source='uniprot')" >> /tmp/prepare_uniref50.py + python3 /tmp/prepare_uniref50.py + volumeMounts: + - mountPath: "/fsx" + name: fsx-pv-storage + restartPolicy: Never diff --git a/ai-ml/bionemo/fsx-for-lustre.tf b/ai-ml/bionemo/fsx-for-lustre.tf new file mode 100644 index 000000000..2175461f0 --- /dev/null +++ b/ai-ml/bionemo/fsx-for-lustre.tf @@ -0,0 +1,136 @@ +#--------------------------------------------------------------- +# FSx for Lustre File system Static provisioning +# 1> Create Fsx for Lustre filesystem (Lustre FS storage capacity must be 1200, 2400, or a multiple of 3600) +# 2> Create Storage Class for Filesystem (Cluster scoped) +# 3> Persistent Volume with Hardcoded reference to Fsx for Lustre filesystem with filesystem_id and dns_name (Cluster scoped) +# 4> Persistent Volume claim for this persistent volume will always use the same file system (Namespace scoped) +#--------------------------------------------------------------- +# NOTE: FSx for Lustre file system creation can take up to 10 mins +resource "aws_fsx_lustre_file_system" "this" { + deployment_type = "PERSISTENT_2" + storage_type = "SSD" + per_unit_storage_throughput = "500" # 125, 250, 500, 1000 + storage_capacity = 2400 + + subnet_ids = [module.vpc.private_subnets[0]] + security_group_ids = [aws_security_group.fsx.id] + log_configuration { + level = "WARN_ERROR" + } + tags = merge({ "Name" : "${local.name}-static" }, local.tags) +} + +# This process can take upto 7 mins +resource "aws_fsx_data_repository_association" "this" { + + file_system_id = aws_fsx_lustre_file_system.this.id + data_repository_path = "s3://${module.fsx_s3_bucket.s3_bucket_id}" + file_system_path = "/data" # This directory will be used in Spark podTemplates under volumeMounts as subPath + + s3 { + auto_export_policy { + events = ["NEW", "CHANGED", "DELETED"] + } + + auto_import_policy { + events = ["NEW", "CHANGED", "DELETED"] + } + } +} + +#--------------------------------------------------------------- +# Sec group for FSx for Lustre +#--------------------------------------------------------------- +resource "aws_security_group" "fsx" { + + name = "${local.name}-fsx" + 
description = "Allow inbound traffic from private subnets of the VPC to FSx filesystem" + vpc_id = module.vpc.vpc_id + + ingress { + description = "Allows Lustre traffic between Lustre clients" + cidr_blocks = module.vpc.private_subnets_cidr_blocks + from_port = 1021 + to_port = 1023 + protocol = "tcp" + } + ingress { + description = "Allows Lustre traffic between Lustre clients" + cidr_blocks = module.vpc.private_subnets_cidr_blocks + from_port = 988 + to_port = 988 + protocol = "tcp" + } + tags = local.tags +} + +#--------------------------------------------------------------- +# S3 bucket for DataSync between FSx for Lustre and S3 Bucket +#--------------------------------------------------------------- +#tfsec:ignore:aws-s3-enable-bucket-logging tfsec:ignore:aws-s3-enable-versioning +module "fsx_s3_bucket" { + source = "terraform-aws-modules/s3-bucket/aws" + version = "~> 3.0" + + create_bucket = true + + bucket_prefix = "${local.name}-fsx-" + # For example only - please evaluate for your environment + force_destroy = true + + server_side_encryption_configuration = { + rule = { + apply_server_side_encryption_by_default = { + sse_algorithm = "AES256" + } + } + } +} + +#--------------------------------------------------------------- +# Storage Class - FSx for Lustre +#--------------------------------------------------------------- +resource "kubectl_manifest" "storage_class" { + + yaml_body = templatefile("${path.module}/fsx-for-lustre/fsxlustre-storage-class.yaml", { + subnet_id = module.vpc.private_subnets[0], + security_group_id = aws_security_group.fsx.id + }) + + depends_on = [ + module.eks_blueprints_addons + ] +} + +#--------------------------------------------------------------- +# FSx for Lustre Persistent Volume - Static provisioning +#--------------------------------------------------------------- +resource "kubectl_manifest" "static_pv" { + + yaml_body = templatefile("${path.module}/fsx-for-lustre/fsxlustre-static-pv.yaml", { + filesystem_id = aws_fsx_lustre_file_system.this.id, + dns_name = aws_fsx_lustre_file_system.this.dns_name + mount_name = aws_fsx_lustre_file_system.this.mount_name, + }) + + depends_on = [ + module.eks_blueprints_addons, + kubectl_manifest.storage_class, + aws_fsx_lustre_file_system.this + ] +} + +#--------------------------------------------------------------- +# FSx for Lustre Persistent Volume Claim +#--------------------------------------------------------------- +resource "kubectl_manifest" "static_pvc" { + + yaml_body = templatefile("${path.module}/fsx-for-lustre/fsxlustre-static-pvc.yaml", {}) + + depends_on = [ + module.eks_blueprints_addons, + kubectl_manifest.storage_class, + kubectl_manifest.static_pv, + aws_fsx_lustre_file_system.this + ] +} diff --git a/ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pv.yaml b/ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pv.yaml new file mode 100644 index 000000000..857bdcf3a --- /dev/null +++ b/ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pv.yaml @@ -0,0 +1,21 @@ +--- +apiVersion: v1 +kind: PersistentVolume +metadata: + name: fsx-static-pv +spec: + capacity: + storage: 1000Gi + volumeMode: Filesystem + storageClassName: fsx + accessModes: + - ReadWriteMany + mountOptions: + - flock + persistentVolumeReclaimPolicy: Recycle + csi: + driver: fsx.csi.aws.com + volumeHandle: ${filesystem_id} + volumeAttributes: + dnsname: ${dns_name} + mountname: ${mount_name} diff --git a/ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pvc.yaml b/ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pvc.yaml new file mode 100644 index 
000000000..dddebd66c --- /dev/null +++ b/ai-ml/bionemo/fsx-for-lustre/fsxlustre-static-pvc.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: fsx-static-pvc +spec: + accessModes: + - ReadWriteMany + storageClassName: fsx + resources: + requests: + storage: 1000Gi + volumeName: fsx-static-pv diff --git a/ai-ml/bionemo/fsx-for-lustre/fsxlustre-storage-class.yaml b/ai-ml/bionemo/fsx-for-lustre/fsxlustre-storage-class.yaml new file mode 100644 index 000000000..125fb2478 --- /dev/null +++ b/ai-ml/bionemo/fsx-for-lustre/fsxlustre-storage-class.yaml @@ -0,0 +1,9 @@ +--- +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + name: fsx +provisioner: fsx.csi.aws.com +parameters: + subnetId: ${subnet_id} + securityGroupIds: ${security_group_id} diff --git a/ai-ml/bionemo/helm-values/aws-cloudwatch-metrics-values.yaml b/ai-ml/bionemo/helm-values/aws-cloudwatch-metrics-values.yaml new file mode 100755 index 000000000..ae3c41d44 --- /dev/null +++ b/ai-ml/bionemo/helm-values/aws-cloudwatch-metrics-values.yaml @@ -0,0 +1,11 @@ +resources: + limits: + cpu: 500m + memory: 2Gi + requests: + cpu: 200m + memory: 1Gi + +# This toleration allows Daemonset pod to be scheduled on any node, regardless of their Taints. +tolerations: + - operator: Exists diff --git a/ai-ml/bionemo/install.sh b/ai-ml/bionemo/install.sh new file mode 100644 index 000000000..8430565fc --- /dev/null +++ b/ai-ml/bionemo/install.sh @@ -0,0 +1,34 @@ +#!/bin/bash + +# List of Terraform modules to apply in sequence +targets=( + "module.vpc" + "module.eks" +) + +# Initialize Terraform +echo "Initializing ..." +terraform init --upgrade || echo "\"terraform init\" failed" + +# Apply modules in sequence +for target in "${targets[@]}" +do + echo "Applying module $target..." + apply_output=$(terraform apply -target="$target" -auto-approve 2>&1 | tee /dev/tty) + if [[ ${PIPESTATUS[0]} -eq 0 && $apply_output == *"Apply complete"* ]]; then + echo "SUCCESS: Terraform apply of $target completed successfully" + else + echo "FAILED: Terraform apply of $target failed" + exit 1 + fi +done + +# Final apply to catch any remaining resources +echo "Applying remaining resources..." 
+apply_output=$(terraform apply -auto-approve 2>&1 | tee /dev/tty) +if [[ ${PIPESTATUS[0]} -eq 0 && $apply_output == *"Apply complete"* ]]; then + echo "SUCCESS: Terraform apply of all modules completed successfully" +else + echo "FAILED: Terraform apply of all modules failed" + exit 1 +fi diff --git a/ai-ml/bionemo/main.tf b/ai-ml/bionemo/main.tf new file mode 100644 index 000000000..dd7d220a0 --- /dev/null +++ b/ai-ml/bionemo/main.tf @@ -0,0 +1,53 @@ +provider "aws" { + region = local.region +} + +provider "kubernetes" { + host = module.eks.cluster_endpoint + cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data) + token = data.aws_eks_cluster_auth.this.token +} + +# ECR always authenticates with `us-east-1` region +# Docs -> https://docs.aws.amazon.com/AmazonECR/latest/public/public-registries.html +provider "aws" { + alias = "ecr" + region = "us-east-1" +} + +provider "helm" { + kubernetes { + host = module.eks.cluster_endpoint + cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data) + token = data.aws_eks_cluster_auth.this.token + } +} + +provider "kubectl" { + apply_retry_count = 10 + host = module.eks.cluster_endpoint + cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data) + load_config_file = false + token = data.aws_eks_cluster_auth.this.token +} + +data "aws_availability_zones" "available" {} + +data "aws_eks_cluster_auth" "this" { + name = module.eks.cluster_name +} + +#--------------------------------------------------------------- +# Local variables +#--------------------------------------------------------------- +locals { + name = var.name + region = var.region + vpc_cidr = var.vpc_cidr + azs = slice(data.aws_availability_zones.available.names, 0, 2) + + tags = { + Blueprint = local.name + GithubRepo = "github.com/awslabs/data-on-eks" + } +} diff --git a/ai-ml/bionemo/outputs.tf b/ai-ml/bionemo/outputs.tf new file mode 100644 index 000000000..0f7edf2c1 --- /dev/null +++ b/ai-ml/bionemo/outputs.tf @@ -0,0 +1,9 @@ +output "configure_kubectl" { + description = "Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig" + value = "aws eks --region ${local.region} update-kubeconfig --alias ${module.eks.cluster_name} --name ${module.eks.cluster_name}" +} + +output "eks_api_server_url" { + description = "Your eks API server endpoint" + value = module.eks.cluster_endpoint +} diff --git a/ai-ml/bionemo/variables.tf b/ai-ml/bionemo/variables.tf new file mode 100644 index 000000000..cb8e33168 --- /dev/null +++ b/ai-ml/bionemo/variables.tf @@ -0,0 +1,32 @@ +variable "name" { + description = "Name of the VPC and EKS Cluster" + default = "bionemo-on-eks" + type = string +} + +variable "region" { + description = "Region" + type = string + default = "us-west-2" +} + +variable "eks_cluster_version" { + description = "EKS Cluster version" + default = "1.29" + type = string +} + +# VPC with 2046 IPs (10.1.0.0/21) and 2 AZs +variable "vpc_cidr" { + description = "VPC CIDR" + default = "10.1.0.0/21" + type = string +} + +# RFC6598 range 100.64.0.0/10 +# Note you can only /16 range to VPC. 
You can add multiples of /16 if required +variable "secondary_cidr_blocks" { + description = "Secondary CIDR blocks to be attached to VPC" + default = ["100.64.0.0/16"] + type = list(string) +} diff --git a/ai-ml/bionemo/versions.tf b/ai-ml/bionemo/versions.tf new file mode 100644 index 000000000..c4238403b --- /dev/null +++ b/ai-ml/bionemo/versions.tf @@ -0,0 +1,37 @@ +terraform { + required_version = ">= 1.0.0" + + required_providers { + aws = { + source = "hashicorp/aws" + version = ">= 3.72" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = ">= 2.10" + } + helm = { + source = "hashicorp/helm" + version = ">= 2.4.1" + } + random = { + source = "hashicorp/random" + version = "3.3.2" + } + kubectl = { + source = "gavinbunney/kubectl" + version = ">= 1.14" + } + http = { + source = "hashicorp/http" + version = ">= 3.3" + } + } + + # ## Used for end-to-end testing on project; update to suit your needs + # backend "s3" { + # bucket = "doeks-github-actions-e2e-test-state" + # region = "us-west-2" + # key = "e2e/bionemo/terraform.tfstate" + # } +} diff --git a/ai-ml/bionemo/vpc.tf b/ai-ml/bionemo/vpc.tf new file mode 100644 index 000000000..f63ccbe0c --- /dev/null +++ b/ai-ml/bionemo/vpc.tf @@ -0,0 +1,57 @@ +locals { + # Routable Private subnets only for Private NAT Gateway -> Transit Gateway -> Second VPC for overlapping CIDRs + # e.g., var.vpc_cidr = "10.1.0.0/21" => output: ["10.1.0.0/24", "10.1.1.0/24"] => 256-2 = 254 usable IPs per subnet/AZ + private_subnets = [for k, v in local.azs : cidrsubnet(var.vpc_cidr, 3, k)] + # Routable Public subnets with NAT Gateway and Internet Gateway + # e.g., var.vpc_cidr = "10.1.0.0/21" => output: ["10.1.2.0/26", "10.1.2.64/26"] => 64-2 = 62 usable IPs per subnet/AZ + public_subnets = [for k, v in local.azs : cidrsubnet(var.vpc_cidr, 5, k + 8)] + + database_private_subnets = [for k, v in local.azs : cidrsubnet(var.vpc_cidr, 3, k + 5)] + # RFC6598 range 100.64.0.0/16 for EKS Data Plane for two subnets(32768 IPs per Subnet) across two AZs for EKS Control Plane ENI + Nodes + Pods + # e.g., var.secondary_cidr_blocks = "100.64.0.0/16" => output: ["100.64.0.0/17", "100.64.128.0/17"] => 32768-2 = 32766 usable IPs per subnet/AZ + secondary_ip_range_private_subnets = [for k, v in local.azs : cidrsubnet(element(var.secondary_cidr_blocks, 0), 1, k)] +} + +#--------------------------------------------------------------- +# VPC +#--------------------------------------------------------------- + +module "vpc" { + source = "terraform-aws-modules/vpc/aws" + version = "~> 5.0" + + name = local.name + cidr = local.vpc_cidr + azs = local.azs + + # Secondary CIDR block attached to VPC for EKS Control Plane ENI + Nodes + Pods + secondary_cidr_blocks = var.secondary_cidr_blocks + + # Two private Subnets with RFC1918 private IPv4 address range for Private NAT + NLB + private_subnets = concat(local.private_subnets, local.secondary_ip_range_private_subnets) + + # ------------------------------ + # Optional Public Subnets for NAT and IGW for PoC/Dev/Test environments + # Public Subnets can be disabled while deploying to Production and use Private NAT + TGW + public_subnets = local.public_subnets + + # ------------------------------ + # Private Subnets for MLflow backend store + database_subnets = local.database_private_subnets + create_database_subnet_group = true + create_database_subnet_route_table = true + + enable_nat_gateway = true + single_nat_gateway = true + enable_dns_hostnames = true + + public_subnet_tags = { + "kubernetes.io/role/elb" = 1 + } + + 
private_subnet_tags = { + "kubernetes.io/role/internal-elb" = 1 + } + + tags = local.tags +} diff --git a/website/docs/gen-ai/training/Llama2.md b/website/docs/gen-ai/training/Llama2.md index adde58305..666135746 100644 --- a/website/docs/gen-ai/training/Llama2.md +++ b/website/docs/gen-ai/training/Llama2.md @@ -1,6 +1,6 @@ --- title: Llama-2 on Trainium -sidebar_position: 1 +sidebar_position: 2 --- import CollapsibleContent from '../../../src/components/CollapsibleContent'; diff --git a/website/docs/gen-ai/training/bionemo.md b/website/docs/gen-ai/training/bionemo.md new file mode 100644 index 000000000..0e0a7e574 --- /dev/null +++ b/website/docs/gen-ai/training/bionemo.md @@ -0,0 +1,344 @@ +--- +sidebar_position: 3 +sidebar_label: BioNeMo on EKS +--- +import CollapsibleContent from '../../../src/components/CollapsibleContent'; + +# BioNeMo on EKS + +:::caution +This blueprint should be considered as experimental and should only be used for proof of concept. +::: + + +## Introduction + +[NVIDIA BioNeMo](https://www.nvidia.com/en-us/clara/bionemo/) is a generative AI platform for drug discovery that simplifies and accelerates the training of models using your own data and scaling the deployment of models for drug discovery applications. BioNeMo offers the quickest path to both AI model development and deployment, accelerating the journey to AI-powered drug discovery. It has a growing community of users and contributors, and is actively maintained and developed by the NVIDIA. + +Given its containerized nature, BioNeMo finds versatility in deployment across various environments such as Amazon Sagemaker, AWS ParallelCluster, Amazon ECS, and Amazon EKS. This solution, however, zeroes in on the specific deployment of BioNeMo on Amazon EKS. + +*Source: https://blogs.nvidia.com/blog/bionemo-on-aws-generative-ai-drug-discovery/* + +## Deploying BioNeMo on Kubernetes + +This blueprint leverages three major components for its functionality. The NVIDIA Device Plugin facilitates GPU usage, FSx stores training data, and the Kubeflow Training Operator manages the actual training process. + +1) [**Kubeflow Training Operator**](https://www.kubeflow.org/docs/components/training/) +2) [**NVIDIA Device Plugin**](https://github.com/NVIDIA/k8s-device-plugin) +3) [**FSx for Lustre CSI Driver**](https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html) + + +In this blueprint, we will deploy an Amazon EKS cluster and execute both a data preparation job and a distributed model training job. + +Pre-requisites}> + +Ensure that you have installed the following tools on your local machine or the machine you are using to deploy the Terraform blueprint, such as Mac, Windows, or Cloud9 IDE: + +1. [aws cli](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) +2. [kubectl](https://Kubernetes.io/docs/tasks/tools/) +3. [terraform](https://learn.hashicorp.com/tutorials/terraform/install-cli) + + + +Deploy the blueprint}> + +#### Clone the repository + +First, clone the repository containing the necessary files for deploying the blueprint. Use the following command in your terminal: + +```bash +git clone https://github.com/awslabs/data-on-eks.git +``` + +#### Initialize Terraform + +Navigate into the directory specific to the blueprint you want to deploy. 
In this case, we're interested in the BioNeMo blueprint, so navigate to the appropriate directory using the terminal: + +```bash +cd data-on-eks/ai-ml/bionemo +``` + +#### Run the install script + +Use the provided helper script `install.sh` to run the terraform init and apply commands. By default the script deploys EKS cluster to `us-west-2` region. Update `variables.tf` to change the region. This is also the time to update any other input variables or make any other changes to the terraform template. + + +```bash +./install .sh +``` + +Update local kubeconfig so we can access kubernetes cluster + +```bash +aws eks update-kubeconfig --name bionemo-on-eks #or whatever you used for EKS cluster name +``` + +Since there is no helm chart for Training Operator, we have to manually install the package. If a helm chart gets build by training-operator team, we +will incorporate it to the terraform-aws-eks-data-addons repository. + +#### Install Kubeflow Training Operator +```bash +kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0" +``` + + + +Verify Deployment}> + +First, lets verify that we have worker nodes running in the cluster. + +```bash +kubectl get nodes +``` +```bash +NAME STATUS ROLES AGE VERSION +ip-100-64-180-114.us-west-2.compute.internal Ready 17m v1.29.0-eks-5e0fdde +ip-100-64-19-70.us-west-2.compute.internal Ready 16m v1.29.0-eks-5e0fdde +ip-100-64-205-93.us-west-2.compute.internal Ready 17m v1.29.0-eks-5e0fdde +ip-100-64-235-15.us-west-2.compute.internal Ready 16m v1.29.0-eks-5e0fdde +ip-100-64-34-75.us-west-2.compute.internal Ready 17m v1.29.0-eks-5e0fdde +... +``` + +Next, lets verify all the pods are running. + +```bash +kubectl get pods -A +``` + +```bash +NAMESPACE NAME READY STATUS RESTARTS AGE +amazon-cloudwatch aws-cloudwatch-metrics-4g9dm 1/1 Running 0 15m +amazon-cloudwatch aws-cloudwatch-metrics-4ktjc 1/1 Running 0 15m +amazon-cloudwatch aws-cloudwatch-metrics-5hj96 1/1 Running 0 15m +amazon-cloudwatch aws-cloudwatch-metrics-k84p5 1/1 Running 0 15m +amazon-cloudwatch aws-cloudwatch-metrics-rkt8f 1/1 Running 0 15m +kube-system aws-node-4pnpr 2/2 Running 0 15m +kube-system aws-node-jrksf 2/2 Running 0 15m +kube-system aws-node-lv7vn 2/2 Running 0 15m +kube-system aws-node-q7cp9 2/2 Running 0 14m +kube-system aws-node-zplq5 2/2 Running 0 14m +kube-system coredns-86bd649884-8kwn9 1/1 Running 0 15m +kube-system coredns-86bd649884-bvltg 1/1 Running 0 15m +kube-system fsx-csi-controller-85d9ddfbff-7hgmn 4/4 Running 0 16m +kube-system fsx-csi-controller-85d9ddfbff-lp28p 4/4 Running 0 16m +kube-system fsx-csi-node-2tfgq 3/3 Running 0 16m +kube-system fsx-csi-node-jtdd6 3/3 Running 0 16m +kube-system fsx-csi-node-kj6tz 3/3 Running 0 16m +kube-system fsx-csi-node-pwp5x 3/3 Running 0 16m +kube-system fsx-csi-node-rl59r 3/3 Running 0 16m +kube-system kube-proxy-5nbms 1/1 Running 0 15m +kube-system kube-proxy-dzjxz 1/1 Running 0 15m +kube-system kube-proxy-j9bnp 1/1 Running 0 15m +kube-system kube-proxy-p8xwq 1/1 Running 0 15m +kube-system kube-proxy-pgqbb 1/1 Running 0 15m +kubeflow training-operator-64c768746c-l5fbq 1/1 Running 0 24s +nvidia-device-plugin neuron-device-plugin-gpu-feature-discovery-g4xx9 1/1 Running 0 15m +nvidia-device-plugin neuron-device-plugin-gpu-feature-discovery-ggwjm 1/1 Running 0 15m +nvidia-device-plugin neuron-device-plugin-node-feature-discovery-master-68bc46c9dbw8 1/1 Running 0 16m +nvidia-device-plugin neuron-device-plugin-node-feature-discovery-worker-6b94s 1/1 Running 0 16m +nvidia-device-plugin 
neuron-device-plugin-node-feature-discovery-worker-7jzsn 1/1 Running 0 16m +nvidia-device-plugin neuron-device-plugin-node-feature-discovery-worker-kt9fd 1/1 Running 0 16m +nvidia-device-plugin neuron-device-plugin-node-feature-discovery-worker-vlpdp 1/1 Running 0 16m +nvidia-device-plugin neuron-device-plugin-node-feature-discovery-worker-wwnk6 1/1 Running 0 16m +nvidia-device-plugin neuron-device-plugin-nvidia-device-plugin-mslxx 1/1 Running 0 15m +nvidia-device-plugin neuron-device-plugin-nvidia-device-plugin-phw2j 1/1 Running 0 15m +... +``` +:::info +Make sure training-operator, nvidia-device-plugin and fsx-csi-controller pods are running and healthy. + +::: + + + +### Run BioNeMo Training jobs + +Once you've ensured that all components are functioning properly, you can proceed to submit jobs to your clusters. + +#### Step1: Initiate the Uniref50 Data Preparation Task + +The first task, named the `uniref50-job.yaml`, involves downloading and partitioning the data to enhance processing efficiency. This task specifically retrieves the `uniref50 dataset` and organizes it within the FSx for Lustre Filesystem. This structured layout is designed for training, testing, and validation purposes. You can learn more about the uniref dataset [here](https://www.uniprot.org/help/uniref). + +To execute this job, navigate to the `examples\training` directory and deploy the `uniref50-job.yaml` manifest using the following commands: + +```bash +cd examples/training +kubectl apply -f uniref50-job.yaml +``` + +:::info + +It's important to note that this task requires a significant amount of time, typically ranging from 50 to 60 hours. + +::: + +Run the below command to look for the pod `uniref50-download-*` + +```bash +kubectl get pods +``` + +To verify its progress, examine the logs generated by the corresponding pod: + +```bash +kubectl logs uniref50-download-xnz42 + +[NeMo I 2024-02-26 23:02:20 preprocess:289] Download and preprocess of UniRef50 data does not currently use GPU. Workstation or CPU-only instance recommended. +[NeMo I 2024-02-26 23:02:20 preprocess:115] Data processing can take an hour or more depending on system resources. +[NeMo I 2024-02-26 23:02:20 preprocess:117] Downloading file from https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz... +[NeMo I 2024-02-26 23:02:20 preprocess:75] Downloading file to /fsx/raw/uniref50.fasta.gz... +[NeMo I 2024-02-26 23:08:33 preprocess:89] Extracting file to /fsx/raw/uniref50.fasta... +[NeMo I 2024-02-26 23:12:46 preprocess:311] UniRef50 data processing complete. +[NeMo I 2024-02-26 23:12:46 preprocess:313] Indexing UniRef50 dataset. +[NeMo I 2024-02-26 23:16:21 preprocess:319] Writing processed dataset files to /fsx/processed... +[NeMo I 2024-02-26 23:16:21 preprocess:255] Creating train split... +``` + + +After finishing this task, the processed dataset will be saved in the `/fsx/processed` directory. Once this is done, we can move forward and start the `pre-training` job by running the following command: + +Following this, we can proceed to execute the pre-training job by running: + +In this PyTorchJob YAML, the command `python3 -m torch.distributed.run` plays a crucial role in orchestrating **distributed training** across multiple worker pods in your Kubernetes cluster. + +It handles the following tasks: + +1. Initializes a distributed backend (e.g., c10d, NCCL) for communication between worker processes.In our example it's using c10d. 
This is a commonly used distributed backend in PyTorch that can leverage different communication mechanisms like TCP or Infiniband depending on your environment. +2. Sets up environment variables to enable distributed training within your training script. +3. Launches your training script on all worker pods, ensuring each process participates in the distributed training. + + +```bash +cd examples/training +kubectl apply -f esm1nv_pretrain-job.yaml +``` + +Run the below command to look for the pods `esm1nv-pretraining-worker-*` + +```bash +kubectl get pods +``` + +```bash +NAME READY STATUS RESTARTS AGE +esm1nv-pretraining-worker-0 1/1 Running 0 13m +esm1nv-pretraining-worker-1 1/1 Running 0 13m +esm1nv-pretraining-worker-10 1/1 Running 0 13m +esm1nv-pretraining-worker-11 1/1 Running 0 13m +esm1nv-pretraining-worker-12 1/1 Running 0 13m +esm1nv-pretraining-worker-13 1/1 Running 0 13m +esm1nv-pretraining-worker-14 1/1 Running 0 13m +esm1nv-pretraining-worker-15 1/1 Running 0 13m +esm1nv-pretraining-worker-2 1/1 Running 0 13m +esm1nv-pretraining-worker-3 1/1 Running 0 13m +esm1nv-pretraining-worker-4 1/1 Running 0 13m +esm1nv-pretraining-worker-5 1/1 Running 0 13m +esm1nv-pretraining-worker-6 1/1 Running 0 13m +esm1nv-pretraining-worker-7 1/1 Running 0 13m +esm1nv-pretraining-worker-8 1/1 Running 0 13m +esm1nv-pretraining-worker-9 1/1 Running 0 13m +``` + +We should see 16 pods running. We chose p3.16xlarge instances and each instance has 8 GPUs. In the pod definition we specified each job will leverage 1 gpu. +Since we set up "nprocPerNode" to "8", each node will be responsible for 8 jobs. Since we have 2 nodes, total of 16 pods will start. For more details around distributed pytorch training see [pytorch docs](https://pytorch.org/docs/stable/distributed.html). + +:::info +This training job can run for at least 3-4 days with 2 p3.16xlarge nodes. +::: + +This configuration utilizes Kubeflow's PyTorch training Custom Resource Definition (CRD). Within this manifest, various parameters are available for customization. For detailed insights into each parameter and guidance on fine-tuning, you can refer to [BioNeMo's documentation](https://docs.nvidia.com/bionemo-framework/latest/notebooks/model_training_esm1nv.html). + +:::info +Based on the Kubeflow training operator documentation, if you do not specify the master replica pod explicitly, the first worker replica pod(worker-0) will be treated as the master pod. +::: + +To track the progress of this process, follow these steps: + +```bash +kubectl logs esm1nv-pretraining-worker-0 + +Epoch 0: 7%|▋ | 73017/1017679 [00:38<08:12, 1918.0% +``` + +Additionally, to monitor the usage of the GPUs, you have the option to connect to your nodes through the EC2 console using Session Manager and run `nvidia-smi` command. If you want to have a more robust observability, you can refer to the [DCGM Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). + + +```bash +sh-4.2$ nvidia-smi +Thu Mar 7 16:31:01 2024 ++---------------------------------------------------------------------------------------+ +| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 | +|-----------------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+======================+======================| +| 0 Tesla V100-SXM2-16GB On | 00000000:00:17.0 Off | 0 | +| N/A 51C P0 80W / 300W | 3087MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +| 1 Tesla V100-SXM2-16GB On | 00000000:00:18.0 Off | 0 | +| N/A 44C P0 76W / 300W | 3085MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +| 2 Tesla V100-SXM2-16GB On | 00000000:00:19.0 Off | 0 | +| N/A 43C P0 77W / 300W | 3085MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +| 3 Tesla V100-SXM2-16GB On | 00000000:00:1A.0 Off | 0 | +| N/A 52C P0 77W / 300W | 3085MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +| 4 Tesla V100-SXM2-16GB On | 00000000:00:1B.0 Off | 0 | +| N/A 49C P0 79W / 300W | 3085MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +| 5 Tesla V100-SXM2-16GB On | 00000000:00:1C.0 Off | 0 | +| N/A 44C P0 74W / 300W | 3085MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +| 6 Tesla V100-SXM2-16GB On | 00000000:00:1D.0 Off | 0 | +| N/A 44C P0 78W / 300W | 3085MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +| 7 Tesla V100-SXM2-16GB On | 00000000:00:1E.0 Off | 0 | +| N/A 50C P0 79W / 300W | 3085MiB / 16384MiB | 100% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ + ++---------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=======================================================================================| +| 0 N/A N/A 1552275 C /usr/bin/python3 3084MiB | +| 1 N/A N/A 1552277 C /usr/bin/python3 3082MiB | +| 2 N/A N/A 1552278 C /usr/bin/python3 3082MiB | +| 3 N/A N/A 1552280 C /usr/bin/python3 3082MiB | +| 4 N/A N/A 1552279 C /usr/bin/python3 3082MiB | +| 5 N/A N/A 1552274 C /usr/bin/python3 3082MiB | +| 6 N/A N/A 1552273 C /usr/bin/python3 3082MiB | +| 7 N/A N/A 1552276 C /usr/bin/python3 3082MiB | ++---------------------------------------------------------------------------------------+ +``` + + +#### Benefits of Distributed Training: + +By distributing the training workload across multiple GPUs in your worker pods, you can train large models faster by leveraging the combined computational power of all GPUs. Handle larger datasets that might not fit on a single GPU's memory. + +#### Conclusion +BioNeMo stands as a formidable generative AI tool tailored for the realm of drug discovery. In this illustrative example, we took the initiative to pretrain a custom model entirely from scratch, utilizing the extensive uniref50 dataset. However, it's worth noting that BioNeMo offers the flexibility to expedite the process by employing pretrained models directly [provided by NVidia](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/containers/bionemo-framework). 
This alternative approach can significantly streamline your workflow while maintaining the robust capabilities of the BioNeMo framework.
+
+
+<CollapsibleContent header={<h2><span>Cleanup</span></h2>}>
+
+Use the provided helper script `cleanup.sh` to tear down the EKS cluster and other AWS resources.
+
+```bash
+cd ../../
+./cleanup.sh
+```
+
+</CollapsibleContent>
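+
+If `cleanup.sh` reports failures, some resources may still be left behind (for example, namespaces stuck in `Terminating`, or the FSx for Lustre file system still attached to the VPC). The commands below are a minimal, optional sanity check before re-running the script; they assume the blueprint defaults (`us-west-2` region, `bionemo-on-eks` cluster name, and the `Blueprint` tag applied by this Terraform code) and should be adjusted if you changed `variables.tf`.
+
+```bash
+# Namespaces stuck in Terminating; cleanup.sh clears their finalizers on a re-run
+kubectl get namespaces --field-selector status.phase=Terminating
+
+# FSx for Lustre file systems created by this blueprint (expect an empty list after cleanup)
+aws fsx describe-file-systems --region us-west-2 \
+  --query "FileSystems[?Tags[?Key=='Blueprint' && Value=='bionemo-on-eks']].FileSystemId"
+
+# EKS cluster status (a ResourceNotFoundException means the cluster was deleted)
+aws eks describe-cluster --name bionemo-on-eks --region us-west-2 --query "cluster.status"
+```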