From abf15bc0556028636e5dfb33a92b2ef0a9c170e8 Mon Sep 17 00:00:00 2001
From: EC2 Default User
Date: Wed, 29 May 2024 07:50:33 +0000
Subject: [PATCH 1/2] Fixed issue: Incorrect command to provide Linux
 permission on the AWS Trainium on EKS Blueprint #533

---
 website/docs/blueprints/ai-ml/trainium.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/website/docs/blueprints/ai-ml/trainium.md b/website/docs/blueprints/ai-ml/trainium.md
index 445768c37..0981cd637 100644
--- a/website/docs/blueprints/ai-ml/trainium.md
+++ b/website/docs/blueprints/ai-ml/trainium.md
@@ -134,7 +134,7 @@ If you are executing this script on a Cloud9 IDE/EC2 instance different from the
 
 ```bash
 cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
-chomd +x 1-bert-pretrain-build-image.sh
+chmod +x 1-bert-pretrain-build-image.sh
 ./1-bert-pretrain-build-image.sh
 ```
 
@@ -184,7 +184,7 @@ Execute the following commands.This script prompts the user to configure their k
 
 ```bash
 cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
-chomd +x 2-bert-pretrain-precompile.sh
+chmod +x 2-bert-pretrain-precompile.sh
 ./2-bert-pretrain-precompile.sh
 ```
 
@@ -215,7 +215,7 @@ We are now in the final step of training the BERT-large model with WikiCorpus da
 
 ```bash
 cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
-chomd +x 3-bert-pretrain.sh
+chmod +x 3-bert-pretrain.sh
 ./3-bert-pretrain.sh
 ```
 

From 8ce1db66c7196e5d5e49c4100d19f0abbd79318c Mon Sep 17 00:00:00 2001
From: EC2 Default User
Date: Wed, 29 May 2024 09:03:09 +0000
Subject: [PATCH 2/2] Fixed issue: Incorrect POD name aws-cli-cmd-shell given
 in the instructions. #543

---
 website/docs/blueprints/ai-ml/trainium.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/website/docs/blueprints/ai-ml/trainium.md b/website/docs/blueprints/ai-ml/trainium.md
index 0981cd637..4ae087b79 100644
--- a/website/docs/blueprints/ai-ml/trainium.md
+++ b/website/docs/blueprints/ai-ml/trainium.md
@@ -160,13 +160,13 @@ Login to AWS Console and verify the ECR repo(`.dkr.ecr.
 
 #### Step2: Copy WikiCorpus pre-training dataset for BERT model to FSx for Lustre filesystem
 
-In this step, we make it easy to transfer the WikiCorpus pre-training dataset, which is crucial for training the BERT model in distributed mode by multiple Trainium instances, to the FSx for Lustre filesystem. To achieve this, we will login to `aws-cli-cmd-shell` pod which includes an AWS CLI container, providing access to the filesystem.
+In this step, we make it easy to transfer the WikiCorpus pre-training dataset, which is crucial for training the BERT model in distributed mode by multiple Trainium instances, to the FSx for Lustre filesystem. To achieve this, we will login to `cmd-shell` pod which includes an AWS CLI container, providing access to the filesystem.
 
 Once you're inside the container, Copy the WikiCorpus dataset from S3 bucket (`s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar`). The dataset is then unpacked, giving you access to its contents, ready for use in the subsequent BERT model pre-training process.
 
 ```bash
-kubectl exec -i -t -n default aws-cli-cmd-shell -c app -- sh -c "clear; (bash || ash || sh)"
+kubectl exec -i -t -n default cmd-shell -c app -- sh -c "clear; (bash || ash || sh)"
 
 # Once logged into the container
 yum install tar
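
The final hunk above is cut off after `yum install tar`, while the documentation text it patches describes copying the WikiCorpus tarball from S3 and unpacking it onto the FSx for Lustre filesystem. A minimal sketch of those two steps as they might run inside the `cmd-shell` pod follows; the `/data` mount path is an assumption for illustration, not something stated in the patch:

```bash
# Runs inside the cmd-shell pod, after `yum install tar`.
# ASSUMPTION: the FSx for Lustre filesystem is mounted at /data in this
# container; substitute the actual mount path used by the blueprint.
cd /data

# Copy the tokenized WikiCorpus dataset from the public Neuron S3 bucket
# referenced in the docs above.
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar .

# Unpack the archive so the distributed pre-training job can read it.
tar -xf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
```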