fix: Incorrect command to provide Linux permission on the AWS Trainium on EKS Blueprint #533 #542

Merged: 2 commits, May 29, 2024

website/docs/blueprints/ai-ml/trainium.md: 10 changes (5 additions, 5 deletions)

@@ -134,7 +134,7 @@ If you are executing this script on a Cloud9 IDE/EC2 instance different from the

```bash
cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
-chomd +x 1-bert-pretrain-build-image.sh
+chmod +x 1-bert-pretrain-build-image.sh
./1-bert-pretrain-build-image.sh
```

@@ -160,13 +160,13 @@ Login to AWS Console and verify the ECR repo (`<YOUR_ACCOUNT_ID>.dkr.ecr.<REGION>

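As an alternative to the console, the sketch below assumes you just want a quick CLI confirmation that the repository was created; the `--region` value is a placeholder for your own region.

```bash
# List the ECR repositories in the account/region and print their URIs (illustrative check only).
aws ecr describe-repositories --region <REGION> --query "repositories[].repositoryUri" --output table
```
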
#### Step2: Copy WikiCorpus pre-training dataset for BERT model to FSx for Lustre filesystem

-In this step, we make it easy to transfer the WikiCorpus pre-training dataset, which is crucial for training the BERT model in distributed mode by multiple Trainium instances, to the FSx for Lustre filesystem. To achieve this, we will login to `aws-cli-cmd-shell` pod which includes an AWS CLI container, providing access to the filesystem.
+In this step, we make it easy to transfer the WikiCorpus pre-training dataset, which is crucial for training the BERT model in distributed mode by multiple Trainium instances, to the FSx for Lustre filesystem. To achieve this, we will login to `cmd-shell` pod which includes an AWS CLI container, providing access to the filesystem.

Once you're inside the container, copy the WikiCorpus dataset from the S3 bucket (`s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar`). The dataset is then unpacked, giving you access to its contents, ready for use in the subsequent BERT model pre-training process.


```bash
-kubectl exec -i -t -n default aws-cli-cmd-shell -c app -- sh -c "clear; (bash || ash || sh)"
+kubectl exec -i -t -n default cmd-shell -c app -- sh -c "clear; (bash || ash || sh)"

# Once logged into the container
yum install tar
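
# Illustrative continuation (not part of the original diff): copy the tarball from the
# S3 bucket referenced above onto the FSx for Lustre mount and unpack it.
# <FSX_MOUNT_PATH> is a placeholder for the filesystem mount point inside the container.
cd <FSX_MOUNT_PATH>
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar .
tar -xf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
```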
@@ -184,7 +184,7 @@ Execute the following commands. This script prompts the user to configure their k

```bash
cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
-chomd +x 2-bert-pretrain-precompile.sh
+chmod +x 2-bert-pretrain-precompile.sh
./2-bert-pretrain-precompile.sh
```

@@ -215,7 +215,7 @@ We are now in the final step of training the BERT-large model with WikiCorpus da

```bash
cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
-chomd +x 3-bert-pretrain.sh
+chmod +x 3-bert-pretrain.sh
./3-bert-pretrain.sh
```

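If you want to follow the run from the CLI, the commands below are only an illustrative sketch; `<training-pod-name>` is a placeholder for whatever `kubectl get pods` actually reports.

```bash
# Watch the training pods come up, then tail the logs of one of them (illustrative only).
kubectl get pods -n default -w
kubectl logs -n default -f <training-pod-name>
```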