This project uses Ray to manage AWS EC2 instances for distributed computing. Following up on our previous work on Ray, we use DeepSpeed's ZeRO Stage 3 optimizer to fine-tune a model on a Ray cluster. Tested with Python 3.11.
-
Clone the repository:
git clone ray-distributed-compute.git
cd ray-distributed-compute
-
Install Ray:
pip install ray
-
Install DeepSpeed:
pip install deepspeed
-
Configure AWS CLI:
aws configure
-
Find your AWS account ID:
aws sts get-caller-identity --query Account --output text
Note down the account ID. You will substitute it for <your-account-id> in the steps below.
-
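Create the IAM role that will later receive full S3 access:
aws iam create-role --role-name ray-s3-fullaccess --assume-role-policy-document file://trust-policy.json
This command expects a trust-policy.json file in the current directory that allows EC2 to assume the role.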
-
Create an S3 bucket:
aws s3api create-bucket --bucket ray-bucket-model-output --region eu-central-1 --create-bucket-configuration LocationConstraint=eu-central-1
-
Attach the S3 full access policy to the role:
aws iam attach-role-policy --role-name ray-s3-fullaccess --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
-
Create an instance profile:
aws iam create-instance-profile --instance-profile-name ray-s3-instance-profile
-
Add the role to the instance profile:
aws iam add-role-to-instance-profile --instance-profile-name ray-s3-instance-profile --role-name ray-s3-fullaccess
-
Create a policy to allow iam:PassRole for ray-autoscaler-v1:
Replace <your-account-id> with your AWS account ID.
aws iam create-policy \
  --policy-name PassRolePolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::<your-account-id>:role/ray-s3-fullaccess"
      }
    ]
  }'
-
Attach the PassRolePolicy to the ray-autoscaler-v1 role:
aws iam attach-role-policy \
  --role-name ray-autoscaler-v1 \
  --policy-arn arn:aws:iam::<your-account-id>:policy/PassRolePolicy
-
Retrieve the ARN for ray-s3-instance-profile:
aws iam list-instance-profiles-for-role --role-name ray-s3-fullaccess --query 'InstanceProfiles[0].Arn' --output text
Note down the retrieved ARN for use in the next steps.
-
Update the YAML configuration:
Open your raycluster.yaml file and replace the placeholder with the ARN you retrieved:
ray.worker.default:
  resources:
    CPU: 1
    resources: 15
  node_config:
    ImageId: ami-07652eda1fbad7432
    InstanceType: p3.2xlarge
    IamInstanceProfile:
      Arn: arn:aws:iam::<your-account-id>:instance-profile/ray-s3-instance-profile
-
Start the Ray cluster:
ray up raycluster.yaml
-
Access the Ray dashboard:
ray dashboard raycluster.yaml
-
Submit a Ray job:
Open a new terminal window and navigate to your project directory:
cd <project-directory>
Submit the Ray job:
ray job submit --address http://localhost:8265 --working-dir . -- python3 main.py
-
Check the S3 bucket:
When the job finishes running, head over to the specified S3 bucket (ray-bucket-model-output), where you should find the trained model.
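The role-creation step reads trust-policy.json, which this guide does not show. A minimal sketch, assuming the role only needs to be assumable by EC2 instances (which is what an instance profile requires), could generate the file as follows. The file contents here are an assumption and may differ from the repository's actual policy.

```shell
# Sketch: write a trust policy that lets EC2 instances assume the role.
# Assumption: the repository's real trust-policy.json may contain more or different statements.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
```

Run this in the project directory before the aws iam create-role command so that file://trust-policy.json resolves.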
Ray is a distributed computing framework that lets you scale applications across multiple machines. In this setup, Ray manages a head node and large VM instances with powerful CPUs and GPUs, and distributes the computational work across them. DeepSpeed's ZeRO Stage 3 partitions optimizer states, gradients, and model parameters across workers, which reduces the memory each GPU needs during fine-tuning.
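To make the ZeRO Stage 3 setup more concrete, here is a minimal sketch of a DeepSpeed configuration file written from the shell. The file name ds_config.json, the batch-size values, and the idea that main.py loads this file are all assumptions for illustration; this guide does not show the repository's actual DeepSpeed settings.

```shell
# Sketch: a small DeepSpeed config that enables ZeRO Stage 3.
# Assumptions: the file name and all numeric values are placeholders,
# and main.py would need to be written to load this file.
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF
```

The "stage": 3 entry is what selects ZeRO Stage 3; the other keys are common companions that you would tune for your model and instance type.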
For detailed documentation on Ray, visit the Ray documentation. For more on DeepSpeed, check out the DeepSpeed documentation.
Made with ❤️ by datamax.ai.