This project uses Ray to manage VM instances for distributed computing. In this setup, we'll use a head node to execute a job on larger VM instances with powerful CPUs and multiple GPUs. Tested and working using Python 3.11.
-
Clone the repository:
git clone ray-distributed-compute.git
cd ray-distributed-compute
-
Install Ray:
pip install ray
-
Configure AWS CLI:
aws configure
-
Find your AWS account number:
aws sts get-caller-identity --query Account --output text
Note down your AWS account number, as you will need it for the next steps.
-
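Create the IAM role that will be given full S3 access in a later step (the command reads the role's trust policy from trust-policy.json in the current directory):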
aws iam create-role --role-name ray-s3-fullaccess --assume-role-policy-document file://trust-policy.json
-
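Create an S3 bucket (bucket names are globally unique, so choose a different name if ray-bucket-model-output is already taken):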
aws s3api create-bucket --bucket ray-bucket-model-output --region eu-central-1 --create-bucket-configuration LocationConstraint=eu-central-1
-
Attach the S3 full access policy to the role:
aws iam attach-role-policy --role-name ray-s3-fullaccess --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
-
Create an instance profile:
aws iam create-instance-profile --instance-profile-name ray-s3-instance-profile
-
Add the role to the instance profile:
aws iam add-role-to-instance-profile --instance-profile-name ray-s3-instance-profile --role-name ray-s3-fullaccess
-
Create a policy to allow iam:PassRole for ray-autoscaler-v1. Replace <your-account-id> with your AWS account number:
aws iam create-policy \
  --policy-name PassRolePolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::<your-account-id>:role/ray-s3-fullaccess"
      }
    ]
  }'
-
Attach the PassRolePolicy to the ray-autoscaler-v1 role:
aws iam attach-role-policy \
  --role-name ray-autoscaler-v1 \
  --policy-arn arn:aws:iam::<your-account-id>:policy/PassRolePolicy
-
Retrieve the ARN for ray-s3-instance-profile:
aws iam list-instance-profiles-for-role --role-name ray-s3-fullaccess --query 'InstanceProfiles[0].Arn' --output text
Note down the retrieved ARN for use in the next steps.
-
Update the YAML configuration:
Open your raycluster.yaml file and replace the placeholder with the ARN you retrieved:
ray.worker.default:
  resources:
    CPU: 1
    resources: 15
  node_config:
    ImageId: ami-07652eda1fbad7432
    InstanceType: p3.2xlarge
    IamInstanceProfile:
      Arn: arn:aws:iam::<your-account-id>:instance-profile/ray-s3-instance-profile
-
Start the Ray cluster:
ray up raycluster.yaml
-
Access the Ray dashboard:
ray dashboard raycluster.yaml
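Two of the IAM steps above take JSON input: the role-creation step reads trust-policy.json, which this README does not show, and the PassRole policy needs your account ID filled in. The following is a minimal sketch of how both files might be prepared, not the project's own files. It assumes the role is meant to be assumed by EC2 instances (the Ray worker nodes); the variable ACCOUNT_ID, its dummy value, and the file name passrole-policy.json are hypothetical.

```shell
#!/bin/sh
# Trust policy for the role. Assumption: the role is assumed by EC2
# instances (the Ray worker nodes), hence the ec2.amazonaws.com principal.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# ACCOUNT_ID is a hypothetical variable; 123456789012 is a dummy value.
# Set it from: aws sts get-caller-identity --query Account --output text
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"

# Same PassRole policy as in the step above, with the account ID filled in.
# passrole-policy.json is a hypothetical file name.
cat > passrole-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::${ACCOUNT_ID}:role/ray-s3-fullaccess"
    }
  ]
}
EOF
```

With the second file in place, the inline JSON in the create-policy step can be replaced by --policy-document file://passrole-policy.json.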
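After completing all the setup steps, submit the job with the following command, which uses ray-train-multiple-gpu.py as the script. The address http://localhost:8265 is the dashboard port forwarded by the ray dashboard command, so keep that command running in a separate terminal: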
ray job submit --address http://localhost:8265 --working-dir . -- python3 ray-train-multiple-gpu.py
To find the output:
-
Connect to the head node via SSH:
ray attach raycluster.yaml
-
Navigate to the results directory:
ls
cd ray_results
Here, you'll find the results of the training job.
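If you want a copy of these results in the S3 bucket created during setup, something along these lines could be run on the head node. This is only a sketch: the helper name sync_results is hypothetical, it assumes the results sit in a ray_results directory relative to where you run it, and it assumes the node you run it on has credentials that allow writing to ray-bucket-model-output (the instance profile in this README is configured for the worker nodes).

```shell
#!/bin/sh
# Hypothetical helper: copy a local results directory to the bucket
# created earlier, under a prefix named after the directory.
sync_results() {
    src="${1:-ray_results}"   # directory to upload; defaults to ray_results
    aws s3 sync "$src" "s3://ray-bucket-model-output/$(basename "$src")/"
}

# Example call (commented out so that nothing is uploaded by accident):
# sync_results ray_results
```

Because aws s3 sync only transfers files that are new or changed, the helper can be re-run after later training jobs without re-uploading everything.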
Ray is a distributed computing framework that allows you to easily scale your applications across multiple machines. In this setup, you'll use Ray to manage a head node and a larger VM instance with a powerful CPU and multiple GPUs, leveraging their respective hardware capabilities to perform computational tasks efficiently.
For detailed documentation on Ray, visit the Ray documentation.