Ray Project Setup - Multi-GPU Training

This project uses Ray to manage VM instances for distributed computing. In this setup, a head node launches and coordinates a training job on larger worker instances with powerful CPUs and multiple GPUs. Tested and working with Python 3.11.

Getting Started

  1. Clone the repository:

    git clone ray-distributed-compute.git
    cd ray-distributed-compute
  2. Install Ray:

    pip install ray
  3. Configure AWS CLI:

    aws configure
  4. Find Your AWS Account Number:

    aws sts get-caller-identity --query Account --output text

    Note down your AWS account number, as you will need it for the next steps.

  5. Create an IAM role with full S3 access (an example trust-policy.json is shown after this list):

    aws iam create-role --role-name ray-s3-fullaccess --assume-role-policy-document file://trust-policy.json
  6. Create an S3 bucket:

    aws s3api create-bucket --bucket ray-bucket-model-output --region eu-central-1 --create-bucket-configuration LocationConstraint=eu-central-1
  7. Attach the S3 full access policy to the role:

    aws iam attach-role-policy --role-name ray-s3-fullaccess --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
  8. Create an instance profile:

    aws iam create-instance-profile --instance-profile-name ray-s3-instance-profile
  9. Add the role to the instance profile:

    aws iam add-role-to-instance-profile --instance-profile-name ray-s3-instance-profile --role-name ray-s3-fullaccess
  10. Create a policy that allows iam:PassRole on the ray-s3-fullaccess role, so the autoscaler can pass it to EC2 instances. Replace <your-account-id> with your AWS account number.

    aws iam create-policy \
        --policy-name PassRolePolicy \
        --policy-document '{
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": "iam:PassRole",
                    "Resource": "arn:aws:iam::<your-account-id>:role/ray-s3-fullaccess"
                }
            ]
        }'
  11. Attach the PassRolePolicy to the ray-autoscaler-v1 role:

    aws iam attach-role-policy \
        --role-name ray-autoscaler-v1 \
        --policy-arn arn:aws:iam::<your-account-id>:policy/PassRolePolicy
  12. Retrieve the ARN for ray-s3-instance-profile:

    aws iam list-instance-profiles-for-role --role-name ray-s3-fullaccess --query 'InstanceProfiles[0].Arn' --output text

    Note down the retrieved ARN for use in the next steps.

  13. Update the YAML configuration:

    Open your raycluster.yaml file and replace the placeholder in the worker node's IamInstanceProfile with the ARN you retrieved (a fuller sketch of the file appears after this list):

    ray.worker.default:
      resources:
        CPU: 1
        resources: 15
      node_config:
        ImageId: ami-07652eda1fbad7432
        InstanceType: p3.2xlarge
        IamInstanceProfile:
          Arn: arn:aws:iam::<your-account-id>:instance-profile/ray-s3-instance-profile
  14. Start the Ray cluster:

    ray up raycluster.yaml
  15. Access the Ray dashboard (this forwards it to http://localhost:8265, the address used for job submission below):

    ray dashboard raycluster.yaml
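
The trust policy file referenced in step 5 is not shown above. A minimal trust-policy.json that lets EC2 instances assume the ray-s3-fullaccess role (a standard EC2 trust relationship; adjust it if your setup differs) looks like this:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "ec2.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }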
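The snippet in step 13 lives under available_node_types in the cluster YAML. As a rough sketch of the surrounding file (the head node instance type, AMI, SSH user, and worker counts here are assumptions; keep whatever your repo's raycluster.yaml already defines):

    cluster_name: multi-gpu-training

    provider:
      type: aws
      region: eu-central-1

    auth:
      ssh_user: ubuntu

    available_node_types:
      ray.head.default:
        node_config:
          InstanceType: m5.large
          ImageId: ami-07652eda1fbad7432
      ray.worker.default:
        min_workers: 1
        max_workers: 1
        node_config:
          ImageId: ami-07652eda1fbad7432
          InstanceType: p3.2xlarge
          IamInstanceProfile:
            Arn: arn:aws:iam::<your-account-id>:instance-profile/ray-s3-instance-profile

    head_node_type: ray.head.default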

Submitting the Multi-GPU Training Job

After completing the setup steps above, submit the job with the following command, which runs ray-train-multiple-gpu.py on the cluster:

ray job submit --address http://localhost:8265 --working-dir . -- python3 ray-train-multiple-gpu.py
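
The contents of ray-train-multiple-gpu.py belong to this repository, but as a rough sketch, a multi-GPU training script built on Ray Train (PyTorch) typically has this shape — the model, data, and worker count below are placeholders, not the repo's actual code:

    import torch
    import torch.nn as nn
    from ray.train import ScalingConfig, report
    from ray.train.torch import TorchTrainer, get_device, prepare_model

    def train_loop_per_worker(config):
        # Each Ray worker runs this loop on its own GPU.
        device = get_device()
        model = prepare_model(nn.Linear(10, 1))  # placeholder model, wrapped in DDP by Ray
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for epoch in range(config["epochs"]):
            inputs = torch.randn(32, 10, device=device)   # placeholder batch
            targets = torch.randn(32, 1, device=device)
            loss = nn.functional.mse_loss(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            report({"epoch": epoch, "loss": loss.item()})  # surface metrics to Ray Train

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 5},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # one GPU per worker
    )
    result = trainer.fit()
    print(result.metrics)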

Retrieving the Output

To find the output:

  1. Connect to the head node via SSH:

    ray attach raycluster.yaml
  2. Navigate to the results directory:

    ls
    cd ray_results

    Here, you'll find the results of the training job.
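
If the node has the ray-s3-fullaccess instance profile and the AWS CLI installed (both assumptions about your setup), you can also copy the results into the bucket created in step 6, for example:

    aws s3 cp ~/ray_results s3://ray-bucket-model-output/ray_results --recursive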

Overview

Ray is a distributed computing framework that lets you scale applications across multiple machines. In this setup, Ray manages a head node and larger worker instances with powerful CPUs and multiple GPUs, using their hardware to run computational tasks efficiently.
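
As a minimal illustration of the programming model (not this project's training script), a Ray task declares the GPUs it needs and the cluster schedules it onto a node with a free GPU:

    import ray

    ray.init()  # inside a submitted job this connects to the running cluster

    @ray.remote(num_gpus=1)
    def gpu_task(x):
        # Runs on a worker with a free GPU; Ray sets CUDA_VISIBLE_DEVICES for it.
        return x * 2

    futures = [gpu_task.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 2, 4, 6]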

For detailed documentation on Ray, visit the Ray documentation at https://docs.ray.io.