Ray Project Setup - Multi-GPU Training

This project uses Ray to manage VM instances for distributed computing. In this setup, a head node launches and coordinates a training job on larger worker instances with powerful CPUs and multiple GPUs. Tested and working with Python 3.11.

Getting Started

  1. Clone the repository:

    git clone ray-distributed-compute.git
    cd ray-distributed-compute
  2. Install Ray:

    pip install ray
  3. Configure AWS CLI:

    aws configure
  4. Find Your AWS Account Number:

    aws sts get-caller-identity --query Account --output text

    Note down your AWS account number, as you will need it for the next steps.

  5. Create an IAM role with full S3 access (an example trust-policy.json is shown after this list):

    aws iam create-role --role-name ray-s3-fullaccess --assume-role-policy-document file://trust-policy.json
  6. Create an S3 bucket:

    aws s3api create-bucket --bucket ray-bucket-model-output --region eu-central-1 --create-bucket-configuration LocationConstraint=eu-central-1
  7. Attach the S3 full access policy to the role:

    aws iam attach-role-policy --role-name ray-s3-fullaccess --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
  8. Create an instance profile:

    aws iam create-instance-profile --instance-profile-name ray-s3-instance-profile
  9. Add the role to the instance profile:

    aws iam add-role-to-instance-profile --instance-profile-name ray-s3-instance-profile --role-name ray-s3-fullaccess
  10. Create a policy that allows iam:PassRole on the ray-s3-fullaccess role, so the autoscaler can pass it to EC2 instances. Replace <your-account-id> with your AWS account number.

    aws iam create-policy \
        --policy-name PassRolePolicy \
        --policy-document '{
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": "iam:PassRole",
                    "Resource": "arn:aws:iam::<your-account-id>:role/ray-s3-fullaccess"
                }
            ]
        }'
  11. Attach the PassRolePolicy to the ray-autoscaler-v1 role:

    aws iam attach-role-policy \
        --role-name ray-autoscaler-v1 \
        --policy-arn arn:aws:iam::<your-account-id>:policy/PassRolePolicy
  12. Retrieve the ARN for ray-s3-instance-profile:

    aws iam list-instance-profiles-for-role --role-name ray-s3-fullaccess --query 'InstanceProfiles[0].Arn' --output text

    Note down the retrieved ARN for use in the next steps.

  13. Update the YAML configuration:

    Open your raycluster.yaml file and replace the placeholder in the worker node's IamInstanceProfile with the ARN you retrieved (a fuller sketch of the file appears after this list):

    ray.worker.default:
      resources:
        CPU: 1
        resources: 15
      node_config:
        ImageId: ami-07652eda1fbad7432
        InstanceType: p3.2xlarge
        IamInstanceProfile:
          Arn: arn:aws:iam::<your-account-id>:instance-profile/ray-s3-instance-profile
  14. Start the Ray cluster:

    ray up raycluster.yaml
  15. Access the Ray dashboard (this forwards it to http://localhost:8265, the address used for job submission below):

    ray dashboard raycluster.yaml
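
The trust policy file referenced in step 5 is not shown above. A minimal trust-policy.json that lets EC2 instances assume the ray-s3-fullaccess role (a standard EC2 trust relationship; adjust it if your setup differs) looks like this:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "ec2.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }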
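The snippet in step 13 lives under available_node_types in the cluster YAML. As a rough sketch of the surrounding file (the head node instance type, AMI, SSH user, and worker counts here are assumptions; keep whatever your repo's raycluster.yaml already defines):

    cluster_name: multi-gpu-training

    provider:
      type: aws
      region: eu-central-1

    auth:
      ssh_user: ubuntu

    available_node_types:
      ray.head.default:
        node_config:
          InstanceType: m5.large
          ImageId: ami-07652eda1fbad7432
      ray.worker.default:
        min_workers: 1
        max_workers: 1
        node_config:
          ImageId: ami-07652eda1fbad7432
          InstanceType: p3.2xlarge
          IamInstanceProfile:
            Arn: arn:aws:iam::<your-account-id>:instance-profile/ray-s3-instance-profile

    head_node_type: ray.head.default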

Submitting the Multi-GPU Training Job

After completing the setup steps above, submit the job with the following command, which runs ray-train-multiple-gpu.py on the cluster:

ray job submit --address http://localhost:8265 --working-dir . -- python3 ray-train-multiple-gpu.py
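
The contents of ray-train-multiple-gpu.py belong to this repository, but as a rough sketch, a multi-GPU training script built on Ray Train (PyTorch) typically has this shape — the model, data, and worker count below are placeholders, not the repo's actual code:

    import torch
    import torch.nn as nn
    from ray.train import ScalingConfig, report
    from ray.train.torch import TorchTrainer, get_device, prepare_model

    def train_loop_per_worker(config):
        # Each Ray worker runs this loop on its own GPU.
        device = get_device()
        model = prepare_model(nn.Linear(10, 1))  # placeholder model, wrapped in DDP by Ray
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for epoch in range(config["epochs"]):
            inputs = torch.randn(32, 10, device=device)   # placeholder batch
            targets = torch.randn(32, 1, device=device)
            loss = nn.functional.mse_loss(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            report({"epoch": epoch, "loss": loss.item()})  # surface metrics to Ray Train

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 5},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # one GPU per worker
    )
    result = trainer.fit()
    print(result.metrics)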

Retrieving the Output

To find the output:

  1. Connect to the head node via SSH:

    ray attach raycluster.yaml
  2. Navigate to the results directory:

    ls
    cd ray_results

    Here, you'll find the results of the training job.
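
If the node has the ray-s3-fullaccess instance profile and the AWS CLI installed (both assumptions about your setup), you can also copy the results into the bucket created in step 6, for example:

    aws s3 cp ~/ray_results s3://ray-bucket-model-output/ray_results --recursive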

Overview

Ray is a distributed computing framework that lets you scale applications across multiple machines. In this setup, Ray manages a head node and larger worker instances with powerful CPUs and multiple GPUs, using their hardware to run computational tasks efficiently.
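
As a minimal illustration of the programming model (not this project's training script), a Ray task declares the GPUs it needs and the cluster schedules it onto a node with a free GPU:

    import ray

    ray.init()  # inside a submitted job this connects to the running cluster

    @ray.remote(num_gpus=1)
    def gpu_task(x):
        # Runs on a worker with a free GPU; Ray sets CUDA_VISIBLE_DEVICES for it.
        return x * 2

    futures = [gpu_task.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 2, 4, 6]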

For detailed documentation on Ray, visit the Ray documentation at https://docs.ray.io.