[Performance] Add Packer image generation scripts for GCP and AWS (#4068)

* [Performance] Add Packer image generation scripts for GCP and AWS

* Add docker install and tests

* solve nvidia container issue

* Install cuDNN

* [Performance] Scripts to copy/delete AWS images for all regions and add cloud deps (#4073)

* [Performance] Add AWS script to copy images for all regions

* script to delete all AWS images across regions

* Add cloud dependencies to image

---------

Co-authored-by: Yika Luo <[email protected]>
yika-luo and Yika Luo authored Oct 18, 2024
1 parent 5dc70e8 commit 92fd109
Showing 14 changed files with 680 additions and 5 deletions.
72 changes: 72 additions & 0 deletions sky/clouds/service_catalog/images/README.md
# SkyPilot OS Image Generation Guide

## Prerequisites
You only need to do this once.
1. Install [Packer](https://developer.hashicorp.com/packer/tutorials/aws-get-started/get-started-install-cli)
2. Download plugins used by Packer
```bash
packer init plugins.pkr.hcl
```
3. Setup cloud credentials

## Generate Images
```bash
export CLOUD=gcp # Update this
export TYPE=gpu # Update this
export IMAGE=skypilot-${CLOUD}-${TYPE}-ubuntu
packer build ${IMAGE}.pkr.hcl
```
You will see the image ID after the build is complete.

For reference, approximate `packer build` times:

| Cloud | Type | Approx. Time |
|-------|------|------------------------|
| AWS | GPU | 15 min |
| AWS | CPU | 10 min |
| GCP | GPU | 16 min |
| GCP | CPU | 5 min |

### GCP
```bash
export IMAGE_NAME=skypilot-gcp-cpu-ubuntu-20241011003407 # Update this

# Make image public
export IMAGE_ID=projects/sky-dev-465/global/images/${IMAGE_NAME}
gcloud compute images add-iam-policy-binding ${IMAGE_NAME} --member='allAuthenticatedUsers' --role='roles/compute.imageUser'
```

### AWS
1. Generate images for all regions
```bash
export IMAGE_ID=ami-0b31b24524afa8e47 # Update this

python aws_utils/image_gen.py --image-id ${IMAGE_ID} --processor ${TYPE}
```
2. Add fallback images if any region failed \
Look for "NEED_FALLBACK" in the output `images.csv` and edit. (You can use public [ubuntu images](https://cloud-images.ubuntu.com/locator/ec2/) as fallback.)
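To spot which regions need a fallback, the generated `images.csv` can be scanned with a small helper (a sketch; the function name is ours and it assumes the CSV header written by `image_gen.py`):

```python
import csv

def find_fallback_regions(csv_path='images.csv'):
    """Return the regions whose image copy failed and was marked NEED_FALLBACK."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        return [row['Region'] for row in csv.DictReader(f)
                if row['ImageId'] == 'NEED_FALLBACK']
```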

## Test Images
1. Minimal GPU test: `sky launch --image-id ${IMAGE_ID} --gpus=L4:1 --cloud ${CLOUD}`, then run `nvidia-smi` in the launched instance.
2. Update the image ID in `sky/clouds/gcp.py` and run the test:
```bash
pytest tests/test_smoke.py::test_minimal --gcp
pytest tests/test_smoke.py::test_huggingface --gcp
pytest tests/test_smoke.py::test_job_queue_with_docker --gcp
pytest tests/test_smoke.py::test_cancel_gcp
```

## Ship Images & Cleanup
Submit a PR to update the [`SkyPilot Catalog`](https://github.com/skypilot-org/skypilot-catalog/tree/master/catalogs), then clean up the old images to avoid extra image storage fees.

### GCP
1. Example PR: [#86](https://github.com/skypilot-org/skypilot-catalog/pull/86)
2. Go to console and delete old images.

### AWS
1. Copy the old custom image rows from Catalog's existing `images.csv` to a local `images.csv` in this folder.
2. Update Catalog with new images. Example PR: [#89](https://github.com/skypilot-org/skypilot-catalog/pull/89)
3. Delete AMIs across regions by running
```bash
python aws_utils/image_delete.py --tag ${TAG}
```
63 changes: 63 additions & 0 deletions sky/clouds/service_catalog/images/aws_utils/image_delete.py
"""Delete all images with a given tag, and their associated snapshots, from images.csv.

Example usage: put images.csv in the same folder as this script and run
    python image_delete.py --tag skypilot:custom-gpu-ubuntu-2204
"""

import argparse
import csv
import json
import subprocess

parser = argparse.ArgumentParser(
description='Delete AWS images and their snapshots across regions.')
parser.add_argument('--tag',
required=True,
help='Tag of the image to delete, see tags in images.csv')
args = parser.parse_args()


def get_snapshots(image_id, region):
cmd = f'aws ec2 describe-images --image-ids {image_id} --region {region} --query "Images[*].BlockDeviceMappings[*].Ebs.SnapshotId" --output json'
result = subprocess.run(cmd,
shell=True,
check=True,
capture_output=True,
text=True)
snapshots = json.loads(result.stdout)
return [
snapshot for sublist in snapshots for snapshot in sublist if snapshot
]


def delete_image_and_snapshots(image_id, region):
# Must get snapshots before deleting the image
snapshots = get_snapshots(image_id, region)

# Deregister the image
cmd = f'aws ec2 deregister-image --image-id {image_id} --region {region}'
subprocess.run(cmd, shell=True, check=True)
print(f"Deregistered image {image_id} in region {region}")

# Delete snapshots
for snapshot in snapshots:
cmd = f'aws ec2 delete-snapshot --snapshot-id {snapshot} --region {region}'
subprocess.run(cmd, shell=True, check=True)
print(f'Deleted snapshot {snapshot} in region {region}')


def main():
with open('images.csv', 'r') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
if row['Tag'] == args.tag:
try:
delete_image_and_snapshots(row['ImageId'], row['Region'])
except subprocess.CalledProcessError as e:
print(
f'Failed to delete image {row["ImageId"]} or its snapshots in region {row["Region"]}: {e}'
)


if __name__ == "__main__":
main()
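The `--query "Images[*].BlockDeviceMappings[*].Ebs.SnapshotId"` above yields a list of per-image lists, so `get_snapshots` flattens it and drops nulls. That flattening can be exercised in isolation against canned data (the sample snapshot IDs below are illustrative, not real AWS output):

```python
def flatten_snapshot_ids(nested):
    """Flatten [[snapshot_id, ...], ...] as returned by the describe-images
    query, dropping None entries for unset EBS mappings."""
    return [snap for sublist in nested for snap in sublist if snap]
```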
151 changes: 151 additions & 0 deletions sky/clouds/service_catalog/images/aws_utils/image_gen.py
"""Copy a SkyPilot AMI to multiple regions, make the copies public, and generate images.csv.

Example usage:
    python image_gen.py --image-id ami-00000 --processor gpu
"""

import argparse
import concurrent.futures
import csv
import json
import os
import subprocess
import threading
import time

parser = argparse.ArgumentParser(
description='Generate AWS images across regions.')
parser.add_argument('--image-id',
required=True,
help='The source AMI ID to copy from')
parser.add_argument('--processor', required=True, help='e.g. gpu, cpu, etc.')
parser.add_argument('--region',
default='us-east-1',
help='Region of the source AMI')
parser.add_argument('--base-image-id',
default='ami-005fc0f236362e99f',
help='The base AMI of the source AMI.')
parser.add_argument('--os-type', default='ubuntu', help='The OS type')
parser.add_argument('--os-version', default='22.04', help='The OS version')
parser.add_argument('--output-csv',
default='images.csv',
help='The output CSV file name')
args = parser.parse_args()

# 25 regions in total; the source region (us-east-1) is skipped below.
ALL_REGIONS = [
# 'us-east-1', # Source AMI is already in this region
'us-east-2',
'us-west-1',
'us-west-2',
'ca-central-1',
'eu-central-1', # need for smoke test
'eu-central-2',
'eu-west-1',
'eu-west-2',
'eu-south-1',
'eu-south-2',
'eu-west-3',
'eu-north-1',
'me-south-1',
'me-central-1',
'af-south-1',
'ap-east-1',
'ap-south-1',
'ap-south-2',
'ap-northeast-3',
'ap-northeast-2',
'ap-southeast-1',
'ap-southeast-2',
'ap-southeast-3',
'ap-northeast-1',
]


def make_image_public(image_id, region):
unblock_command = f"aws ec2 disable-image-block-public-access --region {region}"
subprocess.run(unblock_command, shell=True, check=True)
public_command = (
f'aws ec2 modify-image-attribute --image-id {image_id} '
f'--launch-permission "{{\\\"Add\\\": [{{\\\"Group\\\":\\\"all\\\"}}]}}" --region {region}'
)
subprocess.run(public_command, shell=True, check=True)
print(f"Made {image_id} public")


def copy_image_and_make_public(target_region):
# Copy the AMI to the target region
copy_command = (
f"aws ec2 copy-image --source-region {args.region} "
f"--source-image-id {args.image_id} --region {target_region} "
f"--name 'skypilot-aws-{args.processor}-{args.os_type}-{time.time()}' --output json"
)
print(copy_command)
result = subprocess.run(copy_command,
shell=True,
check=True,
capture_output=True,
text=True)
print(result.stdout)
new_image_id = json.loads(result.stdout)['ImageId']
print(f"Copied image to {target_region} with new image ID: {new_image_id}")

# Wait for the image to be available
print(f"Waiting for {new_image_id} to be available...")
wait_command = f"aws ec2 wait image-available --image-ids {new_image_id} --region {target_region}"
subprocess.run(wait_command, shell=True, check=True)

make_image_public(new_image_id, target_region)

return new_image_id


def write_image_to_csv(image_id, region):
with open(args.output_csv, 'a', newline='', encoding='utf-8') as csvfile:
writer = csv.writer(csvfile)
row = [
f'skypilot:custom-{args.processor}-{args.os_type}', region,
args.os_type, args.os_version, image_id,
time.strftime('%Y%m%d'), args.base_image_id
]
writer.writerow(row)
print(f"Wrote to CSV: {row}")


def main():
make_image_public(args.image_id, args.region)
if not os.path.exists(args.output_csv):
with open(args.output_csv, 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerow([
'Tag', 'Region', 'OS', 'OSVersion', 'ImageId', 'CreationDate',
'BaseImageId'
]) # Header
print(f"No existing {args.output_csv} so created it.")

# Process other regions
image_cache = [(args.image_id, args.region)]

def process_region(copy_to_region):
print(f"Start copying image to {copy_to_region}...")
try:
new_image_id = copy_image_and_make_public(copy_to_region)
except Exception as e:
print(f"Error generating image to {copy_to_region}: {str(e)}")
new_image_id = 'NEED_FALLBACK'
image_cache.append((new_image_id, copy_to_region))

    # Copy to all target regions in parallel; exiting the context manager
    # waits for all submitted tasks to finish.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(process_region, ALL_REGIONS)

    # Sort the images by region and write to CSV
sorted_image_cache = sorted(image_cache, key=lambda x: x[1])
for new_image_id, copy_to_region in sorted_image_cache:
write_image_to_csv(new_image_id, copy_to_region)

print("All done!")


if __name__ == "__main__":
main()
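The triple-escaped `--launch-permission` payload in `make_image_public` is easy to get wrong. Building the JSON with `json.dumps` and passing the command as an argument list sidesteps both the hand escaping and `shell=True`; a sketch (not how the script above does it, and the helper name is ours):

```python
import json

def launch_permission_args(image_id, region):
    """Build the argv for making an AMI public, with the launch-permission
    JSON generated by json.dumps instead of hand-escaped quotes."""
    payload = json.dumps({'Add': [{'Group': 'all'}]})
    return ['aws', 'ec2', 'modify-image-attribute',
            '--image-id', image_id,
            '--launch-permission', payload,
            '--region', region]
```

The resulting list can be executed with `subprocess.run(args, check=True)`, with no quoting concerns.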
17 changes: 17 additions & 0 deletions sky/clouds/service_catalog/images/plugins.pkr.hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.2.8"
      source  = "github.com/hashicorp/amazon"
    }
    googlecompute = {
      version = ">= 1.1.1"
      source  = "github.com/hashicorp/googlecompute"
    }
  }
}
50 changes: 50 additions & 0 deletions sky/clouds/service_catalog/images/provisioners/cloud.sh
#!/bin/bash

PYTHON_EXEC="$HOME/skypilot-runtime/bin/python"

# TODO: keep this dependency installation aligned with utils/controller_utils.py and setup.py
install_azure() {
echo "Install cloud dependencies on controller: Azure"
$PYTHON_EXEC -m pip install "azure-cli>=2.31.0" azure-core "azure-identity>=1.13.0" azure-mgmt-network
$PYTHON_EXEC -m pip install azure-storage-blob msgraph-sdk
}

install_gcp() {
echo "Install cloud dependencies on controller: GCP"
$PYTHON_EXEC -m pip install "google-api-python-client>=2.69.0"
$PYTHON_EXEC -m pip install google-cloud-storage
if ! gcloud --help > /dev/null 2>&1; then
pushd /tmp &>/dev/null
mkdir -p ~/.sky/logs
wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log
tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log
rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log
mv google-cloud-sdk ~/
~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1
echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc
source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1
popd &>/dev/null
fi
}

install_aws() {
echo "Install cloud dependencies on controller: AWS"
    # Version specifiers must be quoted: an unquoted `>` is a shell redirection.
    $PYTHON_EXEC -m pip install "botocore>=1.29.10" "boto3>=1.26.1"
    $PYTHON_EXEC -m pip install "urllib3<2" "awscli>=1.27.10" "colorama<0.4.5"
}

if [ "$CLOUD" = "azure" ]; then
install_azure
elif [ "$CLOUD" = "gcp" ]; then
install_gcp
elif [ "$CLOUD" = "aws" ]; then
install_aws
else
    echo "Error: Unknown cloud $CLOUD so not installing any cloud dependencies."
    exit 1
fi

if [ $? -eq 0 ]; then
echo "Successfully installed cloud dependencies on controller: $CLOUD"
else
echo "Error: Failed to install cloud dependencies on controller: $CLOUD"
fi
24 changes: 24 additions & 0 deletions sky/clouds/service_catalog/images/provisioners/cuda.sh
#!/bin/bash

# This script installs the latest CUDA driver and toolkit version that is compatible with all GPU types.
# For CUDA driver version, choose the latest version that works for ALL GPU types.
# GCP: https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#minimum-driver
# AWS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html
export DEBIAN_FRONTEND=noninteractive

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Make sure CUDA toolkit and driver versions are compatible: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
# Current state: driver branch 535 (535.183.06) with CUDA toolkit 12.4
sudo apt-get install -y cuda-drivers-535
sudo apt-get install -y cuda-toolkit-12-4

# Install cuDNN
# https://docs.nvidia.com/deeplearning/cudnn/latest/installation/linux.html#installing-on-linux
sudo apt-get install -y libcudnn8
sudo apt-get install -y libcudnn8-dev

# Cleanup
rm cuda-keyring_1.1-1_all.deb
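After an instance boots from the image, the installed driver can be sanity-checked by parsing the `nvidia-smi` banner. A minimal sketch of the parsing step (the helper name and sample output string are ours, for illustration):

```python
import re

def parse_driver_version(nvidia_smi_output):
    """Extract the driver version from nvidia-smi's banner line, or None."""
    match = re.search(r'Driver Version:\s*([\d.]+)', nvidia_smi_output)
    return match.group(1) if match else None
```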