
[Feature] support spot pod on RunPod #4447

Merged: 13 commits merged into skypilot-org:master from dev/hong/support-runpod-spot on Dec 9, 2024

Conversation

@weih1121 (Contributor) commented on Dec 6, 2024

Changes

  • Added support for Spot Pods in RunPod's low-level API, enabling GPU bidding with bid_per_gpu derived from the instance hourly cost (a sketch of this derivation follows this list).
  • Implemented region zone fetching for specific instance types.
  • Updated runpod-ray.yml.j2 to include Preemptible and bid_per_gpu in node configuration.
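For context, a rough sketch of the bid derivation in the first bullet above. Hedged: instance_type_to_hourly_cost is SkyPilot's existing catalog helper, but the variables in scope and the even per-GPU split shown here are illustrative assumptions, not this PR's exact code.

# Hedged sketch: derive the per-GPU bid from the instance hourly cost.
# `cloud`, `instance_type`, `region`, `zone`, and `gpu_count` are assumed
# to be in scope; splitting the price evenly across GPUs is an assumption.
hourly_cost = cloud.instance_type_to_hourly_cost(instance_type,
                                                 use_spot=True,
                                                 region=region,
                                                 zone=zone)
bid_per_gpu = hourly_cost / gpu_count  # RunPod bids are priced per GPU.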

Tested launching an RTXA6000x1 spot pod and running the hello-sky task.
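The exact launch command for the spot run isn't shown; presumably something like the following, using SkyPilot's standard --use-spot flag (this command is an assumption, not from the PR):

$ python cli.py launch -c mycluster1 --gpus RTXA6000:1 --use-spot hello-sky/hello_sky.yaml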

Logging: (screenshot)

Logs:

$ tail -f ~/sky_logs/sky-2024-12-07-00-31-40-367923/provision.log
D 12-07 00:32:25 provisioner.py:139]   "tags": {},
D 12-07 00:32:25 provisioner.py:139]   "resume_stopped_nodes": true,
D 12-07 00:32:25 provisioner.py:139]   "ports_to_open_on_launch": []
D 12-07 00:32:25 provisioner.py:139] }
I 12-07 00:32:31 instance.py:99] Launched instance o019t372yrbcfp.
I 12-07 00:32:32 instance.py:111] Waiting for instances to be ready: (0/1).
I 12-07 00:32:39 instance.py:111] Waiting for instances to be ready: (0/1).
I 12-07 00:32:45 instance.py:111] Waiting for instances to be ready: (0/1).

Launched Pod with instance id o019t372yrbcfp: (screenshot)
Pod launched: (screenshot)
Job Finished: (screenshot)

Tearing down the cluster:

$ sky down mycluster1
Terminating 1 cluster: mycluster1. Proceed? [Y/n]: y
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--Failed to fetch cluster status for 'mycluster1'. Assuming the cluster is still up.
Terminating cluster mycluster1...done.
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
(screenshot)

Another Concern / TODO

I'm not sure what will happen if the pod is interrupted during provisioning; I will add the test result.

On-demand pod provision and teardown test

$ python cli.py launch -c mycluster1 hello-sky/hello_sky.yaml            
Task from YAML spec: hello-sky/hello_sky.yaml
Considered resources (1 node):
--------------------------------------------------------------------------------------------------
 CLOUD    INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------------
 RunPod   1x_RTXA6000_SECURE   4       48        RTXA6000:1     CA            0.79          ✔     
--------------------------------------------------------------------------------------------------
Launching a new cluster 'mycluster1'. Proceed? [Y/n]: y
⠙ Launching  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/provision.log__init__.py         :27   2024-12-07 01:05:08,392 Default Auth client Circuit breaker strategy enabled
Key already exists
⚙︎ Launching on RunPod CA.
W 12-07 01:05:14 instance.py:97] run_instances error: There are no longer any instances available with the requested specifications. Please refresh and try again.
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in CA for {RunPod({'RTXA6000': 1})}. 

↺ Trying other potential resources.
Considered resources (1 node):
--------------------------------------------------------------------------------------------------
 CLOUD    INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------------
 RunPod   1x_RTXA6000_SECURE   4       48        RTXA6000:1     CZ            0.79          ✔     
--------------------------------------------------------------------------------------------------
Key already exists
⚙︎ Launching on RunPod CZ.
└── Instance is up.
✓ Cluster launched: mycluster1.  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/provision.log
⚙︎ Mounting files.
  Syncing workdir (to 1 node): . -> ~/sky_workdir
✓ Workdir synced.  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/workdir_sync.log
⚙︎ Running setup on 1 VM.
Running setup.
✓ Setup completed.  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(task, pid=2317) Hello, SkyPilot!
(task, pid=2317) # conda environments:
(task, pid=2317) #
(task, pid=2317) base                  *  /root/miniconda3
(task, pid=2317) 
✓ Job finished (status: SUCCEEDED).

Job ID: 1
📋 Useful Commands
├── To cancel the job:		sky cancel mycluster1 1
├── To stream job logs:		sky logs mycluster1 1
└── To view job queue:		sky queue mycluster1

Cluster name: mycluster1
├── To log into the head VM:	ssh mycluster1
├── To submit a job:		sky exec mycluster1 yaml_file
├── To stop the cluster:	sky stop mycluster1
└── To teardown the cluster:	sky down mycluster1

(sky) 
hphilosophy at mac in ~/skypilot/sky (dev/hong/support-runpod-spot●) 
$ sky down mycluster1
Terminating 1 cluster: mycluster1. Proceed? [Y/n]: y
Terminating cluster mycluster1...done.
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@weih1121 weih1121 marked this pull request as draft December 6, 2024 13:29
@weih1121 weih1121 changed the title [RunPod] support spot pod [Feature] support spot pod Dec 6, 2024
@weih1121 weih1121 marked this pull request as ready for review December 6, 2024 16:53
@Michaelvll (Collaborator) left a comment:

Thanks @weih1121 for adding support for spot instances on RunPod! It looks mostly good to me.
Fixes #3927 and #4265.

Resolved review threads: sky/provision/runpod/api/commands.py, sky/provision/runpod/api/pods.py, sky/provision/runpod/utils.py
Comment on lines 154 to 185
if preemptible is None or not preemptible:
    new_instance = runpod.runpod.create_pod(
        name=name,
        image_name=image_name,
        gpu_type_id=gpu_type,
        cloud_type=cloud_type,
        container_disk_in_gb=disk_size,
        min_vcpu_count=4 * gpu_quantity,
        min_memory_in_gb=gpu_specs['memoryInGb'] * gpu_quantity,
        gpu_count=gpu_quantity,
        country_code=region,
        ports=ports,
        support_public_ip=True,
        docker_args=docker_args,
    )
else:
    new_instance = create_spot_pod(
        name=name,
        image_name=image_name,
        gpu_type_id=gpu_type,
        cloud_type=cloud_type,
        bid_per_gpu=bid_per_gpu,
        container_disk_in_gb=disk_size,
        volume_in_gb=disk_size,
        min_vcpu_count=4 * gpu_quantity,
        min_memory_in_gb=gpu_specs['memoryInGb'] * gpu_quantity,
        gpu_count=gpu_quantity,
        country_code=region,
        ports=ports,
        support_public_ip=True,
        docker_args=docker_args,
    )
@Michaelvll (Collaborator) commented:

Just wondering if we should keep only the GraphQL implementation for creating pods, i.e., use the same GraphQL implementation for on-demand as well, to make the code structure cleaner.

@weih1121 (Contributor, Author) commented on Dec 8, 2024:

The input fields are defined at https://graphql-spec.runpod.io/#definition-PodRentInterruptableInput and https://graphql-spec.runpod.io/#definition-PodFindAndDeployOnDemandInput.

Although some of the fields are not currently used in the runpod Python API or the new API, I suggest keeping both the create-spot-pod and create-pod implementations; this will make it easier to accommodate future changes and customer needs.
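For illustration, a minimal sketch of calling the spot-pod mutation directly. The field names follow the PodRentInterruptableInput spec linked above, but the query string, helper name, and payload details are assumptions, not this PR's code:

# Hedged sketch of the podRentInterruptable mutation; illustrative only,
# not the runpod SDK's or this PR's implementation.
import requests

def create_spot_pod_sketch(api_key: str, name: str, image_name: str,
                           gpu_type_id: str, bid_per_gpu: float,
                           gpu_count: int = 1) -> dict:
    # Field names per the linked PodRentInterruptableInput definition
    # (assumed); cloudType SECURE mirrors the on-demand path above.
    query = '''
    mutation {
      podRentInterruptable(input: {
        name: "%s", imageName: "%s", gpuTypeId: "%s",
        bidPerGpu: %.2f, gpuCount: %d, cloudType: SECURE
      }) { id desiredStatus }
    }
    ''' % (name, image_name, gpu_type_id, bid_per_gpu, gpu_count)
    resp = requests.post('https://api.runpod.io/graphql',
                         params={'api_key': api_key},
                         json={'query': query})
    resp.raise_for_status()
    return resp.json()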

@weih1121 (Contributor, Author) commented on Dec 8, 2024:

Updated a bit to make it cleaner. I can also submit a PR to RunPod to support spot instances.

@Michaelvll (Collaborator) replied:

Ah, I see. It should be fine then. Let's keep the current two-way implementation.

Resolved review thread: sky/resources.py
@weih1121 weih1121 changed the title [Feature] support spot pod [Feature] support spot pod for RunPod Dec 7, 2024
@weih1121 weih1121 changed the title [Feature] support spot pod for RunPod [Feature] support spot pod on RunPod Dec 7, 2024
@weih1121 weih1121 requested a review from Michaelvll December 8, 2024 15:15
instance_type = resources.instance_type
use_spot = resources.use_spot

hourly_cost = r.cloud.instance_type_to_hourly_cost(
@Michaelvll (Collaborator) commented:

Suggested change:
-hourly_cost = r.cloud.instance_type_to_hourly_cost(
+hourly_cost = self.instance_type_to_hourly_cost(

@weih1121 (Contributor, Author) replied:

Updated.

@Michaelvll (Collaborator) left a comment:

Thanks for adding the support @weih1121! It looks good to me.

-    public_key=config.node_config['PublicKey'])
+    public_key=config.node_config['PublicKey'],
+    preemptible=config.node_config['Preemptible'],
+    bid_per_gpu=config.node_config['bid_per_gpu'],
@Michaelvll (Collaborator) commented:

Should we keep the same naming style, i.e. BidPerGPU?

@weih1121 (Contributor, Author) replied:

Good catch! Updated!
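After the rename, the node config lookup presumably reads as follows (an assumption based on this thread, not the merged diff):

preemptible=config.node_config['Preemptible'],
bid_per_gpu=config.node_config['BidPerGPU'],  # renamed from 'bid_per_gpu'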

@Michaelvll (Collaborator) commented:

Thanks @weih1121! Merging it now.

@Michaelvll Michaelvll merged commit f60e385 into skypilot-org:master Dec 9, 2024
19 checks passed
cg505 added a commit to cg505/skypilot that referenced this pull request Dec 9, 2024
Fixes runpod import issues introduced in skypilot-org#4447.
@cg505 mentioned this pull request on Dec 9, 2024.
cg505 added a commit that referenced this pull request Dec 9, 2024
Fixes runpod import issues introduced in #4447.