
[Feature] support spot pod on RunPod #4447

Merged: 13 commits merged into skypilot-org:master from dev/hong/support-runpod-spot on Dec 9, 2024

Conversation

@weih1121 (Contributor) commented on Dec 6, 2024

Changes

  • Added support for Spot Pods in RunPod's low-level API, enabling GPU bidding with bid_per_gpu derived from the instance hourly cost (a sketch of this derivation follows this list).
  • Implemented region zone fetching for specific instance types.
  • Updated runpod-ray.yml.j2 to include Preemptible and bid_per_gpu in node configuration.
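For context, a rough sketch of the bid derivation in the first bullet above. Hedged: instance_type_to_hourly_cost is SkyPilot's existing catalog helper, but the variables in scope and the even per-GPU split shown here are illustrative assumptions, not this PR's exact code.

# Hedged sketch: derive the per-GPU bid from the instance hourly cost.
# `cloud`, `instance_type`, `region`, `zone`, and `gpu_count` are assumed
# to be in scope; splitting the price evenly across GPUs is an assumption.
hourly_cost = cloud.instance_type_to_hourly_cost(instance_type,
                                                 use_spot=True,
                                                 region=region,
                                                 zone=zone)
bid_per_gpu = hourly_cost / gpu_count  # RunPod bids are priced per GPU.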

Tested launching an RTXA6000x1 spot pod and running the hello-sky task.
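The exact launch command for the spot run isn't shown; presumably something like the following, using SkyPilot's standard --use-spot flag (this command is an assumption, not from the PR):

$ python cli.py launch -c mycluster1 --gpus RTXA6000:1 --use-spot hello-sky/hello_sky.yaml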

Logging: (screenshot)

Logs:

$ tail -f ~/sky_logs/sky-2024-12-07-00-31-40-367923/provision.log
D 12-07 00:32:25 provisioner.py:139]   "tags": {},
D 12-07 00:32:25 provisioner.py:139]   "resume_stopped_nodes": true,
D 12-07 00:32:25 provisioner.py:139]   "ports_to_open_on_launch": []
D 12-07 00:32:25 provisioner.py:139] }
I 12-07 00:32:31 instance.py:99] Launched instance o019t372yrbcfp.
I 12-07 00:32:32 instance.py:111] Waiting for instances to be ready: (0/1).
I 12-07 00:32:39 instance.py:111] Waiting for instances to be ready: (0/1).
I 12-07 00:32:45 instance.py:111] Waiting for instances to be ready: (0/1).

Launched Pod with instance id o019t372yrbcfp: (screenshot)
Pod launched: (screenshot)
Job Finished: (screenshot)

Tearing down the cluster:

$ sky down mycluster1
Terminating 1 cluster: mycluster1. Proceed? [Y/n]: y
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--Failed to fetch cluster status for 'mycluster1'. Assuming the cluster is still up.
Terminating cluster mycluster1...done.
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
(screenshot)

Another Concern / TODO

I'm not sure what will happen if the pod is interrupted during provisioning; I will add the test result.

On-demand pod provision and teardown test

$ python cli.py launch -c mycluster1 hello-sky/hello_sky.yaml            
Task from YAML spec: hello-sky/hello_sky.yaml
Considered resources (1 node):
--------------------------------------------------------------------------------------------------
 CLOUD    INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------------
 RunPod   1x_RTXA6000_SECURE   4       48        RTXA6000:1     CA            0.79          ✔     
--------------------------------------------------------------------------------------------------
Launching a new cluster 'mycluster1'. Proceed? [Y/n]: y
⠙ Launching  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/provision.log__init__.py         :27   2024-12-07 01:05:08,392 Default Auth client Circuit breaker strategy enabled
Key already exists
⚙︎ Launching on RunPod CA.
W 12-07 01:05:14 instance.py:97] run_instances error: There are no longer any instances available with the requested specifications. Please refresh and try again.
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in CA for {RunPod({'RTXA6000': 1})}. 

↺ Trying other potential resources.
Considered resources (1 node):
--------------------------------------------------------------------------------------------------
 CLOUD    INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------------
 RunPod   1x_RTXA6000_SECURE   4       48        RTXA6000:1     CZ            0.79          ✔     
--------------------------------------------------------------------------------------------------
Key already exists
⚙︎ Launching on RunPod CZ.
└── Instance is up.
✓ Cluster launched: mycluster1.  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/provision.log
⚙︎ Mounting files.
  Syncing workdir (to 1 node): . -> ~/sky_workdir
✓ Workdir synced.  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/workdir_sync.log
⚙︎ Running setup on 1 VM.
Running setup.
✓ Setup completed.  View logs at: ~/sky_logs/sky-2024-12-07-01-05-05-809190/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(task, pid=2317) Hello, SkyPilot!
(task, pid=2317) # conda environments:
(task, pid=2317) #
(task, pid=2317) base                  *  /root/miniconda3
(task, pid=2317) 
✓ Job finished (status: SUCCEEDED).

Job ID: 1
📋 Useful Commands
├── To cancel the job:		sky cancel mycluster1 1
├── To stream job logs:		sky logs mycluster1 1
└── To view job queue:		sky queue mycluster1

Cluster name: mycluster1
├── To log into the head VM:	ssh mycluster1
├── To submit a job:		sky exec mycluster1 yaml_file
├── To stop the cluster:	sky stop mycluster1
└── To teardown the cluster:	sky down mycluster1

(sky) 
hphilosophy at mac in ~/skypilot/sky (dev/hong/support-runpod-spot●) 
$ sky down mycluster1
Terminating 1 cluster: mycluster1. Proceed? [Y/n]: y
Terminating cluster mycluster1...done.
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@weih1121 weih1121 marked this pull request as draft December 6, 2024 13:29
@weih1121 weih1121 changed the title [RunPod] support spot pod [Feature] support spot pod Dec 6, 2024
@weih1121 weih1121 marked this pull request as ready for review December 6, 2024 16:53
@Michaelvll (Collaborator) left a comment:

Thanks @weih1121 for adding support for spot instances on RunPod! It looks mostly good to me.
Fixes #3927 and #4265.

Resolved review threads: sky/provision/runpod/api/commands.py, sky/provision/runpod/api/pods.py, sky/provision/runpod/utils.py
Comment on lines 154 to 185
if preemptible is None or not preemptible:
    new_instance = runpod.runpod.create_pod(
        name=name,
        image_name=image_name,
        gpu_type_id=gpu_type,
        cloud_type=cloud_type,
        container_disk_in_gb=disk_size,
        min_vcpu_count=4 * gpu_quantity,
        min_memory_in_gb=gpu_specs['memoryInGb'] * gpu_quantity,
        gpu_count=gpu_quantity,
        country_code=region,
        ports=ports,
        support_public_ip=True,
        docker_args=docker_args,
    )
else:
    new_instance = create_spot_pod(
        name=name,
        image_name=image_name,
        gpu_type_id=gpu_type,
        cloud_type=cloud_type,
        bid_per_gpu=bid_per_gpu,
        container_disk_in_gb=disk_size,
        volume_in_gb=disk_size,
        min_vcpu_count=4 * gpu_quantity,
        min_memory_in_gb=gpu_specs['memoryInGb'] * gpu_quantity,
        gpu_count=gpu_quantity,
        country_code=region,
        ports=ports,
        support_public_ip=True,
        docker_args=docker_args,
    )
@Michaelvll (Collaborator) commented:

Just wondering if we should keep only the GraphQL implementation for creating pods, i.e., use the same GraphQL implementation for on-demand as well, to make the code structure cleaner.

@weih1121 (Contributor, Author) commented on Dec 8, 2024:

The input fields are defined at https://graphql-spec.runpod.io/#definition-PodRentInterruptableInput and https://graphql-spec.runpod.io/#definition-PodFindAndDeployOnDemandInput.

Although some of the fields are not currently used in the runpod Python API or the new API, I suggest keeping both the create-spot-pod and create-pod implementations; this will make it easier to accommodate future changes and customer needs.
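For illustration, a minimal sketch of calling the spot-pod mutation directly. The field names follow the PodRentInterruptableInput spec linked above, but the query string, helper name, and payload details are assumptions, not this PR's code:

# Hedged sketch of the podRentInterruptable mutation; illustrative only,
# not the runpod SDK's or this PR's implementation.
import requests

def create_spot_pod_sketch(api_key: str, name: str, image_name: str,
                           gpu_type_id: str, bid_per_gpu: float,
                           gpu_count: int = 1) -> dict:
    # Field names per the linked PodRentInterruptableInput definition
    # (assumed); cloudType SECURE mirrors the on-demand path above.
    query = '''
    mutation {
      podRentInterruptable(input: {
        name: "%s", imageName: "%s", gpuTypeId: "%s",
        bidPerGpu: %.2f, gpuCount: %d, cloudType: SECURE
      }) { id desiredStatus }
    }
    ''' % (name, image_name, gpu_type_id, bid_per_gpu, gpu_count)
    resp = requests.post('https://api.runpod.io/graphql',
                         params={'api_key': api_key},
                         json={'query': query})
    resp.raise_for_status()
    return resp.json()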

@weih1121 (Contributor, Author) commented on Dec 8, 2024:

Updated a bit to make it cleaner. I can also submit a PR to RunPod to support spot instances.

@Michaelvll (Collaborator) replied:

Ah, I see. It should be fine then. Let's keep the current two-way implementation.

Resolved review thread: sky/resources.py
@weih1121 weih1121 changed the title [Feature] support spot pod [Feature] support spot pod for RunPod Dec 7, 2024
@weih1121 weih1121 changed the title [Feature] support spot pod for RunPod [Feature] support spot pod on RunPod Dec 7, 2024
@weih1121 weih1121 requested a review from Michaelvll December 8, 2024 15:15
instance_type = resources.instance_type
use_spot = resources.use_spot

hourly_cost = r.cloud.instance_type_to_hourly_cost(
@Michaelvll (Collaborator) commented:

Suggested change:
-hourly_cost = r.cloud.instance_type_to_hourly_cost(
+hourly_cost = self.instance_type_to_hourly_cost(

@weih1121 (Contributor, Author) replied:

Updated.

@Michaelvll (Collaborator) left a comment:

Thanks for adding the support @weih1121! It looks good to me.

-    public_key=config.node_config['PublicKey'])
+    public_key=config.node_config['PublicKey'],
+    preemptible=config.node_config['Preemptible'],
+    bid_per_gpu=config.node_config['bid_per_gpu'],
@Michaelvll (Collaborator) commented:

Should we keep the same naming style, i.e. BidPerGPU?

@weih1121 (Contributor, Author) replied:

Good catch! Updated!
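After the rename, the node config lookup presumably reads as follows (an assumption based on this thread, not the merged diff):

preemptible=config.node_config['Preemptible'],
bid_per_gpu=config.node_config['BidPerGPU'],  # renamed from 'bid_per_gpu'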

@Michaelvll (Collaborator) commented:

Thanks @weih1121! Merging it now.

@Michaelvll Michaelvll merged commit f60e385 into skypilot-org:master Dec 9, 2024
19 checks passed
cg505 added a commit to cg505/skypilot that referenced this pull request Dec 9, 2024
Fixes runpod import issues introduced in skypilot-org#4447.
@cg505 mentioned this pull request on Dec 9, 2024.
cg505 added a commit that referenced this pull request Dec 9, 2024
Fixes runpod import issues introduced in #4447.