[Feature] support spot pod on RunPod #4447
Conversation
sky/provision/runpod/utils.py
Outdated
if preemptible is None or not preemptible:
    # On-demand pod via the RunPod Python API.
    new_instance = runpod.runpod.create_pod(
        name=name,
        image_name=image_name,
        gpu_type_id=gpu_type,
        cloud_type=cloud_type,
        container_disk_in_gb=disk_size,
        min_vcpu_count=4 * gpu_quantity,
        min_memory_in_gb=gpu_specs['memoryInGb'] * gpu_quantity,
        gpu_count=gpu_quantity,
        country_code=region,
        ports=ports,
        support_public_ip=True,
        docker_args=docker_args,
    )
else:
    # Spot (interruptible) pod with a per-GPU bid.
    new_instance = create_spot_pod(
        name=name,
        image_name=image_name,
        gpu_type_id=gpu_type,
        cloud_type=cloud_type,
        bid_per_gpu=bid_per_gpu,
        container_disk_in_gb=disk_size,
        volume_in_gb=disk_size,
        min_vcpu_count=4 * gpu_quantity,
        min_memory_in_gb=gpu_specs['memoryInGb'] * gpu_quantity,
        gpu_count=gpu_quantity,
        country_code=region,
        ports=ports,
        support_public_ip=True,
        docker_args=docker_args,
    )
Just wondering if we should keep only the GraphQL implementation for creating pods, i.e. use the same GraphQL implementation for on-demand as well, to make the code structure cleaner.
There are input fields defined at https://graphql-spec.runpod.io/#definition-PodRentInterruptableInput and https://graphql-spec.runpod.io/#definition-PodFindAndDeployOnDemandInput.
Although some of those fields are not currently used by the RunPod Python API or the new API, I suggest keeping both the create_spot_pod and create_pod implementations. This will make it easier to accommodate future changes and customer needs.
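For illustration, here is a rough sketch of what a GraphQL-based spot pod creation could look like, built around the PodRentInterruptableInput fields from the linked spec. The mutation name, endpoint, field spellings, and helper below are assumptions for the sketch, not the code in this PR.

import requests

RUNPOD_GRAPHQL_URL = 'https://api.runpod.io/graphql'  # assumed endpoint


def create_spot_pod_graphql(api_key: str, name: str, image_name: str,
                            gpu_type_id: str, bid_per_gpu: float,
                            gpu_count: int = 1) -> dict:
    """Illustrative only: rent an interruptible (spot) pod via GraphQL.

    Mutation and field names follow the public spec for
    PodRentInterruptableInput; they are not copied from this PR.
    """
    mutation = '''
    mutation RentSpotPod($input: PodRentInterruptableInput!) {
      podRentInterruptable(input: $input) {
        id
        desiredStatus
      }
    }
    '''
    variables = {
        'input': {
            'name': name,
            'imageName': image_name,
            'gpuTypeId': gpu_type_id,
            'bidPerGpu': bid_per_gpu,
            'gpuCount': gpu_count,
        }
    }
    resp = requests.post(RUNPOD_GRAPHQL_URL,
                         params={'api_key': api_key},
                         json={'query': mutation, 'variables': variables},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()['data']['podRentInterruptable']

The point of the sketch is that a GraphQL path exposes every input field in the spec, whereas the Python client only surfaces a subset; that is the trade-off being discussed here.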
Updated a bit to make it cleaner.
Also, I can submit a PR to RunPod to support spot instances.
Ahh, I see. It should be fine then. Let's keep the current two implementations.
sky/clouds/runpod.py
Outdated
instance_type = resources.instance_type
use_spot = resources.use_spot

hourly_cost = r.cloud.instance_type_to_hourly_cost(
Suggested change:
- hourly_cost = r.cloud.instance_type_to_hourly_cost(
+ hourly_cost = self.instance_type_to_hourly_cost(
Updated.
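For context, a minimal sketch of the pattern the suggestion points at: inside the RunPod cloud class the hourly cost can be queried directly on self instead of reaching through r.cloud. The surrounding method and variable names here are assumptions, not the actual code in sky/clouds/runpod.py.

# Sketch only; the enclosing method is hypothetical.
def _estimate_hourly_cost(self, resources) -> float:
    instance_type = resources.instance_type
    use_spot = resources.use_spot
    # This code already lives in the RunPod cloud class, so call the
    # method on self rather than going through r.cloud.
    hourly_cost = self.instance_type_to_hourly_cost(
        instance_type,
        use_spot=use_spot,
        region=None,
        zone=None,
    )
    return hourly_cost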
Thanks for adding the support @weih1121! It looks good to me.
sky/provision/runpod/instance.py
Outdated
- public_key=config.node_config['PublicKey'])
+ public_key=config.node_config['PublicKey'],
+ preemptible=config.node_config['Preemptible'],
+ bid_per_gpu=config.node_config['bid_per_gpu'],
Should we keep the same naming style, i.e. BidPerGPU?
Good catch! Updated!
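For illustration, a small sketch of what the node config looks like after the rename, so the bid key matches the PascalCase style of the other provider keys. The concrete values and the final read below are hypothetical.

# Hypothetical node_config excerpt; only the key naming style matters here.
node_config = {
    'PublicKey': 'ssh-ed25519 AAAA... user@host',
    'Preemptible': True,
    'BidPerGPU': 0.30,  # example maximum hourly bid per GPU, in USD
}

# The provisioner can then read the bid with the same convention as the
# other keys:
bid_per_gpu = node_config['BidPerGPU']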
Thanks @weih1121! Merging it now.
Fixes runpod import issues introduced in #4447.
Changes
Tested launching an RTXA6000x1 spot pod and running a hello-world task.
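Roughly how this manual test could be reproduced from the SkyPilot Python API; the cluster name and the exact task are made up, so treat this as a sketch rather than the commands actually used.

import sky

# Hypothetical reproduction: one RTXA6000 spot pod on RunPod that prints hello.
task = sky.Task(run='echo hello')
task.set_resources(
    sky.Resources(
        cloud=sky.clouds.RunPod(),
        accelerators='RTXA6000:1',
        use_spot=True,  # exercises the new spot/interruptible pod path
    ))

sky.launch(task, cluster_name='runpod-spot-test')  # provision + run
sky.down('runpod-spot-test')                       # tear the pod down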
Logs:
Launched pod with instance id o019t372yrbcfp:
Pod launched:
Job Finished:
Down cloud:
Another concern / TODO:
Not sure what will happen if the pod is interrupted during provisioning; I will add the test result.
On-demand pod provision and down test.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh