-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[k8s] support to use custom gpu resource name if it's not nvidia.com/gpu #4337
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks and welcome to SkyPilot @nkwangleiGIT!
sky/utils/common_utils.py
Outdated
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): | ||
"""Get the GPU resource name. | ||
The function first checks for an environment variable. | ||
If defined, it uses its value; otherwise, it returns the default value. | ||
Args: | ||
name (str): Default GPU resource name, default is "nvidia.com/gpu". | ||
Returns: | ||
str: The selected GPU resource name. | ||
""" | ||
# retrieve GPU resource name from environment variable, | ||
# can be nvidia.com/gpu-h100 if it's customized by the device plugin | ||
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_NAME') | ||
|
||
# If the environment variable is not defined, return the default name | ||
if custom_name is None: | ||
return name | ||
|
||
return custom_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to sky.provision.kubernetes.utils
since only k8s relies on this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to sky.provision.kubernetes.utils
sky/utils/common_utils.py
Outdated
# retrieve GPU resource name from environment variable, | ||
# can be nvidia.com/gpu-h100 if it's customized by the device plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit
# retrieve GPU resource name from environment variable, | |
# can be nvidia.com/gpu-h100 if it's customized by the device plugin | |
# Retrieve GPU resource name from environment variable, if set. | |
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc. |
sky/utils/common_utils.py
Outdated
|
||
|
||
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): | ||
"""Get the GPU resource name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Get the GPU resource name. | |
"""Get the GPU resource name to use in kubernetes. |
sky/utils/common_utils.py
Outdated
@@ -697,3 +697,26 @@ def truncate_long_string(s: str, max_length: int = 35) -> str: | |||
if len(prefix) < max_length: | |||
prefix += s[len(prefix):max_length] | |||
return prefix + '...' | |||
|
|||
|
|||
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having name as a arg, we probably want to move the hardcoded string to the top of kubernetes_utils.py as DEFAULT_GPU_RESOURCE = 'nvidia.com/gpu'
and use that in the method if the envvar does not exist.
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'): | |
def get_gpu_resource_name(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sky/provision/kubernetes/utils.py
Outdated
total={ | ||
common_utils.get_gpu_resource_name(): int(accelerator_count) | ||
}, | ||
free={ | ||
common_utils.get_gpu_resource_name(): | ||
int(accelerators_available) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaner to invoke once:
total={ | |
common_utils.get_gpu_resource_name(): int(accelerator_count) | |
}, | |
free={ | |
common_utils.get_gpu_resource_name(): | |
int(accelerators_available) | |
}) | |
gpu_resource = common_utils.get_gpu_resource_name() | |
total={gpu_resource: int(accelerator_count)}, | |
free={gpu_resource: int(accelerators_available) | |
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
18f7aac
to
8538f07
Compare
Hi @nkwangleiGIT - this PR was close to being merged. Would you like to reopen it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nkwangleiGIT! Mostly looks good, left some minor comments. Should be good to go soon!
@@ -2233,10 +2233,11 @@ def get_node_accelerator_count(attribute_dict: dict) -> int: | |||
Number of accelerators allocated or available from the node. If no | |||
resource is found, it returns 0. | |||
""" | |||
assert not (GPU_RESOURCE_KEY in attribute_dict and | |||
gpu_resource_name = get_gpu_resource_key() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
gpu_resource_name = get_gpu_resource_key() | |
gpu_resource_key = get_gpu_resource_key() |
sky/provision/kubernetes/utils.py
Outdated
# Retrieve GPU resource name from environment variable, if set. | ||
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc. | ||
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_KEY') | ||
|
||
# If the environment variable is not defined, return the default name | ||
if custom_name is None: | ||
return GPU_RESOURCE_KEY | ||
|
||
return custom_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can simplify logic to:
# Retrieve GPU resource name from environment variable, if set. | |
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc. | |
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_KEY') | |
# If the environment variable is not defined, return the default name | |
if custom_name is None: | |
return GPU_RESOURCE_KEY | |
return custom_name | |
# Retrieve GPU resource name from environment variable, if set. Else use default. | |
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc. | |
return os.getenv('CUSTOM_GPU_RESOURCE_KEY', default = GPU_RESOURCE_KEY) |
@@ -2395,3 +2396,23 @@ def process_skypilot_pods( | |||
num_pods = len(cluster.pods) | |||
cluster.resources_str = f'{num_pods}x {cluster.resources}' | |||
return list(clusters.values()), jobs_controllers, serve_controllers | |||
|
|||
|
|||
def get_gpu_resource_key(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also update L194 in sky/provision/kubernetes/instance.py to use get_gpu_resource_key() to make sure errors are handled correctly?
skypilot/sky/provision/kubernetes/instance.py
Line 194 in c1726ae
elif (('Insufficient nvidia.com/gpu' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Signed-off-by: nkwangleiGIT <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @nkwangleiGIT - lgtm. Should be good to merge after linter passes.
Hey @nkwangleiGIT, quick follow up - you can run the linter with |
c070a45
to
2a59cf8
Compare
sorry for late, just fetch the latest code, fix format and lint issue, and push again now |
Signed-off-by: nkwangleiGIT <[email protected]>
We're using customized gpu resource name when using the device plugin, so we have the following GPUs in the node capability:
So we have to use the resource names above when launch to local K8S, such as
So this PR will support to use CUSTOM_GPU_RESOURCE_NAME from environment variable to overwrite the default nvidia.com/gpu in the resources.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh