[Azure] Support fractional A10 instance types #3877

Merged
merged 29 commits into from
Oct 26, 2024
Changes from 1 commit
Commits
29 commits
de35be4
fix
cblmemo Aug 26, 2024
39d6c15
change catalog to float gpu num
cblmemo Aug 27, 2024
7324504
support print float point gpu in sky launch. TODO: test if the ray de…
cblmemo Aug 27, 2024
347ad62
fix unittest
cblmemo Aug 27, 2024
71af06e
format
cblmemo Aug 27, 2024
d419442
patch ray resources to ceil value
cblmemo Aug 27, 2024
f529689
support launch from --gpus A10
cblmemo Aug 28, 2024
2031a50
only allow strictly match fractional gpu counts
cblmemo Aug 28, 2024
07e47d6
address comment
cblmemo Sep 3, 2024
639c686
Merge remote-tracking branch 'origin/master' into support-fractional-a10
cblmemo Sep 3, 2024
4c45ff7
Merge remote-tracking branch 'origin/master' into support-fractional-a10
cblmemo Sep 6, 2024
84d6d0d
change back condition
cblmemo Sep 7, 2024
eca7033
fix
cblmemo Sep 7, 2024
0055fc1
apply suggestions from code review
cblmemo Sep 11, 2024
9652119
fix
cblmemo Sep 11, 2024
a5c5b15
Update sky/backends/cloud_vm_ray_backend.py
cblmemo Sep 11, 2024
d2cff96
format
cblmemo Sep 11, 2024
e8e9954
fix display of fuzzy candidates
cblmemo Sep 11, 2024
db607fa
fix precision issue
cblmemo Sep 12, 2024
e98ecdc
fix num gpu required
cblmemo Sep 12, 2024
8ada7a2
refactor in check_resources_fit_cluster
cblmemo Oct 11, 2024
f6c9fad
change type annotation of acc_count
cblmemo Oct 11, 2024
a1f59a0
enable fuzzy fp acc count
cblmemo Oct 11, 2024
bcbf5ec
Merge remote-tracking branch 'origin/master' into support-fractional-a10
cblmemo Oct 11, 2024
3200d39
fix k8s
cblmemo Oct 11, 2024
6e41da5
Merge remote-tracking branch 'origin/master' into support-fractional-a10
cblmemo Oct 25, 2024
fb3049f
Update sky/clouds/service_catalog/common.py
cblmemo Oct 25, 2024
82d442f
fix integer gpus
cblmemo Oct 25, 2024
84d146c
format
cblmemo Oct 25, 2024
16 changes: 15 additions & 1 deletion sky/clouds/service_catalog/azure_catalog.py
@@ -42,6 +42,15 @@
_DEFAULT_NUM_VCPUS = 8
_DEFAULT_MEMORY_CPU_RATIO = 4

# Some A10 instance types only contain a fraction of a GPU. We temporarily
# filter them out here to avoid using them as a whole A10 GPU.
# TODO(zhwu,tian): support fractional GPUs, which can be done on
# kubernetes as well.
# Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/nva10v5-series
_FILTERED_A10_INSTANCE_TYPES = [
    f'Standard_NV{vcpu}ads_A10_v5' for vcpu in [6, 12, 18]
]


def instance_type_exists(instance_type: str) -> bool:
    return common.instance_type_exists_impl(_df, instance_type)
@@ -138,7 +147,12 @@ def get_instance_type_for_accelerator(
    if zone is not None:
        with ux_utils.print_exception_no_traceback():
            raise ValueError('Azure does not support zones.')
    return common.get_instance_type_for_accelerator_impl(df=_df,

    # Filter out instance types that only contain a fraction of a GPU.
    df_filtered = _df.loc[~_df['InstanceType'].isin(_FILTERED_A10_INSTANCE_TYPES
                                                   )]
Collaborator

Instead of excluding the instances directly, can we print a hint like the one shown when we specify sky launch --gpus L4:

Multiple AWS instances satisfy L4:1. The cheapest AWS(g6.xlarge, {'L4': 1}) is considered among:
I 08-27 06:09:54 optimizer.py:922] ['g6.xlarge', 'g6.2xlarge', 'g6.4xlarge', 'gr6.4xlarge', 'g6.8xlarge', 'gr6.8xlarge', 'g6.16xlarge'].

Collaborator Author

This hint is used to print instances with the same accelerator count. I'm wondering whether we should do the same for fractional GPUs.
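
To make the suggestion in this thread concrete, here is a rough sketch of a hint formatter modeled on the quoted optimizer message. It is not SkyPilot's implementation: `format_candidates_hint` is a hypothetical function, the prices are made up, and it assumes `prices` is non-empty.

```python
from typing import Dict, List


def format_candidates_hint(cloud: str, acc_name: str, acc_count: float,
                           prices: Dict[str, float]) -> str:
    """Hypothetical formatter modeled on the quoted optimizer hint.

    `prices` maps instance type -> hourly price and must be non-empty.
    """
    # Cheapest first, so the chosen instance leads the candidate list.
    candidates: List[str] = sorted(prices, key=lambda name: prices[name])
    cheapest = candidates[0]
    # `:g` renders 1 or 1.0 as "1" and 0.5 as "0.5", so integer counts
    # look the same as before while fractional counts stay readable.
    count_str = f'{acc_count:g}'
    return (f'Multiple {cloud} instances satisfy {acc_name}:{count_str}. '
            f'The cheapest {cloud}({cheapest}, '
            f'{{{acc_name!r}: {count_str}}}) is considered among:\n'
            f'{candidates}.')


# Made-up price, for illustration only.
print(format_candidates_hint('Azure', 'A10', 0.5,
                             {'Standard_NV18ads_A10_v5': 1.6}))
```

The only change a fractional count would need in a message of this shape is the number formatting, which is why the sketch routes the count through `:g`.
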


    return common.get_instance_type_for_accelerator_impl(df=df_filtered,
                                                         acc_name=acc_name,
                                                         acc_count=acc_count,
                                                         cpus=cpus,
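
The filtering step in this hunk relies on a negated `isin` mask. The sketch below shows the same pattern on a toy DataFrame; `toy_df` and its rows are illustrative stand-ins for the real catalog `_df`, which is assumed to have an `InstanceType` column as the diff suggests.

```python
import pandas as pd

_FILTERED_A10_INSTANCE_TYPES = [
    f'Standard_NV{vcpu}ads_A10_v5' for vcpu in [6, 12, 18]
]

# Toy stand-in for the catalog DataFrame `_df`; the rows are illustrative.
toy_df = pd.DataFrame({
    'InstanceType': [
        'Standard_NV6ads_A10_v5',
        'Standard_NV18ads_A10_v5',
        'Standard_NV36ads_A10_v5',
    ],
    'AcceleratorName': ['A10', 'A10', 'A10'],
})

# isin() builds a boolean mask, `~` inverts it, and .loc keeps only the
# rows whose InstanceType is NOT on the filter list.
mask = toy_df['InstanceType'].isin(_FILTERED_A10_INSTANCE_TYPES)
df_filtered = toy_df.loc[~mask]

print(df_filtered['InstanceType'].tolist())
```

Because the mask is applied at lookup time rather than when the catalog is generated, the fractional instance types can stay in the catalog data, which matches the matching removal from `fetch_azure.py` in this diff.
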
13 changes: 0 additions & 13 deletions sky/clouds/service_catalog/data_fetchers/fetch_azure.py
@@ -93,15 +93,6 @@ def get_regions() -> List[str]:
# We have to manually remove it.
DEPRECATED_FAMILIES = ['standardNVSv2Family']

# Some A10 instance types only contain a fraction of a GPU. We temporarily
# filter them out here to avoid using them as a whole A10 GPU.
# TODO(zhwu,tian): support fractional GPUs, which can be done on
# kubernetes as well.
# Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/nva10v5-series
FILTERED_A10_INSTANCE_TYPES = [
    f'Standard_NV{vcpu}ads_A10_v5' for vcpu in [6, 12, 18]
]

USEFUL_COLUMNS = [
    'InstanceType', 'AcceleratorName', 'AcceleratorCount', 'vCPUs', 'MemoryGiB',
    'GpuInfo', 'Price', 'SpotPrice', 'Region', 'Generation'
@@ -299,10 +290,6 @@ def get_additional_columns(row):
    after_drop_len = len(df_ret)
    print(f'Dropped {before_drop_len - after_drop_len} duplicated rows')

    # Filter out instance types that only contain a fraction of a GPU.
    df_ret = df_ret.loc[~df_ret['InstanceType'].isin(FILTERED_A10_INSTANCE_TYPES
                                                    )]

    # Filter out deprecated families
    df_ret = df_ret.loc[~df_ret['family'].isin(DEPRECATED_FAMILIES)]
    df_ret = df_ret[USEFUL_COLUMNS]