Commit
[Distributed] Allow the placement group more time to wait for resources to be ready (vllm-project#11138)

Signed-off-by: Jiaxin Shan <[email protected]>
Jeffwan authored Dec 13, 2024
1 parent 0a56bcc commit 0d8451c
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions vllm/executor/ray_utils.py
@@ -277,10 +277,14 @@ def initialize_ray_cluster(
                 f"Total number of devices: {device_bundles}.")
     else:
         num_devices_in_cluster = ray.cluster_resources().get(device_str, 0)
+        # Log a warning message and delay resource allocation failure response.
+        # Avoid immediate rejection to allow user-initiated placement group
+        # created and wait cluster to be ready
         if parallel_config.world_size > num_devices_in_cluster:
-            raise ValueError(
-                f"The number of required {device_str}s exceeds the total "
-                f"number of available {device_str}s in the placement group.")
+            logger.warning(
+                "The number of required %ss exceeds the total "
+                "number of available %ss in the placement group.", device_str,
+                device_str)
         # Create a new placement group
         placement_group_specs: List[Dict[str, float]] = ([{
             device_str: 1.0
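The hunk above only downgrades the early `ValueError` to a warning. Any actual waiting for the cluster happens in code that this diff does not show. As a rough illustration of the "warn now, fail later" pattern the new comment describes, here is a minimal sketch. The helper `wait_for_devices` and its parameters (`get_available`, `required`, `timeout_s`, `poll_s`) are hypothetical names chosen for this example and are not part of vLLM or Ray.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def wait_for_devices(
    get_available: Callable[[], float],
    required: int,
    timeout_s: float = 30.0,
    poll_s: float = 1.0,
) -> bool:
    """Hypothetical helper: poll until enough devices are visible.

    Returns True once `get_available()` reports at least `required`
    devices, or False if `timeout_s` elapses first. A warning is logged
    once on the first shortfall instead of raising immediately.
    """
    deadline = time.monotonic() + timeout_s
    warned = False
    while True:
        available = get_available()
        if available >= required:
            return True
        if not warned:
            # Warn once rather than rejecting the request right away.
            logger.warning(
                "Need %d devices but only %s are visible; "
                "waiting up to %.1f s for the cluster to be ready.",
                required, available, timeout_s)
            warned = True
        if time.monotonic() >= deadline:
            # The caller decides how to report the failure.
            return False
        time.sleep(poll_s)
```

With Ray, the callback could plausibly be something like `lambda: ray.cluster_resources().get(device_str, 0)`, mirroring the lookup in the diff. Returning a boolean rather than raising keeps the final failure decision with the caller.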
