[robustness] cover some potential resource leakage cases #4443
Conversation
Thanks for the fix @cg505! My main comment is about where we wait for the terminating/stopping state. I am leaning towards having each cloud's specific instance.py wait for the state transition, see:
skypilot/sky/provision/aws/instance.py
Lines 635 to 640 in 6e50832
# TODO(suquark): Currently, the implementation of GCP and Azure will
# wait util the cluster is fully terminated, while other clouds just
# trigger the termination process (via http call) and then return.
# It's not clear that which behavior should be expected. We will not
# wait for the termination for now, since this is the default behavior
# of most cloud implementations (including AWS).
vs
skypilot/sky/provision/gcp/instance.py
Lines 491 to 511 in 6e50832
# Check if the instance is actually stopped.
# GCP does not fully stop an instance even after
# the stop operation is finished.
for _ in range(constants.MAX_POLLS_STOP):
    handler_to_instances = _filter_instances(
        handler_to_instances.keys(),
        project_id,
        zone,
        label_filters,
        lambda handler: handler.NON_STOPPED_STATES,
        included_instances=all_instances,
    )
    if not handler_to_instances:
        break
    time.sleep(constants.POLL_INTERVAL)
else:
    raise RuntimeError(f'Maximum number of polls: '
                       f'{constants.MAX_POLLS_STOP} reached. '
                       f'Instance {all_instances} is still not in '
                       'STOPPED status.')
vs
skypilot/sky/provision/gcp/instance.py
Lines 567 to 570 in 6e50832
# We don't wait for the instances to be terminated, as it can take a long
# time (same as what we did in ray's node_provider).
Co-authored-by: Zhanghao Wu <[email protected]>
@Michaelvll A couple points:
Ideally, we should define, document, and enforce the exact invariants for the various states and transitions, so that every cloud has a clear indication of how this should work. But given we don't have that right now, I think this is the safest way.
P.S. If we're still concerned about the "wait for transition" change, we could make it apply only to termination, since a termination leak is much worse than a stop leak.
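To make the first point concrete, here is a minimal sketch of what a documented, enforceable post-teardown invariant could look like. All of the names, states, and the table below are hypothetical illustrations, not the current SkyPilot API:

import enum
from typing import Dict


class TeardownOp(enum.Enum):
    STOP = 'stop'
    TERMINATE = 'terminate'


# Hypothetical invariant table: after a teardown call returns, every
# instance must already be in (or moving toward) one of these states.
_EXPECTED_POST_OP_STATES = {
    TeardownOp.STOP: {'STOPPING', 'STOPPED'},
    TeardownOp.TERMINATE: {'SHUTTING_DOWN', 'TERMINATED'},
}


def check_post_op_invariant(op: TeardownOp,
                            statuses: Dict[str, str]) -> None:
    """Enforce the same post-teardown invariant uniformly across clouds."""
    allowed = _EXPECTED_POST_OP_STATES[op]
    bad = {node: s for node, s in statuses.items() if s not in allowed}
    if bad:
        raise RuntimeError(
            f'{op.value} left instances in unexpected states: {bad}')

With something like this, each cloud's instance.py could be checked against one shared rule, and the check could be enabled for TERMINATE only if the stop path should keep its fire-and-forget behavior.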
get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate way to check whether a cluster is newly created.
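For reference, a duration derived from usage intervals could be computed along these lines. This is illustrative only; the real get_cluster_duration may differ:

import time
from typing import List, Optional, Tuple


def total_cluster_duration(
        usage_intervals: List[Tuple[int, Optional[int]]]) -> int:
    """Sum all (start, end) usage intervals; an open interval is ongoing."""
    duration = 0
    for start, end in usage_intervals:
        if end is None:
            end = int(time.time())
        duration += end - start
    return duration

A cluster that has been up for a while keeps a non-trivial total here even if a later sky launch resets launched_at.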
This reverts commit aa6d2e9.
Thanks @cg505! Once we have the two changes we mentioned offline, it should be good to go.
sky/backends/cloud_vm_ray_backend.py
Outdated
            raise RuntimeError(f'Instance {node_id} in unexpected '
                               f'state {node_status}.')
    break
except RuntimeError:
    attempts += 1
    if attempts < _TEARDOWN_WAIT_MAX_ATTEMPTS:
        time.sleep(_TEARDOWN_WAIT_BETWEEN_ATTEMPS_SECONDS)
    else:
        raise
It seems this throw-and-catch is not necessary. Should we just use a flag?
It won't re-throw if we should retry. Not sure what you mean by using a flag. Never mind, I see.
Ooo, I mean exception handling is more expensive, and here it seems unnecessary to raise, since we can do something like the following:
while True:
    is_fail = False
    for node in nodes:
        if condition:
            is_fail = True
            break
    if is_fail:
        ...
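Filling that flag-based structure out against the snippet above might look roughly like this. It is only a sketch: the helper, its arguments, and the constant values are assumptions, not the code that landed in the PR:

import time

_TEARDOWN_WAIT_MAX_ATTEMPTS = 10            # assumed value
_TEARDOWN_WAIT_BETWEEN_ATTEMPS_SECONDS = 5  # assumed value


def wait_for_expected_statuses(query_statuses, node_ids, expected_statuses):
    """Retry with a flag/early return instead of raising and catching
    RuntimeError inside the loop."""
    attempts = 0
    while True:
        unexpected = {
            node_id: status
            for node_id, status in query_statuses(node_ids).items()
            if status not in expected_statuses
        }
        if not unexpected:
            return
        attempts += 1
        if attempts >= _TEARDOWN_WAIT_MAX_ATTEMPTS:
            raise RuntimeError(
                f'Instances in unexpected states: {unexpected}.')
        time.sleep(_TEARDOWN_WAIT_BETWEEN_ATTEMPS_SECONDS)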
Thanks @cg505! Left two comments about the comments in the code. Otherwise, LGTM.
…g#4443)
* if a newly-created cluster is missing from the cloud, wait before deleting
  Addresses skypilot-org#4431.
* confirm cluster actually terminates before deleting from the db
* avoid deleting cluster data outside the primary provision loop
* tweaks
* Apply suggestions from code review
  Co-authored-by: Zhanghao Wu <[email protected]>
* use usage_intervals for new cluster detection
  get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check.
* fix terminating/stopping state for Lambda and Paperspace
* Revert "use usage_intervals for new cluster detection"
  This reverts commit aa6d2e9.
* check cloud.STATUS_VERSION before calling query_instances
* avoid try/catch when querying instances
* update comments
---------
Co-authored-by: Zhanghao Wu <[email protected]>
Includes the following changes:
if a newly-created cluster is missing from the cloud, wait before deleting
This protects against leaking instances that have not appeared in the cloud yet because they were only just created.
Addresses #4431, see that issue for more details.
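A rough sketch of that grace period follows; the constant value and helper names are made up for illustration and do not reflect the actual implementation:

import time

_NEW_CLUSTER_GRACE_PERIOD_SECONDS = 60   # assumed value


def missing_after_grace_period(list_instances, cluster_age_seconds):
    """Treat 'no instances found' as authoritative only for clusters old
    enough to have become visible in the cloud's API."""
    if list_instances():
        return False
    remaining = _NEW_CLUSTER_GRACE_PERIOD_SECONDS - cluster_age_seconds
    if remaining > 0:
        # Freshly created instances may simply not be listable yet; wait
        # and re-check instead of deleting the cluster record right away.
        time.sleep(remaining)
        return not list_instances()
    return True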
confirm cluster actually terminates before deleting from the db
Previously, we optimistically assumed that a successful termination request would always delete the cloud instances. This change validates that the instances are actually terminating or terminated before removing the cluster from our state database. This protects against silent failures in termination, which are not expected but are hypothesized to be possible.
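The ordering this enforces is roughly the following. All three callables are placeholders for whatever the backend provides; this is only a sketch, not the actual code:

def teardown_cluster(terminate_instances, instances_confirmed_down,
                     remove_cluster_record):
    """Remove the local cluster record only after the cloud confirms the
    instances are actually going away."""
    terminate_instances()
    if not instances_confirmed_down():
        # Keep the record so any leaked instances stay visible to
        # `sky status --refresh` and can be cleaned up later.
        raise RuntimeError('Termination request succeeded, but instances '
                           'are still up; not removing the cluster record.')
    remove_cluster_record()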
avoid deleting cluster data outside the primary provision loop
The main provision/failover/retry loop is quite complicated. There were some code paths that could throw an exception and delete the cluster from our state database without terminating the instances; this change attempts to clean up those paths.
The previous change (confirm cluster actually terminates before deleting from the db) also makes the provision loop safer. Since the instance termination and the state cleanup are in different places, this protects against the case where we don't successfully terminate the instances but still clean up our local state.
part of #4410
Tested (run the relevant ones):
bash format.sh
sky launch --gpus H100:8 for longer failover checking
pytest tests/test_smoke.py::test_minimal (on aws, azure, gcp)