
[robustness] cover some potential resource leakage cases #4443

Merged
11 commits merged into skypilot-org:master on Dec 8, 2024

Conversation

@cg505 cg505 (Collaborator) commented Dec 5, 2024

Includes the following changes:

if a newly-created cluster is missing from the cloud, wait before deleting
This protects against leaking instances that were just created and have not yet appeared in the cloud's listing (a minimal sketch of the idea follows the bullets below).
Addresses #4431; see that issue for more details.

  • This has been observed on AWS when the timing is exactly right: there is a <1s window during which the instances have not yet appeared.
  • We have not been able to reproduce this on any other cloud.
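A minimal sketch of the grace-period idea (the helper name, signature, and the 60-second window are illustrative assumptions, not SkyPilot's actual code):

import time

_NEW_CLUSTER_GRACE_PERIOD_SECONDS = 60  # assumed value, for illustration only


def should_keep_cluster_record(launched_at: float,
                               cloud_instance_count: int) -> bool:
    """Return True if the cluster's state-DB record should be kept for now.

    Even if the cloud reports zero instances, a cluster launched only moments
    ago may simply not have shown up in list calls yet (observed as a <1s
    window on AWS), so skip the cleanup until it is older.
    """
    if cloud_instance_count > 0:
        return True
    cluster_age_seconds = time.time() - launched_at
    return cluster_age_seconds < _NEW_CLUSTER_GRACE_PERIOD_SECONDS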

confirm cluster actually terminates before deleting from the db
Previously, we optimistically assumed that a successful termination request would always delete the cloud instances. This change validates that the instances are actually terminating or terminated before removing the cluster from our state database (a sketch of the check follows the bullets below). This protects against silent failures during termination, which are not expected but are hypothesized to be possible.

  • This potential leak has not been reproduced.
  • This does not add any additional delay to AWS or GCP. Azure takes around 10s to transition the instances, so cluster teardown now takes that much longer.
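A hypothetical sketch of the post-termination check: poll the cloud until every instance reports a terminating or terminated status, and only then remove the cluster from the state database. The constant names echo identifiers that appear in the review diff later on this page (spelling normalized); the helper, status names, and values are assumptions.

import time
from typing import Callable, Dict

_TEARDOWN_WAIT_MAX_ATTEMPTS = 10                # assumed values
_TEARDOWN_WAIT_BETWEEN_ATTEMPTS_SECONDS = 1


def wait_for_instances_terminating(
        query_statuses: Callable[[], Dict[str, str]]) -> None:
    """Raise if some instance still has not started terminating."""
    ok_statuses = {'shutting-down', 'terminated'}  # AWS-style examples
    not_terminating: Dict[str, str] = {}
    for attempt in range(_TEARDOWN_WAIT_MAX_ATTEMPTS):
        not_terminating = {
            node_id: status
            for node_id, status in query_statuses().items()
            if status not in ok_statuses
        }
        if not not_terminating:
            return  # Safe to delete the cluster record now.
        if attempt < _TEARDOWN_WAIT_MAX_ATTEMPTS - 1:
            time.sleep(_TEARDOWN_WAIT_BETWEEN_ATTEMPTS_SECONDS)
    raise RuntimeError(
        f'Instances did not start terminating: {not_terminating}')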

avoid deleting cluster data outside the primary provision loop
The main provision/failover/retry loop is quite complicated. There were some code paths that could throw an exception and delete the cluster from our state database without terminating the instances. This change attempts to clean up those paths.
The previous change (confirm cluster actually terminates before deleting from the db) also makes the provision loop safer. Since the instance termination and the state cleanup are in different places, this protects against the case where we don't successfully terminate the instances but still clean up our local state.

Part of #4410.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_minimal (on aws, azure, gcp)

@Michaelvll Michaelvll added this to the v0.7.1 milestone Dec 5, 2024
@cg505 cg505 changed the title Leak protection [robustness] cover some potential resource leakage cases Dec 5, 2024
@cg505 cg505 requested a review from Michaelvll December 5, 2024 23:16
@cg505 cg505 marked this pull request as ready for review December 5, 2024 23:16
@Michaelvll Michaelvll (Collaborator) left a comment

Thanks for the fix @cg505! My main comment is about where we do the wait for the terminating/stopping state. I am leaning towards having each cloud's specific instance.py wait for the state transition; see:

# TODO(suquark): Currently, the implementation of GCP and Azure will
# wait util the cluster is fully terminated, while other clouds just
# trigger the termination process (via http call) and then return.
# It's not clear that which behavior should be expected. We will not
# wait for the termination for now, since this is the default behavior
# of most cloud implementations (including AWS).

vs
# Check if the instance is actually stopped.
# GCP does not fully stop an instance even after
# the stop operation is finished.
for _ in range(constants.MAX_POLLS_STOP):
    handler_to_instances = _filter_instances(
        handler_to_instances.keys(),
        project_id,
        zone,
        label_filters,
        lambda handler: handler.NON_STOPPED_STATES,
        included_instances=all_instances,
    )
    if not handler_to_instances:
        break
    time.sleep(constants.POLL_INTERVAL)
else:
    raise RuntimeError(f'Maximum number of polls: '
                       f'{constants.MAX_POLLS_STOP} reached. '
                       f'Instance {all_instances} is still not in '
                       'STOPPED status.')

vs
# We don't wait for the instances to be terminated, as it can take a long
# time (same as what we did in ray's node_provider).

Inline review comments (all resolved) on: sky/backends/backend_utils.py, sky/backends/cloud_vm_ray_backend.py, sky/provision/azure/instance.py
@cg505 cg505 (Collaborator, Author) commented Dec 6, 2024

My main comment is about where we do the wait for the terminating/stopping state. I am leaning towards having each cloud's specific instance.py wait for the state transition

@Michaelvll A couple points:

  1. I guess the fact that this is inconsistent between clouds is all the more reason to make the implementation cloud-agnostic. Then we don't need to go through each cloud implementation to make sure it's doing "the right thing" (even though that would probably be good to have). Fixing this properly requires a good understanding of the exact circumstances that allow instance state transitions, and outside of the big 3 clouds this is often poorly documented or not documented at all.
  2. I think there is some confusion around waiting for the state change. Most clouds have an explicit stopping/terminating state before reaching the final stopped/terminated state. It's not clear whether the comments you linked are talking about the former or the latter. For our use case, we basically treat stopping/terminating as equivalent to stopped/terminated, as discussed in the other thread.
    For instance, looking at the GCP code links:
    • For termination, the comment says "We don't wait for the instances to be terminated", but in my testing the instances immediately transition to STOPPING, which we consider "good enough" for our checks (since it's at least clear that our termination request succeeded).
    • For stopping, the comment says "GCP does not fully stop an instance even after the stop operation is finished." and the code waits until one of STOPPED_STATES is reached (STOPPING_STATES is not good enough). But query_instances will consider STOPPING_STATES as ClusterStatus.STOPPED.

Ideally, we should define, document, and enforce the exact invariants for the various states and transitions, so that every cloud implementation has a clear specification of how it should behave. But given that we don't have that right now, I think this is the safest approach. (A toy sketch of the state mapping I have in mind is at the end of this comment.)

P.S. If we're still concerned about the "wait for transition" change, we could make it apply only to termination, since a termination leak is much worse than a stopping leak.
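To make the stopping-vs-stopped equivalence concrete, here is a purely illustrative mapping. ClusterStatus below is a toy stand-in for SkyPilot's enum, and the state names are examples rather than any cloud's real API values.

import enum


class ClusterStatus(enum.Enum):  # toy stand-in, not sky.ClusterStatus
    UP = 'UP'
    STOPPED = 'STOPPED'


_TRANSITIONING_OR_DONE = {'STOPPING', 'STOPPED', 'TERMINATING', 'TERMINATED'}


def to_cluster_status(instance_state: str) -> ClusterStatus:
    # Once the stop/terminate transition has begun, the request clearly took
    # effect, so treat the instance the same as one in the terminal state.
    if instance_state.upper() in _TRANSITIONING_OR_DONE:
        return ClusterStatus.STOPPED
    return ClusterStatus.UP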

cg505 added 2 commits December 6, 2024 16:18
get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.
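As a rough illustration of the usage_intervals idea in this commit message (the types and helper below are assumptions, and note this approach was later reverted; see the commit list at the bottom of the page): the cluster's total duration is accumulated over its recorded (start, end) intervals, so unlike launched_at it is not reset by re-running sky launch on an existing cluster.

import time
from typing import List, Optional, Tuple


def get_cluster_duration(
        usage_intervals: List[Tuple[float, Optional[float]]]) -> float:
    """Sum the cluster's usage intervals; end=None means 'still active'."""
    total = 0.0
    for start, end in usage_intervals:
        if end is None:
            end = time.time()
        total += end - start
    return total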
@cg505 cg505 requested a review from Michaelvll December 7, 2024 00:29
@Michaelvll Michaelvll (Collaborator) left a comment

Thanks @cg505! Once we have the two changes we discussed offline, this should be good to go.

Inline review comment (resolved) on sky/backends/cloud_vm_ray_backend.py:
Comment on lines 4154 to 4162
            raise RuntimeError(f'Instance {node_id} in unexpected '
                               f'state {node_status}.')
        break
    except RuntimeError:
        attempts += 1
        if attempts < _TEARDOWN_WAIT_MAX_ATTEMPTS:
            time.sleep(_TEARDOWN_WAIT_BETWEEN_ATTEMPS_SECONDS)
        else:
            raise
@Michaelvll Michaelvll (Collaborator):

Seems this throw and catch is not necessary? Should we just use a flag?

@cg505 cg505 (Collaborator, Author) commented Dec 7, 2024

It won't re-throw if we should retry. Not sure what you mean by using a flag. Never mind, I see.

@Michaelvll Michaelvll (Collaborator) commented Dec 7, 2024

Ooh, I mean that exception handling is more expensive, and raising here seems unnecessary since we can do something like the following:

while True:
    is_fail = False
    for node in nodes:
        if condition:
            is_fail = True
            break
    if is_fail:
        ...

@cg505 cg505 requested a review from Michaelvll December 7, 2024 01:40
@Michaelvll Michaelvll (Collaborator) left a comment

Thanks @cg505! Left two comments about the code comments. Otherwise, LGTM.

Inline review comments (resolved) on: sky/backends/backend_utils.py, sky/backends/cloud_vm_ray_backend.py
@cg505 cg505 enabled auto-merge (squash) December 8, 2024 01:05
@cg505 cg505 merged commit 84f09ef into skypilot-org:master Dec 8, 2024
19 checks passed
zpoint pushed a commit to zpoint/skypilot that referenced this pull request Dec 9, 2024
[robustness] cover some potential resource leakage cases (skypilot-org#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses skypilot-org#4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>