
[robustness] cover some potential resource leakage cases #4443

Merged
11 commits merged into skypilot-org:master on Dec 8, 2024

Conversation

@cg505 cg505 (Collaborator) commented Dec 5, 2024

Includes the following changes:

if a newly-created cluster is missing from the cloud, wait before deleting
This protects against leaking instances that were just created and have not yet appeared in the cloud's listing (a minimal sketch of the idea follows the bullets below).
Addresses #4431; see that issue for more details.

  • This has been observed on AWS when the timing is exactly right: there is a <1s window during which the instances have not yet appeared.
  • We have not been able to reproduce this on any other cloud.
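A minimal sketch of the grace-period idea (the helper name, signature, and the 60-second window are illustrative assumptions, not SkyPilot's actual code):

import time

_NEW_CLUSTER_GRACE_PERIOD_SECONDS = 60  # assumed value, for illustration only


def should_keep_cluster_record(launched_at: float,
                               cloud_instance_count: int) -> bool:
    """Return True if the cluster's state-DB record should be kept for now.

    Even if the cloud reports zero instances, a cluster launched only moments
    ago may simply not have shown up in list calls yet (observed as a <1s
    window on AWS), so skip the cleanup until it is older.
    """
    if cloud_instance_count > 0:
        return True
    cluster_age_seconds = time.time() - launched_at
    return cluster_age_seconds < _NEW_CLUSTER_GRACE_PERIOD_SECONDS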

confirm cluster actually terminates before deleting from the db
Previously, we optimistically assumed that a successful termination request would always delete the cloud instances. This change validates that the instances are actually terminating or terminated before removing the cluster from our state database (a sketch of the check follows the bullets below). This protects against silent failures during termination, which are not expected but are hypothesized to be possible.

  • This potential leak has not been reproduced.
  • This does not add any additional delay to AWS or GCP. Azure takes around 10s to transition the instances, so cluster teardown now takes that much longer.
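A hypothetical sketch of the post-termination check: poll the cloud until every instance reports a terminating or terminated status, and only then remove the cluster from the state database. The constant names echo identifiers that appear in the review diff later on this page (spelling normalized); the helper, status names, and values are assumptions.

import time
from typing import Callable, Dict

_TEARDOWN_WAIT_MAX_ATTEMPTS = 10                # assumed values
_TEARDOWN_WAIT_BETWEEN_ATTEMPTS_SECONDS = 1


def wait_for_instances_terminating(
        query_statuses: Callable[[], Dict[str, str]]) -> None:
    """Raise if some instance still has not started terminating."""
    ok_statuses = {'shutting-down', 'terminated'}  # AWS-style examples
    not_terminating: Dict[str, str] = {}
    for attempt in range(_TEARDOWN_WAIT_MAX_ATTEMPTS):
        not_terminating = {
            node_id: status
            for node_id, status in query_statuses().items()
            if status not in ok_statuses
        }
        if not not_terminating:
            return  # Safe to delete the cluster record now.
        if attempt < _TEARDOWN_WAIT_MAX_ATTEMPTS - 1:
            time.sleep(_TEARDOWN_WAIT_BETWEEN_ATTEMPTS_SECONDS)
    raise RuntimeError(
        f'Instances did not start terminating: {not_terminating}')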

avoid deleting cluster data outside the primary provision loop
The main provision/failover/retry loop is quite complicated. There were some code paths that could throw an exception and delete the cluster from our state database without terminating the instances. This change attempts to clean up those paths.
The previous change (confirm cluster actually terminates before deleting from the db) also makes the provision loop safer. Since the instance termination and the state cleanup are in different places, this protects against the case where we don't successfully terminate the instances but still clean up our local state.

Part of #4410.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_minimal (on aws, azure, gcp)

@Michaelvll Michaelvll added this to the v0.7.1 milestone Dec 5, 2024
@cg505 cg505 changed the title Leak protection [robustness] cover some potential resource leakage cases Dec 5, 2024
@cg505 cg505 requested a review from Michaelvll December 5, 2024 23:16
@cg505 cg505 marked this pull request as ready for review December 5, 2024 23:16
@Michaelvll Michaelvll (Collaborator) left a comment

Thanks for the fix @cg505! My main comment is about where we do the wait for the terminating/stopping state. I am leaning towards having each cloud's specific instance.py wait for the state transition; see:

# TODO(suquark): Currently, the implementation of GCP and Azure will
# wait util the cluster is fully terminated, while other clouds just
# trigger the termination process (via http call) and then return.
# It's not clear that which behavior should be expected. We will not
# wait for the termination for now, since this is the default behavior
# of most cloud implementations (including AWS).

vs
# Check if the instance is actually stopped.
# GCP does not fully stop an instance even after
# the stop operation is finished.
for _ in range(constants.MAX_POLLS_STOP):
    handler_to_instances = _filter_instances(
        handler_to_instances.keys(),
        project_id,
        zone,
        label_filters,
        lambda handler: handler.NON_STOPPED_STATES,
        included_instances=all_instances,
    )
    if not handler_to_instances:
        break
    time.sleep(constants.POLL_INTERVAL)
else:
    raise RuntimeError(f'Maximum number of polls: '
                       f'{constants.MAX_POLLS_STOP} reached. '
                       f'Instance {all_instances} is still not in '
                       'STOPPED status.')

vs
# We don't wait for the instances to be terminated, as it can take a long
# time (same as what we did in ray's node_provider).

Inline review comments (all resolved) on: sky/backends/backend_utils.py, sky/backends/cloud_vm_ray_backend.py, sky/provision/azure/instance.py
@cg505 cg505 (Collaborator, Author) commented Dec 6, 2024

My main comment is about where we do the wait for the terminating/stopping state. I am leaning towards having each cloud's specific instance.py wait for the state transition

@Michaelvll A couple points:

  1. I guess the fact that this is inconsistent between clouds is all the more reason to make the implementation cloud-agnostic. Then we don't need to go through each cloud implementation to make sure it's doing "the right thing" (even though that would probably be good to have). Fixing this properly requires a good understanding of the exact circumstances that allow instance state transitions, and outside of the big 3 clouds this is often poorly documented or not documented at all.
  2. I think there is some confusion around waiting for the state change. Most clouds have an explicit stopping/terminating state before reaching the final stopped/terminated state. It's not clear whether the comments you linked are talking about the former or the latter. For our use case, we basically treat stopping/terminating as equivalent to stopped/terminated, as discussed in the other thread.
    For instance, looking at the GCP code links:
    • For termination, the comment says "We don't wait for the instances to be terminated", but in my testing the instances immediately transition to STOPPING, which we consider "good enough" for our checks (since it's at least clear that our termination request succeeded).
    • For stopping, the comment says "GCP does not fully stop an instance even after the stop operation is finished." and the code waits until one of STOPPED_STATES is reached (STOPPING_STATES is not good enough). But query_instances will consider STOPPING_STATES as ClusterStatus.STOPPED.

Ideally, we should define, document, and enforce the exact invariants for the various states and transitions, so that every cloud implementation has a clear specification of how it should behave. But given that we don't have that right now, I think this is the safest approach. (A toy sketch of the state mapping I have in mind is at the end of this comment.)

P.S. If we're still concerned about the "wait for transition" change, we could make it apply only to termination, since a termination leak is much worse than a stopping leak.
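To make the stopping-vs-stopped equivalence concrete, here is a purely illustrative mapping. ClusterStatus below is a toy stand-in for SkyPilot's enum, and the state names are examples rather than any cloud's real API values.

import enum


class ClusterStatus(enum.Enum):  # toy stand-in, not sky.ClusterStatus
    UP = 'UP'
    STOPPED = 'STOPPED'


_TRANSITIONING_OR_DONE = {'STOPPING', 'STOPPED', 'TERMINATING', 'TERMINATED'}


def to_cluster_status(instance_state: str) -> ClusterStatus:
    # Once the stop/terminate transition has begun, the request clearly took
    # effect, so treat the instance the same as one in the terminal state.
    if instance_state.upper() in _TRANSITIONING_OR_DONE:
        return ClusterStatus.STOPPED
    return ClusterStatus.UP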

cg505 added 2 commits December 6, 2024 16:18
get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.
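As a rough illustration of the usage_intervals idea in this commit message (the types and helper below are assumptions, and note this approach was later reverted; see the commit list at the bottom of the page): the cluster's total duration is accumulated over its recorded (start, end) intervals, so unlike launched_at it is not reset by re-running sky launch on an existing cluster.

import time
from typing import List, Optional, Tuple


def get_cluster_duration(
        usage_intervals: List[Tuple[float, Optional[float]]]) -> float:
    """Sum the cluster's usage intervals; end=None means 'still active'."""
    total = 0.0
    for start, end in usage_intervals:
        if end is None:
            end = time.time()
        total += end - start
    return total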
@cg505 cg505 requested a review from Michaelvll December 7, 2024 00:29
@Michaelvll Michaelvll (Collaborator) left a comment

Thanks @cg505! Once we have the two changes we discussed offline, this should be good to go.

Inline review comment (resolved) on sky/backends/cloud_vm_ray_backend.py:
Comment on lines 4154 to 4162
            raise RuntimeError(f'Instance {node_id} in unexpected '
                               f'state {node_status}.')
        break
    except RuntimeError:
        attempts += 1
        if attempts < _TEARDOWN_WAIT_MAX_ATTEMPTS:
            time.sleep(_TEARDOWN_WAIT_BETWEEN_ATTEMPS_SECONDS)
        else:
            raise
@Michaelvll Michaelvll (Collaborator):

Seems this throw and catch is not necessary? Should we just use a flag?

@cg505 cg505 (Collaborator, Author) commented Dec 7, 2024

It won't re-throw if we should retry. Not sure what you mean by using a flag. Never mind, I see.

@Michaelvll Michaelvll (Collaborator) commented Dec 7, 2024

Ooh, I mean that exception handling is more expensive, and raising here seems unnecessary since we can do something like the following:

while True:
    is_fail = False
    for node in nodes:
        if condition:
            is_fail = True
            break
    if is_fail:
        ...

@cg505 cg505 requested a review from Michaelvll December 7, 2024 01:40
@Michaelvll Michaelvll (Collaborator) left a comment

Thanks @cg505! Left two comments about the code comments. Otherwise, LGTM.

Inline review comments (resolved) on: sky/backends/backend_utils.py, sky/backends/cloud_vm_ray_backend.py
@cg505 cg505 enabled auto-merge (squash) December 8, 2024 01:05
@cg505 cg505 merged commit 84f09ef into skypilot-org:master Dec 8, 2024
19 checks passed
zpoint pushed a commit to zpoint/skypilot that referenced this pull request Dec 9, 2024
[robustness] cover some potential resource leakage cases (skypilot-org#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses skypilot-org#4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>