-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Continue storage deletion when some fail #4454
base: master
Are you sure you want to change the base?
Conversation
@cblmemo PTAL, thanks! |
I inserted five new records into the 'global_user_state' database: three records containing the previous storage handle value, and two records with null handle values. andyl@DESKTOP-7FP6SMO ~/skypilot (fix-storage-recover) [1]> sky storage dele
te -a
Deleting 6 storages: skypilot-workdir-andyl-74d418df, test-storage-1, test-storage-2, test-storage-3, test-storage-4, test-storage-5. Proceed? [Y/n]:
Error deleting storage test-storage-3: In global_user_state, storage name 'test-storage-3' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-2: In global_user_state, storage name 'test-storage-2' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-4: Storage name 'test-storage-4' not found.
Error deleting storage test-storage-1: In global_user_state, storage name 'test-storage-1' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-5: Storage name 'test-storage-5' not found.
Deleted GCS bucket skypilot-workdir-andyl-74d418df. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt fix @andylizf ! left some discussions.
""" | ||
# Reference: https://stackoverflow.com/questions/25790279/python-multiprocessing-early-termination # pylint: disable=line-too-long |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not removing this reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean why remove the reference? I think it's trivial now, and the reference doesn't add much value to the code.
sky/utils/subprocess_utils.py
Outdated
def run_in_parallel(func: Callable, | ||
args: Iterable[Any], | ||
num_threads: Optional[int] = None) -> List[Any]: | ||
num_threads: Optional[int] = None, | ||
continue_on_error: bool = False) -> List[Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding this seems change the return value type from Value
to Union[Value, Exception]
, which looks a little bit strange to me. Why do we need this anyway? aren't we already add the try-except in the func?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I thought about not adding a function to catch and log these exceptions, but instead handle all exceptions in run_in_parallel
. Anyway, it made sense to have this function have an argument like continue_on_error
.
But then I realized it violated the abstraction concept and would make it harder to print custom logs. So I ended up wrapping the function and logging it.
Fix #4050.
Make
sky storage delete
continue deleting remaining storages when some fail, instead of aborting the entire operation. Failed deletions are reported individually.We Implemented this feature via adding a
continue_on_error
parameter inrun_in_parallel
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh