[Release] Release 0.7.1 #4438
base: releases/0.7.1_pure
Conversation
…ing (skypilot-org#4264)
* fix race condition for setting job status to FAILED during INIT
* Fix
* fix
* format
* Add smoke tests
* revert pending submit
* remove update entirely for the job schedule step
* wait for job 32 to finish
* fix smoke
* move and rename
* Add comment
* minor
* Avoid job schedule race condition
* format
* format
* Avoid race for cancel
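To make the race in the commit above concrete, here is a minimal sketch of a conditional status transition; the table schema and helper name are illustrative assumptions, not SkyPilot's actual job-state code:

```python
import sqlite3

# Hypothetical schema: jobs(job_id INTEGER PRIMARY KEY, status TEXT).
def set_failed_if_still_init(conn: sqlite3.Connection, job_id: int) -> bool:
    """Move a job from INIT to FAILED only if nothing else has advanced it.

    The status check in the WHERE clause makes the transition a
    compare-and-set: if a concurrent scheduler step already moved the job
    out of INIT, this UPDATE matches zero rows and the stale FAILED write
    is simply dropped instead of overwriting the newer status.
    """
    cur = conn.execute(
        'UPDATE jobs SET status = ? WHERE job_id = ? AND status = ?',
        ('FAILED', job_id, 'INIT'))
    conn.commit()
    return cur.rowcount > 0
```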
…ounts are specified (skypilot-org#4317)
* do file mounts if storage is specified
* avoid catching ValueError during failover: if the cloud API raises ValueError or a subclass of ValueError during instance termination, we will assume the cluster was downed. Fix this by introducing a new exception, ClusterDoesNotExist, that we can catch instead of the more general ValueError.
* add unit test
* lint
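A minimal sketch of the narrowed exception handling described above, assuming a hypothetical ClusterDoesNotExist exception and placeholder termination helpers (not the actual SkyPilot failover code):

```python
class ClusterDoesNotExist(Exception):
    """Raised only when the cloud confirms the cluster is already gone."""


def terminate_instances(cluster_name: str) -> None:
    # Placeholder for a cloud-specific termination call.  Previously, any
    # ValueError escaping the cloud SDK here was mistaken for "cluster
    # already downed"; now only the dedicated exception carries that meaning.
    raise ClusterDoesNotExist(f'{cluster_name} was not found in the cloud')


def down_cluster(cluster_name: str) -> None:
    try:
        terminate_instances(cluster_name)
    except ClusterDoesNotExist:
        print(f'{cluster_name} no longer exists; removing its local record.')
    # A ValueError (or subclass) raised by the SDK now propagates as a real
    # failure instead of being swallowed by a broad `except ValueError`.


down_cluster('my-cluster')
```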
…g#4443)
* if a newly-created cluster is missing from the cloud, wait before deleting. Addresses skypilot-org#4431.
* confirm cluster actually terminates before deleting from the db
* avoid deleting cluster data outside the primary provision loop
* tweaks
* Apply suggestions from code review. Co-authored-by: Zhanghao Wu <[email protected]>
* use usage_intervals for new cluster detection: get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check.
* fix terminating/stopping state for Lambda and Paperspace
* Revert "use usage_intervals for new cluster detection". This reverts commit aa6d2e9.
* check cloud.STATUS_VERSION before calling query_instances
* avoid try/catch when querying instances
* update comments

Co-authored-by: Zhanghao Wu <[email protected]>
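For the wait-before-delete behavior in the commit above, a rough sketch of the idea; the polling interval, attempt count, and function names are illustrative assumptions rather than the actual implementation:

```python
import time
from typing import Set


def query_instances(cluster_name: str) -> Set[str]:
    # Placeholder for a cloud list/describe call.  Right after provisioning,
    # some clouds are eventually consistent and may briefly report no
    # instances for a cluster that does in fact exist.
    return set()


def confirm_cluster_gone(cluster_name: str,
                         attempts: int = 5,
                         wait_seconds: float = 10.0) -> bool:
    """Return True only if the cluster stays missing across several polls."""
    for _ in range(attempts):
        if query_instances(cluster_name):
            return False  # It showed up; do not delete the local record.
        time.sleep(wait_seconds)
    return True  # Consistently missing; safe to drop from the local db.
```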
* smoke tests support storage mount only
* fix verify command
* rename to only_mount
@@ -1144,7 +1144,7 @@ def test_gcp_stale_job_manual_restart():
     # Ensure the skylet updated the stale job status.
     _get_cmd_wait_until_job_status_contains_without_matching_job(
         cluster_name=name,
-        job_status=[JobStatus.FAILED.value],
+        job_status=[JobStatus.FAILED],
For this kind of hot fix, we may want to include it in master and cherry-pick it?
It's due to a merge conflict. The master branch's value is FAILED_DRIVER, which does not exist in version 0.7.1 but is correct on master.
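To illustrate why the enum member (rather than its string value) is passed here, a simplified stand-in for the smoke-test helper; the real helper's signature may differ, so treat this purely as a sketch:

```python
import enum


class JobStatus(enum.Enum):
    INIT = 'INIT'
    FAILED = 'FAILED'


def build_wait_cmd(cluster_name: str, job_status) -> str:
    # If the helper reads .value itself (as assumed here), callers must pass
    # enum members; passing JobStatus.FAILED.value (a plain str) would raise
    # AttributeError when .value is accessed below.
    statuses = '|'.join(s.value for s in job_status)
    return f'sky queue {cluster_name} | grep -E "{statuses}"'


print(build_wait_cmd('my-cluster', [JobStatus.FAILED]))
```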
Based on releases/0.7.0, this cherry-picks all commits from 0.7.1, with minor changes only to smoke_tests.py to ensure more smoke tests pass and Buildkite works.

Smoke tests:
Use Buildkite CI to run the following tests:
pytest tests/test_smoke.py --aws
pytest tests/test_smoke.py --gcp
pytest tests/test_smoke.py --azure
pytest tests/test_smoke.py --kubernetes
All tests pass except the following failures:
You can view them by clicking the failure in Buildkite:
Manual tests:
* Open docs/build/index.html and scroll over “CLI Reference” (ideally, every page) to see if there are missing sections (we once caught the CLI page completely missing due to an import error, and once it displayed weird blockquotes).
* sky -v
* backward_compatibility_tests.sh run against 0.7.0 on AWS, run by Buildkite
* sky launch --num-nodes=75 -c dbg --cpus 2+ --use-spot --down --cloud aws -y
* sky show-gpus manual tests
* Run a 24-hour+ spot job and ensure it doesn’t OOM: sky spot launch -n test-oom --cloud aws --cpus 2 sleep 1000000000000000