[Release] Release 0.7.1 #4438

Open · wants to merge 14 commits into base: releases/0.7.1_pure
Conversation

@zpoint zpoint commented Dec 4, 2024

Based on releases/0.7.0, this PR cherry-picks all commits from 0.7.1, with minor changes only to smoke_tests.py so that more smoke tests pass and Buildkite works.
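For context, a minimal sketch of how a branch like this can be assembled with the usual cherry-pick flow (the commit placeholders are hypothetical, not the actual hashes used here):

# Start the release branch from the previous release.
git checkout -b releases/0.7.1_pure releases/0.7.0
# Replay the 0.7.1 commits on top; <first>/<last> are placeholders, not real commit hashes.
git cherry-pick <first-0.7.1-commit>^..<last-0.7.1-commit>
# Resolve any conflicts (e.g. the smoke_tests.py tweaks), then continue.
git cherry-pick --continue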

Smoke tests:

Use Buildkite CI to run the following tests:

  • pytest tests/test_smoke.py --aws
  • pytest tests/test_smoke.py --gcp
  • pytest tests/test_smoke.py --azure
  • pytest tests/test_smoke.py --kubernetes

All pass except the following failures:

pytest tests/test_smoke.py::test_tpu_vm_pod --gcp --- setup failure, env error; fixed by another PR on master
pytest tests/test_smoke.py::test_tpu_vm --gcp --- setup failure, env error; fixed by another PR on master
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --azure --- credential issue: Permission denied by location policies
pytest tests/test_smoke.py::test_gcp_force_enable_external_ips --gcp --- SSH failure on provision, also fails on master
pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- failed to provision (?)
pytest tests/test_smoke.py::test_azure_best_tier_failover --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_file_mounts --azure --- failed to run command before rsync (?)
pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- got FAILED_INITIAL_DELAY instead of FAILED
pytest tests/test_smoke.py::test_azure_disk_tier --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_kubernetes_context_failover --kubernetes --- resource limit, no H100 available
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --aws --- Permission denied by location policies

You can view the details by clicking each failure in Buildkite:
[screenshot: Buildkite build results]

Manual tests:

  • locally build the docs, open docs/build/index.html, and scroll through “CLI Reference” (ideally every page) to check for missing sections (we once caught the CLI page completely missing due to an import error, and another time it displayed odd blockquotes); see the docs-build sketch after this list
  • Check sky -v
  • backward_compatibility_tests.sh run against 0.7.0 on AWS, run via Buildkite
  • Run manual stress tests (see the sub-items below)
    • the following script:
      sky jobs launch --gpus A100:8 --cloud aws echo hi -y
      # Check we are properly failing over the zones:
      sky jobs logs --controller
      
    • the following script (fails because resources are unavailable):
      sky launch -c dbg --cloud aws --num-nodes 16 --gpus T4 --down --use-spot 
      sky down dbg
      
    • sky launch --num-nodes=75 -c dbg --cpus 2+ --use-spot --down --cloud aws -y
    • many jobs (see the queue sanity-check sketch after this list)
# Launching many jobs on a cluster
sky launch -c test-many-jobs --cloud aws --cpus 16 --region us-east-1
python3 -c "
import subprocess
from multiprocessing.pool import ThreadPool

def run_task(task):
    print(f'Running task {task}')
    subprocess.run(f'sky exec test-many-jobs -d \"echo hi {task}; sleep 60\"', shell=True)

pool = ThreadPool(8)
pool.map(run_task, range(1000))
"
# Test that the job queue on the cluster is correct
sky queue test-many-jobs
  • Manual tests of sky show-gpus

  • Run a 24-hour+ spot job and ensure it doesn’t OOM
    sky spot launch -n test-oom --cloud aws --cpus 2 sleep 1000000000000000
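As referenced in the docs item above, a minimal sketch of the local docs build, assuming a standard Sphinx Makefile layout under docs/ (the requirements file name is an assumption):

cd docs
pip install -r requirements-docs.txt   # assumption: docs dependencies live here
make html                              # assumption: standard Sphinx Makefile target
# Then open docs/build/index.html and scroll through “CLI Reference” (ideally every page).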

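As referenced in the many-jobs item above, a quick sanity check of the cluster queue once the 1000 sky exec jobs have been submitted; this assumes the default sky queue table prints one row per job with its status:

# All 1000 jobs should eventually show SUCCEEDED; the second count should drop to 0.
sky queue test-many-jobs | grep -c SUCCEEDED
sky queue test-many-jobs | grep -cE 'RUNNING|PENDING'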
Michaelvll and others added 9 commits December 4, 2024 18:34
…ing (skypilot-org#4264)

* fix race condition for setting job status to FAILED during INIT

* Fix

* fix

* format

* Add smoke tests

* revert pending submit

* remove update entirely for the job schedule step

* wait for job 32 to finish

* fix smoke

* move and rename

* Add comment

* minor
* Avoid job schedule race condition

* format

* format

* Avoid race for cancel
…ounts are specified (skypilot-org#4317)

do file mounts if storage is specified
* avoid catching ValueError during failover

If the cloud api raises ValueError or a subclass of ValueError during instance
termination, we will assume the cluster was downed. Fix this by introducing a
new exception ClusterDoesNotExist that we can catch instead of the more general
ValueError.

* add unit test

* lint
@zpoint zpoint changed the title from [Release] Release 0.7.0 to [Release] Release 0.7.1 on Dec 4, 2024
@zpoint zpoint changed the base branch from releases/0.7.1 to releases/0.7.1_pure December 4, 2024 10:51
cg505 and others added 3 commits December 9, 2024 10:58
…g#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses skypilot-org#4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>
* smoke tests support storage mount only

* fix verify command

* rename to only_mount
@zpoint zpoint requested a review from Michaelvll December 10, 2024 04:30
@romilbhardwaj romilbhardwaj self-requested a review December 10, 2024 18:25
@@ -1144,7 +1144,7 @@ def test_gcp_stale_job_manual_restart():
     # Ensure the skylet updated the stale job status.
     _get_cmd_wait_until_job_status_contains_without_matching_job(
         cluster_name=name,
-        job_status=[JobStatus.FAILED.value],
+        job_status=[JobStatus.FAILED],
Collaborator:
For this kind of hot fix, we may want to include it in master and cherry-pick it?

zpoint (Author):
It's due to a merge conflict. On master the value is FAILED_DRIVER, which is correct there but does not exist in 0.7.1.
