[Release] Release 0.7.1 #4438

Open · wants to merge 14 commits into base: releases/0.7.1_pure
Conversation

@zpoint zpoint commented Dec 4, 2024

Based on releases/0.7.0, this PR cherry-picks all commits from 0.7.1, with minor changes only to smoke_tests.py so that more smoke tests pass and Buildkite works.
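For context, a minimal sketch of how a branch like this can be assembled with the usual cherry-pick flow (the commit placeholders are hypothetical, not the actual hashes used here):

# Start the release branch from the previous release.
git checkout -b releases/0.7.1_pure releases/0.7.0
# Replay the 0.7.1 commits on top; <first>/<last> are placeholders, not real commit hashes.
git cherry-pick <first-0.7.1-commit>^..<last-0.7.1-commit>
# Resolve any conflicts (e.g. the smoke_tests.py tweaks), then continue.
git cherry-pick --continue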

Smoke tests:

Use Buildkite CI to run the following tests:

  • pytest tests/test_smoke.py --aws
  • pytest tests/test_smoke.py --gcp
  • pytest tests/test_smoke.py --azure
  • pytest tests/test_smoke.py --kubernetes

All pass except the following failures:

pytest tests/test_smoke.py::test_tpu_vm_pod --gcp --- setup failure, env error; fixed by another PR on master
pytest tests/test_smoke.py::test_tpu_vm --gcp --- setup failure, env error; fixed by another PR on master
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --azure --- credential issue: Permission denied by location policies
pytest tests/test_smoke.py::test_gcp_force_enable_external_ips --gcp --- SSH failure on provision, also fails on master
pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- failed to provision (?)
pytest tests/test_smoke.py::test_azure_best_tier_failover --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_file_mounts --azure --- failed to run command before rsync (?)
pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- got FAILED_INITIAL_DELAY instead of FAILED
pytest tests/test_smoke.py::test_azure_disk_tier --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_kubernetes_context_failover --kubernetes --- resource limit, no H100 available
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --aws --- Permission denied by location policies

You can view the details by clicking each failure in Buildkite:
[screenshot: Buildkite build results]

Manual tests:

  • locally build the docs, open docs/build/index.html, and scroll through “CLI Reference” (ideally every page) to check for missing sections (we once caught the CLI page completely missing due to an import error, and another time it displayed odd blockquotes); see the docs-build sketch after this list
  • Check sky -v
  • backward_compatibility_tests.sh run against 0.7.0 on AWS, run via Buildkite
  • Run manual stress tests (see the sub-items below)
    • the following script:
      sky jobs launch --gpus A100:8 --cloud aws echo hi -y
      # Check we are properly failing over the zones:
      sky jobs logs --controller
      
    • the following script (fails because resources are unavailable):
      sky launch -c dbg --cloud aws --num-nodes 16 --gpus T4 --down --use-spot 
      sky down dbg
      
    • sky launch --num-nodes=75 -c dbg --cpus 2+ --use-spot --down --cloud aws -y
    • many jobs (see the queue sanity-check sketch after this list)
# Launching many jobs on a cluster
sky launch -c test-many-jobs --cloud aws --cpus 16 --region us-east-1
python3 -c "
import subprocess
from multiprocessing.pool import ThreadPool

def run_task(task):
    print(f'Running task {task}')
    subprocess.run(f'sky exec test-many-jobs -d \"echo hi {task}; sleep 60\"', shell=True)

pool = ThreadPool(8)
pool.map(run_task, range(1000))
"
# Test that the job queue on the cluster is correct
sky queue test-many-jobs
  • Manual tests of sky show-gpus

  • Run a 24-hour+ spot job and ensure it doesn’t OOM
    sky spot launch -n test-oom --cloud aws --cpus 2 sleep 1000000000000000
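As referenced in the docs item above, a minimal sketch of the local docs build, assuming a standard Sphinx Makefile layout under docs/ (the requirements file name is an assumption):

cd docs
pip install -r requirements-docs.txt   # assumption: docs dependencies live here
make html                              # assumption: standard Sphinx Makefile target
# Then open docs/build/index.html and scroll through “CLI Reference” (ideally every page).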

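As referenced in the many-jobs item above, a quick sanity check of the cluster queue once the 1000 sky exec jobs have been submitted; this assumes the default sky queue table prints one row per job with its status:

# All 1000 jobs should eventually show SUCCEEDED; the second count should drop to 0.
sky queue test-many-jobs | grep -c SUCCEEDED
sky queue test-many-jobs | grep -cE 'RUNNING|PENDING'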
Michaelvll and others added 9 commits December 4, 2024 18:34
…ing (skypilot-org#4264)

* fix race condition for setting job status to FAILED during INIT

* Fix

* fix

* format

* Add smoke tests

* revert pending submit

* remove update entirely for the job schedule step

* wait for job 32 to finish

* fix smoke

* move and rename

* Add comment

* minor
* Avoid job schedule race condition

* format

* format

* Avoid race for cancel
…ounts are specified (skypilot-org#4317)

do file mounts if storage is specified
* avoid catching ValueError during failover

If the cloud api raises ValueError or a subclass of ValueError during instance
termination, we will assume the cluster was downed. Fix this by introducing a
new exception ClusterDoesNotExist that we can catch instead of the more general
ValueError.

* add unit test

* lint
@zpoint zpoint changed the title from [Release] Release 0.7.0 to [Release] Release 0.7.1 on Dec 4, 2024
@zpoint zpoint changed the base branch from releases/0.7.1 to releases/0.7.1_pure December 4, 2024 10:51
cg505 and others added 3 commits December 9, 2024 10:58
…g#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses skypilot-org#4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>
* smoke tests support storage mount only

* fix verify command

* rename to only_mount
@zpoint zpoint requested a review from Michaelvll December 10, 2024 04:30
@romilbhardwaj romilbhardwaj self-requested a review December 10, 2024 18:25
@@ -1144,7 +1144,7 @@ def test_gcp_stale_job_manual_restart():
     # Ensure the skylet updated the stale job status.
     _get_cmd_wait_until_job_status_contains_without_matching_job(
         cluster_name=name,
-        job_status=[JobStatus.FAILED.value],
+        job_status=[JobStatus.FAILED],
Collaborator:
For this kind of hot fix, we may want to include it in master and cherry-pick it?

zpoint (Author):
It's due to a merge conflict. On master the value is FAILED_DRIVER, which is correct there but does not exist in 0.7.1.
