Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make --fast robust against credential or wheel updates #4289

Merged
merged 21 commits into from
Dec 4, 2024

Conversation

cg505
Copy link
Collaborator

@cg505 cg505 commented Nov 7, 2024

  • Introduce a "config hash", which is a deterministic hash of the ray cluster config and all the referenced file mounts. We expect this value to stay the same when running launch on an existing cluster if all else is equal.
  • Store the current config hash for each cluster into the global_user_state DB.
  • Add a flag to provisioning that allows the provisioning path to short-circuit if the calculated config hash is the same as the one that should already be present on the cluster.
  • Use this new path for --fast.

The result is that --fast will reprovision the cluster if some important things change (such as new cloud credentials or a new version of the skypilot wheel).

However, this does cause some performance regression in the --fast case since we need to go through the initial provisioning stages. That costs about 4s on my machine. I am looking into optimizing this - it's mostly an unnecessary roundtrip checking that the cluster is up.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Manually test that updating credential files is caught
    • Manually test that new skypilot wheel is caught
    • Test happy path
    • Test for managed jobs controller
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests:
    • pytest tests/test_smoke.py::test_minimal
    • pytest tests/test_smoke.py:: --managed-jobs
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

cg505 added 2 commits November 7, 2024 20:29
This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @cg505! It looks mostly good to me. Left several comments

sky/backends/backend.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
@cg505 cg505 requested a review from Michaelvll November 12, 2024 23:13
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @cg505! It looks mostly good to me! Left several comments, mostly nits.

Can we also add a smoke test for --fast?

sky/backends/backend.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates @cg505! It looks mostly good to me.

sky/backends/backend.py Outdated Show resolved Hide resolved
sky/backends/backend.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/global_user_state.py Outdated Show resolved Hide resolved
@cg505 cg505 requested a review from Michaelvll November 16, 2024 03:40
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cg505! It looks mostly good to me. Left some comments mostly for readability and code style.

We should make sure the backward compatibility tests pass. Also, can we add a quick smoke test to test this --fast.

sky/backends/backend.py Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
@cg505 cg505 requested a review from Michaelvll December 3, 2024 21:27
@cg505 cg505 merged commit 3009204 into skypilot-org:master Dec 4, 2024
19 checks passed
Michaelvll added a commit that referenced this pull request Dec 9, 2024
* [perf] use uv for venv creation and pip install (#4414)

* Revert "remove `uv` from runtime setup due to azure installation issue (#4401)"

This reverts commit 0b20d56.

* on azure, use --prerelease=allow to install azure-cli

* use uv venv --seed

* fix backwards compatibility

* really fix backwards compatibility

* use uv to set up controller dependencies

* fix python 3.8

* lint

* add missing file

* update comment

* split out azure-cli dep

* fix lint for dependencies

* use runpy.run_path rather than modifying sys.path

* fix cloud dependency installation commands

* lint

* Update sky/utils/controller_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

---------

Co-authored-by: Zhanghao Wu <[email protected]>

* [Minor] README updates. (#4436)

* [Minor] README touches.

* update

* update

* make --fast robust against credential or wheel updates (#4289)

* add config_dict['config_hash'] output to write_cluster_config

* fix docstring for write_cluster_config

This used to be true, but since #2943, 'ray' is the only provisioner.
Add other keys that are now present instead.

* when using --fast, check if config_hash matches, and if not, provision

* mock hashing method in unit test

This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.

* check config hash within provision with lock held

* address other PR review comments

* rename to skip_if_no_cluster_updates

Co-authored-by: Zhanghao Wu <[email protected]>

* add assert details

Co-authored-by: Zhanghao Wu <[email protected]>

* address PR comments and update docstrings

* fix test

* update docstrings

Co-authored-by: Zhanghao Wu <[email protected]>

* address PR comments

* fix lint and tests

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <[email protected]>

* refactor skip_if_no_cluster_update var

* clarify comment

* format exception

---------

Co-authored-by: Zhanghao Wu <[email protected]>

* format

* format

* format

* fix

---------

Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
cg505 added a commit to cg505/skypilot that referenced this pull request Dec 12, 2024
The --fast behavior is now always enabled. This was unsafe before but since
\skypilot-org#4289 it should be safe.

We will remove the flag before 0.8.0 so that it never touches a stable version.

sky launch still has the --fast flag. This flag is unsafe because it could cause
setup to be skipped even though it should be re-run. In the managed jobs case,
this is not an issue because we fully control the setup and know it will not
change.
@cg505 cg505 mentioned this pull request Dec 12, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants