[k8s] Add cluster attributes(autodown, idle-minutes-to-autostop) as annotations to the pod #3870

Collaborator

@landscapepainter landscapepainter commented Aug 24, 2024

This resolves #3869

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Test that annotations are added and removed when autostop is cancelled and re-enabled, on kind/GKE:
    1. sky launch --cloud kubernetes -c mycluster -i 20 --down -y --num-nodes 3 --cpus=1: confirm that the annotations are added to the head/worker pods.
    2. sky autostop mycluster --cancel: confirm that the annotations are removed from the head/worker pods.
    3. sky autostop mycluster -i 20 --down: confirm that the annotations are added back to the head/worker pods.
    • Test sky launch with only the --down flag (no -i flag): confirm that the --down annotation is added with value True and the idle_minutes_to_autostop annotation is added with value 5 on the pod.
    • Test sky launch with only the -i flag (no --down flag): confirm that it fails, since stopping is not supported on k8s.
  • pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials": passes except for the following tests, which also fail on the master branch:
    1. test_skyserve_fast_update
    2. test_managed_jobs_storage
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh
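
For readers following the annotation tests above, here is a minimal sketch of how a patch body that adds or clears the two annotations could be built. It is an illustration under stated assumptions, not the code in this PR: the annotation keys and the function name `build_annotation_patch` are hypothetical. It relies on two general Kubernetes facts: annotation values must be strings, and in a merge patch a null value deletes the key.

```python
from typing import Any, Dict, Optional

# Hypothetical annotation keys, for illustration only. The keys actually
# used by this PR are not shown in this thread.
AUTODOWN_KEY = 'example.com/autodown'
IDLE_MINUTES_KEY = 'example.com/idle-minutes-to-autostop'


def build_annotation_patch(idle_minutes: Optional[int],
                           down: bool) -> Dict[str, Any]:
    """Return a merge-patch body that sets or clears the autostop annotations.

    idle_minutes=None models `sky autostop --cancel`: both keys are mapped
    to None so that a merge patch removes them from the pod.
    """
    if idle_minutes is None:
        annotations: Dict[str, Optional[str]] = {
            AUTODOWN_KEY: None,
            IDLE_MINUTES_KEY: None,
        }
    else:
        # Kubernetes annotation values must be strings.
        annotations = {
            AUTODOWN_KEY: str(down),
            IDLE_MINUTES_KEY: str(idle_minutes),
        }
    return {'metadata': {'annotations': annotations}}


if __name__ == '__main__':
    print(build_annotation_patch(20, True))     # set both annotations
    print(build_annotation_patch(None, False))  # clear both annotations
```

A body like this could then be applied to each head/worker pod, for example through the Kubernetes Python client's `CoreV1Api.patch_namespaced_pod(name, namespace, body)`; whether the PR does it that way is not stated here.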

@landscapepainter landscapepainter changed the title [k8s] Add autodown annotations to the pod [k8s] Add cluster attributes(autodown, idle-minutes-to-autostop) as annotations to the pod Aug 24, 2024
@landscapepainter landscapepainter marked this pull request as draft August 24, 2024 05:04
@landscapepainter landscapepainter marked this pull request as ready for review August 25, 2024 01:31
sky/cli.py (outdated, resolved)
@landscapepainter
Collaborator Author

@romilbhardwaj This is ready for another look!

Collaborator

@romilbhardwaj romilbhardwaj left a comment

sky/provision/kubernetes/utils.py (five review threads; outdated, resolved)
Collaborator

@romilbhardwaj romilbhardwaj left a comment

@@ -48,6 +48,7 @@
from sky.provision import instance_setup
from sky.provision import metadata_utils
from sky.provision import provisioner
from sky.provision.kubernetes import utils as kubernetes_utils

Collaborator

Can we quickly check that this global import does not break SkyPilot when the Kubernetes dependencies are not installed?

Collaborator Author

Tested by running sky launch --cloud gcp -y in a fresh environment where only pip install -e ".[gcp]" had been run; no errors were raised.
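
As a general illustration of the concern raised in this thread, one common way to keep a module-level import safe when an optional dependency is missing is to defer the real import until first use. The sketch below is a hypothetical, minimal wrapper (the class name `LazyModule` and the `hint` parameter are assumptions made for this example); it is not claimed to be how SkyPilot handles the import in this PR.

```python
import importlib
import types
from typing import Any, Optional


class LazyModule:
    """Defer importing an optional dependency until an attribute is used."""

    def __init__(self, module_name: str, hint: str = '') -> None:
        self._module_name = module_name
        self._hint = hint
        self._module: Optional[types.ModuleType] = None

    def __getattr__(self, name: str) -> Any:
        # __getattr__ is only invoked for names not found on the instance,
        # i.e. for members of the wrapped module.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as e:
                raise ImportError(
                    f'Failed to import {self._module_name!r}. {self._hint}'
                ) from e
        return getattr(self._module, name)


if __name__ == '__main__':
    # Creating the wrapper never imports anything, so this line is safe
    # even when the package is absent.
    optional_dep = LazyModule('some_optional_package',
                              hint='Install the matching extra first.')
    json_mod = LazyModule('json')
    print(json_mod.dumps({'ok': True}))
```

With this pattern, a missing dependency only surfaces as an ImportError at the point where the module is actually used, which is the behavior the manual test above checks for from the outside.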

@landscapepainter landscapepainter added this pull request to the merge queue Sep 26, 2024
Merged via the queue into skypilot-org:master with commit d4f96e6 Sep 26, 2024
20 checks passed
@landscapepainter landscapepainter deleted the k8s-add-autodown-annotations branch September 26, 2024 02:06
asaiacai added a commit to asaiacai/zep that referenced this pull request Sep 27, 2024
* [LLM] Update qwen examples (skypilot-org#3957)

* update qwen examples

* Fix misalign

* Qwen 2.5 support (skypilot-org#3959)

* Update qwen example for 2.5 release

* Add support for qwen 2.5 example

* Qwen 2.5 k8s (skypilot-org#3960)

* Update qwen example for 2.5 release

* Add support for qwen 2.5 example

* add kubernetes

* Integrating the Yi series models (skypilot-org#3958)

* Add files via upload

* Update and rename qwen2-7b.yaml to yi15-6b.yaml

* Add files via upload

* Update yi15-9b.yaml

* Update yi15-34b.yaml

* Update yi15-6b.yaml

* Add files via upload

* Update yicoder-1_5b.yaml

* Update yicoder-9b.yaml

* Add files via upload

* Update yi15-34b.yaml

* Update yi15-6b.yaml

* Update yi15-9b.yaml

* Update yicoder-1_5b.yaml

* Update yicoder-9b.yaml

* [Test] Fix Smoke Test `test-skyserve-fast-update` (skypilot-org#3956)

* init

* add newline

* [LLM] Add Qwen2-VL multimodal example (skypilot-org#3961)

Add multimodal example

* Update README.md  (skypilot-org#3969)

* Add files via upload

* Update and rename qwen2-7b.yaml to yi15-6b.yaml

* Add files via upload

* Update yi15-9b.yaml

* Update yi15-34b.yaml

* Update yi15-6b.yaml

* Add files via upload

* Update yicoder-1_5b.yaml

* Update yicoder-9b.yaml

* Add files via upload

* Update yi15-34b.yaml

* Update yi15-6b.yaml

* Update yi15-9b.yaml

* Update yicoder-1_5b.yaml

* Update yicoder-9b.yaml

* Update README.md

* [Core] Admin policy enforcement plugin (skypilot-org#3966)

* support policy hook

* test task labels

* Add test for policy that sets labels

* Fix comment

* format

* use -e to make test related files visible

* Add config.rst

* Fix test

* fix config rst

* Apply policy to service

* add policy for serving

* Add docs

* fix

* format

* Update interface

* fix

* Fix

* fix

* Fix test config

* Fix mutated config

* fix

* Add policy doc

* rename

* minor

* Add additional arguments for autostop

* fix mypy

* format

* rejected message

* format

* Update sky/utils/policy_utils.py

Co-authored-by: Zongheng Yang <[email protected]>

* Update sky/utils/policy_utils.py

Co-authored-by: Zongheng Yang <[email protected]>

* Fix

* Update examples/admin_policy/example_policy/example_policy/__init__.py

Co-authored-by: Zongheng Yang <[email protected]>

* Update docs/source/reference/config.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Address comments

* format

* changes in examples

* Fix enforce autostop

* Fix autostop enforcement

* fix test

* Update docs/source/cloud-setup/policy.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Update sky/admin_policy.py

Co-authored-by: Zongheng Yang <[email protected]>

* Update sky/admin_policy.py

Co-authored-by: Zongheng Yang <[email protected]>

* wip

* Update docs/source/cloud-setup/policy.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Update docs/source/cloud-setup/policy.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Update docs/source/cloud-setup/policy.rst

Co-authored-by: Zongheng Yang <[email protected]>

* fix

* fix

* fix

* Use sky.status for autostop

* update policy

* Update docs/source/cloud-setup/policy.rst

Co-authored-by: Zongheng Yang <[email protected]>

* fix policy.rst

* Add comment

* Fix logging

* fix CI

* Update docs/source/cloud-setup/policy.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Use sphnix inline code

* Add comment

* fix skypilot config file mounts for jobs and serve

---------

Co-authored-by: Zongheng Yang <[email protected]>

* [k8s] Autodown Serve controller on Kubernetes (skypilot-org#3984)

* Add autodown for skyserve on k8s

* lint

* [Tests] Add missing changes from skypilot-org#3966 for fast service update test (skypilot-org#3976)

Use wget instead of git clone for faster downloading

* [Paperspace] add A4000, P4000, GPU+ (skypilot-org#3991)

add A4000, P4000, GPU+

* [Docs] Fix highlighting in code block (skypilot-org#3994)

Fix highlighting in code block

Fixes skypilot-org#3993

* [LLM] Llama 3.2 guide (skypilot-org#3990)

* Add llama 3.2 example

* update

* length

* fix

* update

* update cpus limit

* Use 11B instead for better performance

* update

* update

* Add link

* Fix reference

* Fix vllm version

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang <[email protected]>

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang <[email protected]>

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang <[email protected]>

* Update llm/llama-3_2/README.md

Co-authored-by: Zongheng Yang <[email protected]>

* Fix title

* news

* no need to pin transformers

* remove cover photo for now

---------

Co-authored-by: Zongheng Yang <[email protected]>

* [k8s] Add cluster attributes(autodown, idle-minutes-to-autostop) as annotations to the pod (skypilot-org#3870)

* add autodown annotations to the k8s pod

* revert kubernetes ray template

* revert backend_utils from invasive approach

* nit

* revert from invasive approaches

* revert

* updated approach

* nit

* nit

* Use constant to represent idle_minutes_to_autostop for cancellation

* revert using constants for cancel

* nit

* nit

* add smoke tests

* Update sky/provision/kubernetes/utils.py

Co-authored-by: Romil Bhardwaj <[email protected]>

* fix comments

* nit

* remove loops and annotate one by one

* format

* update with autodown annotation with context

* format

---------

Co-authored-by: Romil Bhardwaj <[email protected]>

* [Examples] Add airflow example (skypilot-org#3982)

* Airflow example

* Airflow example

* Airflow example

* Airflow example

* wip

* Update airflow examples

* Update airflow examples

* Update airflow examples

* Add to readme

* Add to readme

* Add to readme

* lint

* updates

* less salesy

* comments

* comments

* comments

* [UX] default to minimal logging (no module/line number/timestamp). (skypilot-org#3980)

* [UX] default to minimal logging (no module/line number/timestamp).

* Fix mypy.

* Fix typing

* Update sky/utils/env_options.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/utils/env_options.py

Co-authored-by: Tian Xia <[email protected]>

* Account for debug flag.

* Remove prefixes from docs.

---------

Co-authored-by: Tian Xia <[email protected]>

* Revert "[UX] default to minimal logging (no module/line number/timestamp)." (skypilot-org#4003)

Revert "[UX] default to minimal logging (no module/line number/timestamp). (#…"

This reverts commit b96a5b4.

* [Docs] Clarify k8s private registry usage in docs (skypilot-org#3998)

* Clarify k8s private registry auth in docs.

* comments

* [Docs] Various polishing. (skypilot-org#4002)

* [Docs] Various polishing.

* update

* Reword.

* lint

---------

Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Haijian Wang <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Andy Lee <[email protected]>
Co-authored-by: landscapepainter <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Development

Successfully merging this pull request may close these issues.

[k8s] Adding cluster attributes to pods as annotations
2 participants