From a0243e56484797c745f42c04260d34e6d280a384 Mon Sep 17 00:00:00 2001
From: Romil Bhardwaj
Date: Mon, 14 Oct 2024 12:49:45 -0700
Subject: [PATCH] [docs] `sky status --kubernetes` docs (#4064)

* observability docs

* comments
---
 .../kubernetes/kubernetes-getting-started.rst | 51 +++++++++++++++++
 .../reference/kubernetes/kubernetes-setup.rst | 57 +++++++++++++++++--
 2 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst
index 4f87c8a6ee7..d7313fba3e2 100644
--- a/docs/source/reference/kubernetes/kubernetes-getting-started.rst
+++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst
@@ -119,6 +119,57 @@ Once your cluster administrator has :ref:`setup a Kubernetes cluster `_ for easily viewing and managing
-SkyPilot tasks running on your cluster.
+Below, we provide tips on how to monitor SkyPilot resources on your Kubernetes cluster.
+
+.. _kubernetes-observability-skystatus:
+
+List SkyPilot resources across all users
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We provide a convenience command, :code:`sky status --k8s`, to view the status of all SkyPilot resources in the cluster.
+
+Unlike :code:`sky status` which lists only the SkyPilot resources launched by the current user,
+:code:`sky status --k8s` lists all SkyPilot resources in the cluster across all users.
+
+.. code-block:: console
+
+    $ sky status --k8s
+    Kubernetes cluster state (context: mycluster)
+    SkyPilot clusters
+    USER   NAME                           LAUNCHED    RESOURCES                                  STATUS
+    alice  infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
+    alice  sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
+    alice  sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
+    bob    dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
+    bob    multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
+    bob    sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
+
+    Managed jobs
+    In progress tasks: 1 STARTING
+    USER   ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
+    alice  1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0            SUCCEEDED
+    bob    4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0            SUCCEEDED
+    bob    3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0            STARTING
+    bob    2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0            FAILED
+    bob    1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0            SUCCEEDED
+
+
+.. _kubernetes-observability-dashboard:
+
+Kubernetes Dashboard
+^^^^^^^^^^^^^^^^^^^^
+You can deploy tools such as the `Kubernetes dashboard `_ to easily view and manage
+SkyPilot resources on your cluster.
 
 .. image:: ../../images/screenshots/kubernetes/kubernetes-dashboard.png
     :width: 80%