diff --git a/README.md b/README.md index 61085db93c6..1a9ba0e7780 100644 --- a/README.md +++ b/README.md @@ -38,30 +38,32 @@ This repository is a fork of the [original Skypilot](https://github.com/skypilot ---- :fire: *News* :fire: -- [Sep, 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/) -- [Sep, 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI. -- [Jul, 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra -- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) -- [Apr, 2024] Serve **Qwen-110B** on your infra: [**example**](./llm/qwen/) -- [Apr, 2024] Using **Ollama** to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) -- [Feb, 2024] Deploying and scaling **Gemma** with SkyServe: [**example**](./llm/gemma/) -- [Feb, 2024] Serving **Code Llama 70B** with vLLM and SkyServe: [**example**](./llm/codellama/) -- [Dec, 2023] **Mixtral 8x7B**, a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) -- [Nov, 2023] Using **Axolotl** to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/) +- [Oct 2024] :tada: **SkyPilot crossed 1M+ downloads** :tada:: Thank you to our community! [**Twitter/X**](https://x.com/skypilot_org/status/1844770841718067638) +- [Sep 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/) +- [Sep 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI. +- [Jun 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) +- [Apr 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/) +- [Apr 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) +- [Feb 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/) +- [Feb 2024] Serving [**Code Llama 70B**](https://ai.meta.com/blog/code-llama-large-language-model-coding/) with vLLM and SkyServe: [**example**](./llm/codellama/) +- [Dec 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) +- [Nov 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/) + +**LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
Archived -- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/) -- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) -- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) -- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) -- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) -- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/) -- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/) -- [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/) -- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) -- [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command! +- [Jul 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra +- [Apr 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/) +- [Mar 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) +- [Feb 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) +- [Dec 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) +- [Sep 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) +- [Sep 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/) +- [Jul 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/) +- [Jun 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) +- [Apr 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command!
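Each entry in the news lists above links to a self-contained SkyPilot recipe in this repository. As a hedged illustration of how these recipes are typically run (the exact YAML filename under `./llm/llama-3_2/` is an assumption; check the linked directory for the real name), the end-to-end flow looks like:

```bash
# Launch the Llama 3.2 recipe on whichever enabled cloud or Kubernetes
# context SkyPilot picks; the YAML filename is assumed for illustration.
sky launch -c llama3 llm/llama-3_2/llama3_2.yaml

# Stream the task's logs, then tear the cluster down when finished.
sky logs llama3
sky down llama3
```

The same `sky launch` / `sky logs` / `sky down` cycle applies to the other examples linked above.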
diff --git a/docs/source/examples/syncing-code-artifacts.rst b/docs/source/examples/syncing-code-artifacts.rst index ded8d03f739..1b05c68b84f 100644 --- a/docs/source/examples/syncing-code-artifacts.rst +++ b/docs/source/examples/syncing-code-artifacts.rst @@ -46,31 +46,7 @@ VMs. The task is invoked under that working directory (so that it can call scripts, access checkpoints, etc.). .. note:: - - **Exclude files from syncing** - - For large, multi-gigabyte workdirs, uploading may be slow because they - are synced to the remote VM(s). To exclude large files in - your workdir from being uploaded, add them to a :code:`.skyignore` file - under your workdir. :code:`.skyignore` follows RSYNC filter rules. - - Example :code:`.skyignore` file: - - .. code-block:: - - # Files that match pattern under ONLY CURRENT directory - /hello.py - /*.txt - /dir - - # Files that match pattern under ALL directories - *.txt - hello.py - - # Files that match pattern under a directory ./dir/ - /dir/*.txt - - Do NOT use ``.`` to indicate local directory (e.g. ``./hello.py``). + To exclude large files from being uploaded, see :ref:`exclude-uploading-files`. .. note:: @@ -140,6 +116,33 @@ file_mount may be slow because they are processed by ``rsync``. Use :ref:`SkyPilot bucket mounting ` to efficiently handle large files. +.. _exclude-uploading-files: + +Exclude uploading files +-------------------------------------- +By default, SkyPilot uses your existing :code:`.gitignore` and :code:`.git/info/exclude` to exclude files from syncing. + +Alternatively, you can use :code:`.skyignore` if you want to separate SkyPilot's syncing behavior from Git's. +If you use a :code:`.skyignore` file, SkyPilot will only exclude files based on that file without using the default Git files. + +Any :code:`.skyignore` file under either your workdir or source paths of file_mounts is respected. + +:code:`.skyignore` follows RSYNC filter rules, e.g. + +.. code-block:: + + # Files that match pattern under CURRENT directory + /file.txt + /dir + /*.jar + /dir/*.jar + + # Files that match pattern under ALL directories + *.jar + file.txt + +Do _not_ use ``.`` to indicate local directory (e.g., instead of ``./file``, write ``/file``). + .. _downloading-files-and-artifacts: Downloading files and artifacts diff --git a/docs/source/reference/config.rst b/docs/source/reference/config.rst index 5c52e7487b9..b8255b46402 100644 --- a/docs/source/reference/config.rst +++ b/docs/source/reference/config.rst @@ -419,6 +419,15 @@ Available fields and semantics: # Default: 'LOCAL_CREDENTIALS'. remote_identity: LOCAL_CREDENTIALS + # Enable gVNIC (optional). + # + # Set to true to use gVNIC on GCP instances. gVNIC offers higher performance + # for multi-node clusters, but costs more. + # Reference: https://cloud.google.com/compute/docs/networking/using-gvnic + # + # Default: false. + enable_gvnic: false + # Advanced Azure configurations (optional). # Apply to all new instances but not existing ones. azure: diff --git a/docs/source/reference/kubernetes/kubernetes-deployment.rst b/docs/source/reference/kubernetes/kubernetes-deployment.rst index d7e7127f6e7..e9489e9149e 100644 --- a/docs/source/reference/kubernetes/kubernetes-deployment.rst +++ b/docs/source/reference/kubernetes/kubernetes-deployment.rst @@ -114,9 +114,9 @@ Deploying on Google Cloud GKE # Example: # gcloud container clusters get-credentials testcluster --region us-central1-c -3. [If using GPUs] If your GKE nodes have GPUs, you may need to to - `manually install `_ - nvidia drivers. 
You can do so by deploying the daemonset +3. [If using GPUs] For GKE versions newer than 1.30.1-gke.115600, NVIDIA drivers are pre-installed and no additional setup is required. If you are using an older GKE version, you may need to + `manually install `_ + NVIDIA drivers for GPU support. You can do so by deploying the daemonset depending on the GPU and OS on your nodes: .. code-block:: console @@ -133,7 +133,8 @@ Deploying on Google Cloud GKE # For Ubuntu based nodes with L4 GPUs: $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R525.yaml - To verify if GPU drivers are set up, run ``kubectl describe nodes`` and verify that ``nvidia.com/gpu`` is listed under the ``Capacity`` section. + .. tip:: + To verify if GPU drivers are set up, run ``kubectl describe nodes`` and verify that ``nvidia.com/gpu`` resource is listed under the ``Capacity`` section. 4. Verify your kubernetes cluster is correctly set up for SkyPilot by running :code:`sky check`: diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst index 4f87c8a6ee7..d7313fba3e2 100644 --- a/docs/source/reference/kubernetes/kubernetes-getting-started.rst +++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst @@ -119,6 +119,57 @@ Once your cluster administrator has :ref:`setup a Kubernetes cluster `_ for easily viewing and managing -SkyPilot tasks running on your cluster. +Below, we provide tips on how to monitor SkyPilot resources on your Kubernetes cluster. + +.. _kubernetes-observability-skystatus: + +List SkyPilot resources across all users +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +We provide a convenience command, :code:`sky status --k8s`, to view the status of all SkyPilot resources in the cluster. + +Unlike :code:`sky status` which lists only the SkyPilot resources launched by the current user, +:code:`sky status --k8s` lists all SkyPilot resources in the cluster across all users. + +.. code-block:: console + + $ sky status --k8s + Kubernetes cluster state (context: mycluster) + SkyPilot clusters + USER NAME LAUNCHED RESOURCES STATUS + alice infer-svc-1 23 hrs ago 1x Kubernetes(cpus=1, mem=1, {'L4': 1}) UP + alice sky-jobs-controller-80b50983 2 days ago 1x Kubernetes(cpus=4, mem=4) UP + alice sky-serve-controller-80b50983 23 hrs ago 1x Kubernetes(cpus=4, mem=4) UP + bob dev 1 day ago 1x Kubernetes(cpus=2, mem=8, {'H100': 1}) UP + bob multinode-dev 1 day ago 2x Kubernetes(cpus=2, mem=2) UP + bob sky-jobs-controller-2ea485ea 2 days ago 1x Kubernetes(cpus=4, mem=4) UP + + Managed jobs + In progress tasks: 1 STARTING + USER ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + alice 1 - eval 1x[CPU:1+] 2 days ago 49s 8s 0 SUCCEEDED + bob 4 - pretrain 1x[H100:4] 1 day ago 1h 1m 11s 1h 14s 0 SUCCEEDED + bob 3 - bigjob 1x[CPU:16] 1 day ago 1d 21h 11m 4s - 0 STARTING + bob 2 - failjob 1x[CPU:1+] 1 day ago 54s 9s 0 FAILED + bob 1 - shortjob 1x[CPU:1+] 2 days ago 1h 1m 19s 1h 16s 0 SUCCEEDED + + +.. _kubernetes-observability-dashboard: + +Kubernetes Dashboard +^^^^^^^^^^^^^^^^^^^^ +You can deploy tools such as the `Kubernetes dashboard `_ to easily view and manage +SkyPilot resources on your cluster. .. 
image:: ../../images/screenshots/kubernetes/kubernetes-dashboard.png :width: 80% diff --git a/docs/source/reference/yaml-spec.rst b/docs/source/reference/yaml-spec.rst index c5339bcc184..f874b4d37b4 100644 --- a/docs/source/reference/yaml-spec.rst +++ b/docs/source/reference/yaml-spec.rst @@ -22,8 +22,8 @@ Available fields: # If a relative path is used, it's evaluated relative to the location from # which `sky` is called. # - # To exclude files from syncing, add them to a .skyignore file under your working directory. - # Details: https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#uploading-code-and-project-files + # To exclude files from syncing, see + # https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files workdir: ~/my-task-code # Number of nodes (optional; defaults to 1) to launch including the head node. diff --git a/examples/deepspeed-multinode/sky.yaml b/examples/deepspeed-multinode/sky.yaml index 37d7445a2a1..07bd3746894 100644 --- a/examples/deepspeed-multinode/sky.yaml +++ b/examples/deepspeed-multinode/sky.yaml @@ -2,10 +2,16 @@ # # This takes care constructing a "hostfile" to pass to DeepSpeed. # +# If running on Kubernetes, use the nvidia/cuda:12.1.1-devel-ubuntu20.04 image +# because DeepSpeed requires nvcc. +# # Usage: # # $ sky launch sky.yaml -r --down -c ds # +# If running on Kubernetes: +# $ sky launch sky.yaml -r --down -c ds --cloud kubernetes --image nvidia/cuda:12.1.1-devel-ubuntu20.04 +# # # Optional: After the job starts running, you can log into the two nodes and # # check gpustat: # $ ssh ds @@ -18,6 +24,7 @@ resources: # accelerators: A100-80GB:1 # Azure, GCP, SCP # accelerators: A10G:1 # AWS. Will OOM for (1) single_node/run_1.3b_lora.sh (2) multi_node/run_66b.sh. # accelerators: T4:1 # AWS, Azure, GCP. Will OOM for (1) single_node/run_1.3b_lora.sh (2) multi_node/run_66b.sh. + # image_id: docker:nvidia/cuda:12.1.1-devel-ubuntu20.04 # Use this image if running on Kubernetes num_nodes: 2 @@ -28,6 +35,13 @@ envs: DEEPSPEED_ENVS: "MY_VAR_1,MY_VAR_2,SKYPILOT_NODE_RANK" setup: | + if ! command -v git &> /dev/null + then + echo "git is not installed. Installing git..." + sudo apt-get update + sudo apt-get install -y git + fi + git clone https://github.com/microsoft/DeepSpeedExamples.git || true cd DeepSpeedExamples git checkout d7c42b4f34df91035e7ed3e0c51500bb53d0bc71 @@ -39,16 +53,19 @@ setup: | conda create -n deepspeed python=3.8 -y conda activate deepspeed - pip install deepspeed + pip install deepspeed==0.14.4 cd applications/DeepSpeed-Chat pip install -r requirements.txt + + pip install transformers==4.44.0 # Required by DeepSpeed in multi-node settings. # # NOTE(skypilot): DeepSpeed uses `pdsh` to log into each node and calls # `ninja --version`; so it has to be installed system-wide rather than in # the above 'deepspeed' conda env. 
+ sudo apt-get update sudo apt-get -y install pdsh ninja-build fi diff --git a/sky/authentication.py b/sky/authentication.py index eb51aad02ad..41a7d02dfb7 100644 --- a/sky/authentication.py +++ b/sky/authentication.py @@ -43,9 +43,9 @@ from sky.adaptors import ibm from sky.adaptors import kubernetes from sky.adaptors import runpod -from sky.clouds.utils import lambda_utils from sky.provision.fluidstack import fluidstack_utils from sky.provision.kubernetes import utils as kubernetes_utils +from sky.provision.lambda_cloud import lambda_utils from sky.utils import common_utils from sky.utils import kubernetes_enums from sky.utils import subprocess_utils diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py index 1f213f5c614..caa6c9292d5 100644 --- a/sky/backends/backend_utils.py +++ b/sky/backends/backend_utils.py @@ -56,7 +56,7 @@ from sky.utils import ux_utils if typing.TYPE_CHECKING: - from sky import resources + from sky import resources as resources_lib from sky import task as task_lib from sky.backends import cloud_vm_ray_backend from sky.backends import local_docker_backend @@ -751,7 +751,7 @@ def _restore_block(new_block: Dict[str, Any], old_block: Dict[str, Any]): # TODO: too many things happening here - leaky abstraction. Refactor. @timeline.event def write_cluster_config( - to_provision: 'resources.Resources', + to_provision: 'resources_lib.Resources', num_nodes: int, cluster_config_template: str, cluster_name: str, @@ -2772,6 +2772,10 @@ def get_endpoints(cluster: str, cluster_records = get_clusters(include_controller=True, refresh=False, cluster_names=[cluster]) + if not cluster_records: + with ux_utils.print_exception_no_traceback(): + raise exceptions.ClusterNotUpError( + f'Cluster {cluster!r} not found.', cluster_status=None) assert len(cluster_records) == 1, cluster_records cluster_record = cluster_records[0] if (not skip_status_check and diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index d0ba3b2bc8f..b4ab1644401 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -2849,9 +2849,9 @@ def _provision( time.sleep(gap_seconds) continue logger.error( - f'{colorama.Fore.RED}⨯{colorama.Style.RESET_ALL} ' - 'Failed to provision resources. ' - f'{ux_utils.log_path_hint(log_path)}') + ux_utils.error_message( + 'Failed to provision resources. ' + f'{ux_utils.log_path_hint(log_path)}')) error_message += ( '\nTo keep retrying until the cluster is up, use ' 'the `--retry-until-up` flag.') diff --git a/sky/cli.py b/sky/cli.py index 87d35f58d1c..fb5a38bba7b 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -1464,54 +1464,8 @@ def _status_kubernetes(show_all: bool): Args: show_all (bool): Show all job information (e.g., start time, failures). 
""" - context = kubernetes_utils.get_current_kube_config_context_name() - try: - pods = kubernetes_utils.get_skypilot_pods(context) - except exceptions.ResourcesUnavailableError as e: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Failed to get SkyPilot pods from ' - f'Kubernetes: {str(e)}') from e - all_clusters, jobs_controllers, serve_controllers = ( - status_utils.process_skypilot_pods(pods, context)) - all_jobs = [] - with rich_utils.safe_status( - '[bold cyan]Checking in-progress managed jobs[/]') as spinner: - for i, (_, job_controller_info) in enumerate(jobs_controllers.items()): - user = job_controller_info['user'] - pod = job_controller_info['pods'][0] - status_message = ('[bold cyan]Checking managed jobs controller') - if len(jobs_controllers) > 1: - status_message += f's ({i+1}/{len(jobs_controllers)})' - spinner.update(f'{status_message}[/]') - try: - job_list = managed_jobs.queue_from_kubernetes_pod( - pod.metadata.name) - except RuntimeError as e: - logger.warning('Failed to get managed jobs from controller ' - f'{pod.metadata.name}: {str(e)}') - job_list = [] - # Add user field to jobs - for job in job_list: - job['user'] = user - all_jobs.extend(job_list) - # Reconcile cluster state between managed jobs and clusters: - # To maintain a clear separation between regular SkyPilot clusters - # and those from managed jobs, we need to exclude the latter from - # the main cluster list. - # We do this by reconstructing managed job cluster names from each - # job's name and ID. We then use this set to filter out managed - # clusters from the main cluster list. This is necessary because there - # are no identifiers distinguishing clusters from managed jobs from - # regular clusters. - managed_job_cluster_names = set() - for job in all_jobs: - # Managed job cluster name is - - managed_cluster_name = f'{job["job_name"]}-{job["job_id"]}' - managed_job_cluster_names.add(managed_cluster_name) - unmanaged_clusters = [ - c for c in all_clusters - if c['cluster_name'] not in managed_job_cluster_names - ] + all_clusters, unmanaged_clusters, all_jobs, context = ( + core.status_kubernetes()) click.echo(f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' f'Kubernetes cluster state (context: {context})' f'{colorama.Style.RESET_ALL}') @@ -1523,7 +1477,7 @@ def _status_kubernetes(show_all: bool): f'{colorama.Style.RESET_ALL}') msg = managed_jobs.format_job_table(all_jobs, show_all=show_all) click.echo(msg) - if serve_controllers: + if any(['sky-serve-controller' in c.cluster_name for c in all_clusters]): # TODO: Parse serve controllers and show services separately. # Currently we show a hint that services are shown as clusters. click.echo(f'\n{colorama.Style.DIM}Hint: SkyServe replica pods are ' @@ -4426,9 +4380,14 @@ def serve_status(all: bool, endpoint: bool, service_names: List[str]): default=False, required=False, help='Skip confirmation prompt.') +@click.option('--replica-id', + default=None, + type=int, + help='Tear down a given replica') # pylint: disable=redefined-builtin -def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): - """Teardown service(s). +def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool, + replica_id: Optional[int]): + """Teardown service(s) or a replica. SERVICE_NAMES is the name of the service (or glob pattern) to tear down. If both SERVICE_NAMES and ``--all`` are supplied, the latter takes precedence. 
@@ -4454,6 +4413,12 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): \b # Forcefully tear down a service in failed status. sky serve down failed-service --purge + \b + # Tear down a specific replica + sky serve down my-service --replica-id 1 + \b + # Forcefully tear down a specific replica, even in failed status. + sky serve down my-service --replica-id 1 --purge """ if sum([len(service_names) > 0, all]) != 1: argument_str = f'SERVICE_NAMES={",".join(service_names)}' if len( @@ -4463,22 +4428,45 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): 'Can only specify one of SERVICE_NAMES or --all. ' f'Provided {argument_str!r}.') + replica_id_is_defined = replica_id is not None + if replica_id_is_defined: + if len(service_names) != 1: + service_names_str = ', '.join(service_names) + raise click.UsageError(f'The --replica-id option can only be used ' + f'with a single service name. Got: ' + f'{service_names_str}.') + if all: + raise click.UsageError('The --replica-id option cannot be used ' + 'with the --all option.') + backend_utils.is_controller_accessible( controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER, stopped_message='All services should have been terminated.', exit_if_not_accessible=True) if not yes: - quoted_service_names = [f'{name!r}' for name in service_names] - service_identity_str = f'service(s) {", ".join(quoted_service_names)}' - if all: - service_identity_str = 'all services' - click.confirm(f'Terminating {service_identity_str}. Proceed?', - default=True, - abort=True, - show_default=True) - - serve_lib.down(service_names=service_names, all=all, purge=purge) + if replica_id_is_defined: + click.confirm( + f'Terminating replica ID {replica_id} in ' + f'{service_names[0]!r}. Proceed?', + default=True, + abort=True, + show_default=True) + else: + quoted_service_names = [f'{name!r}' for name in service_names] + service_identity_str = (f'service(s) ' + f'{", ".join(quoted_service_names)}') + if all: + service_identity_str = 'all services' + click.confirm(f'Terminating {service_identity_str}. 
Proceed?', + default=True, + abort=True, + show_default=True) + + if replica_id_is_defined: + serve_lib.terminate_replica(service_names[0], replica_id, purge) + else: + serve_lib.down(service_names=service_names, all=all, purge=purge) @serve.command('logs', cls=_DocumentedCodeCommand) diff --git a/sky/clouds/aws.py b/sky/clouds/aws.py index 2207a977f25..a0962b17cac 100644 --- a/sky/clouds/aws.py +++ b/sky/clouds/aws.py @@ -32,6 +32,14 @@ logger = sky_logging.init_logger(__name__) +# Image ID tags +_DEFAULT_CPU_IMAGE_ID = 'skypilot:custom-cpu-ubuntu' +# For GPU-related package version, +# see sky/clouds/service_catalog/images/provisioners/cuda.sh +_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu' +_DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-ubuntu-2004' +_DEFAULT_NEURON_IMAGE_ID = 'skypilot:neuron-ubuntu-2204' + # This local file (under ~/.aws/) will be uploaded to remote nodes (any # cloud), if all of the following conditions hold: # - the current user identity is not using AWS SSO @@ -217,17 +225,20 @@ def zones_provision_loop( @classmethod def _get_default_ami(cls, region_name: str, instance_type: str) -> str: acc = cls.get_accelerators_from_instance_type(instance_type) - image_id = service_catalog.get_image_id_from_tag( - 'skypilot:gpu-ubuntu-2004', region_name, clouds='aws') + image_id = service_catalog.get_image_id_from_tag(_DEFAULT_CPU_IMAGE_ID, + region_name, + clouds='aws') if acc is not None: + image_id = service_catalog.get_image_id_from_tag( + _DEFAULT_GPU_IMAGE_ID, region_name, clouds='aws') assert len(acc) == 1, acc acc_name = list(acc.keys())[0] if acc_name == 'K80': image_id = service_catalog.get_image_id_from_tag( - 'skypilot:k80-ubuntu-2004', region_name, clouds='aws') + _DEFAULT_GPU_K80_IMAGE_ID, region_name, clouds='aws') if acc_name in ['Trainium', 'Inferentia']: image_id = service_catalog.get_image_id_from_tag( - 'skypilot:neuron-ubuntu-2204', region_name, clouds='aws') + _DEFAULT_NEURON_IMAGE_ID, region_name, clouds='aws') if image_id is not None: return image_id # Raise ResourcesUnavailableError to make sure the failover in diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py index afa85f48fa5..adffd32ad88 100644 --- a/sky/clouds/azure.py +++ b/sky/clouds/azure.py @@ -329,7 +329,6 @@ def make_deploy_resources_variables( runcmd: - sed -i 's/#Banner none/Banner none/' /etc/ssh/sshd_config - echo '\\nif [ ! -f "/tmp/__restarted" ]; then\\n sudo systemctl restart ssh\\n sleep 2\\n touch /tmp/__restarted\\nfi' >> /home/skypilot:ssh_user/.bashrc - - usermod -aG docker skypilot:ssh_user write_files: - path: /etc/apt/apt.conf.d/20auto-upgrades content: | diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index b1015c92979..1b70abf914d 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -94,6 +94,12 @@ f'\nTo query common AI images: {colorama.Style.BRIGHT}gcloud compute images list --project deeplearning-platform-release | less{colorama.Style.RESET_ALL}' ) +# Image ID tags +_DEFAULT_CPU_IMAGE_ID = 'skypilot:custom-cpu-ubuntu-2204' +# For GPU-related package version, see sky/clouds/service_catalog/images/provisioners/cuda.sh +_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-2204' +_DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-debian-10' + def _run_output(cmd): proc = subprocess.run(cmd, @@ -422,7 +428,7 @@ def make_deploy_resources_variables( # --no-standard-images # We use the debian image, as the ubuntu image has some connectivity # issue when first booted. 
- image_id = 'skypilot:cpu-debian-11' + image_id = _DEFAULT_CPU_IMAGE_ID def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: if (r.disk_tier is not None and @@ -471,13 +477,16 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: 'runtime_version'] resources_vars['tpu_node_name'] = r.accelerator_args.get( 'tpu_name') + # TPU VMs require privileged mode for docker containers to + # access TPU devices. + resources_vars['docker_run_options'] = ['--privileged'] else: # Convert to GCP names: # https://cloud.google.com/compute/docs/gpus if acc in ('A100-80GB', 'L4'): # A100-80GB and L4 have a different name pattern. resources_vars['gpu'] = f'nvidia-{acc.lower()}' - elif acc == 'H100': + elif acc in ('H100', 'H100-MEGA'): resources_vars['gpu'] = f'nvidia-{acc.lower()}-80gb' else: resources_vars['gpu'] = 'nvidia-tesla-{}'.format( @@ -487,10 +496,10 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: # Though the image is called cu113, it actually has later # versions of CUDA as noted below. # CUDA driver version 470.57.02, CUDA Library 11.4 - image_id = 'skypilot:k80-debian-10' + image_id = _DEFAULT_GPU_K80_IMAGE_ID else: # CUDA driver version 535.86.10, CUDA Library 12.2 - image_id = 'skypilot:gpu-debian-11' + image_id = _DEFAULT_GPU_IMAGE_ID if (resources.image_id is not None and resources.extract_docker_image() is None): @@ -540,6 +549,11 @@ def _failover_disk_tier() -> Optional[resources_utils.DiskTier]: resources_vars[ 'force_enable_external_ips'] = skypilot_config.get_nested( ('gcp', 'force_enable_external_ips'), False) + + # Add gVNIC from config + resources_vars['enable_gvnic'] = skypilot_config.get_nested( + ('gcp', 'enable_gvnic'), False) + return resources_vars def _get_feasible_launchable_resources( diff --git a/sky/clouds/lambda_cloud.py b/sky/clouds/lambda_cloud.py index d3d20fbd41a..0201f4f76ad 100644 --- a/sky/clouds/lambda_cloud.py +++ b/sky/clouds/lambda_cloud.py @@ -8,7 +8,7 @@ from sky import clouds from sky import status_lib from sky.clouds import service_catalog -from sky.clouds.utils import lambda_utils +from sky.provision.lambda_cloud import lambda_utils from sky.utils import resources_utils if typing.TYPE_CHECKING: @@ -37,10 +37,6 @@ class Lambda(clouds.Cloud): _CLOUD_UNSUPPORTED_FEATURES = { clouds.CloudImplementationFeatures.STOP: 'Lambda cloud does not support stopping VMs.', clouds.CloudImplementationFeatures.CLONE_DISK_FROM_CLUSTER: f'Migrating disk is currently not supported on {_REPR}.', - clouds.CloudImplementationFeatures.DOCKER_IMAGE: ( - f'Docker image is currently not supported on {_REPR}. ' - 'You can try running docker command inside the `run` section in task.yaml.' 
- ), clouds.CloudImplementationFeatures.SPOT_INSTANCE: f'Spot instances are not supported in {_REPR}.', clouds.CloudImplementationFeatures.IMAGE_ID: f'Specifying image ID is not supported in {_REPR}.', clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER: f'Custom disk tiers are not supported in {_REPR}.', @@ -48,6 +44,9 @@ class Lambda(clouds.Cloud): clouds.CloudImplementationFeatures.HOST_CONTROLLERS: f'Host controllers are not supported in {_REPR}.', } + PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT + STATUS_VERSION = clouds.StatusVersion.SKYPILOT + @classmethod def _unsupported_features_for_resources( cls, resources: 'resources_lib.Resources' @@ -170,12 +169,20 @@ def make_deploy_resources_variables( else: custom_resources = None - return { + resources_vars = { 'instance_type': resources.instance_type, 'custom_resources': custom_resources, 'region': region.name, } + if acc_dict is not None: + # Lambda cloud's docker runtime information does not contain + # 'nvidia-container-runtime', causing no GPU option is added to + # the docker run command. We patch this by adding it here. + resources_vars['docker_run_options'] = ['--gpus all'] + + return resources_vars + def _get_feasible_launchable_resources( self, resources: 'resources_lib.Resources' ) -> 'resources_utils.FeasibleResources': diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py index f4ac4d577e3..810e43fe3b5 100644 --- a/sky/clouds/oci.py +++ b/sky/clouds/oci.py @@ -17,6 +17,8 @@ make_deploy_resources_variables(): Bug fix for specify the image_id as the ocid of the image in the task.yaml file, in this case the image_id for the node config should be set to the ocid instead of a dict. + - Hysun He (hysun.he@oracle.com) @ Oct 13, 2024: + Support more OS types additional to ubuntu for OCI resources. """ import json import logging @@ -295,10 +297,21 @@ def make_deploy_resources_variables( cpus=None if cpus is None else float(cpus), disk_tier=resources.disk_tier) + image_str = self._get_image_str(image_id=resources.image_id, + instance_type=resources.instance_type, + region=region.name) + + # pylint: disable=import-outside-toplevel + from sky.clouds.service_catalog import oci_catalog + os_type = oci_catalog.get_image_os_from_tag(tag=image_str, + region=region.name) + logger.debug(f'OS type for the image {image_str} is {os_type}') + return { 'instance_type': instance_type, 'custom_resources': custom_resources, 'region': region.name, + 'os_type': os_type, 'cpus': str(cpus), 'memory': resources.memory, 'disk_size': resources.disk_size, @@ -501,59 +514,45 @@ def _get_image_id( region_name: str, instance_type: str, ) -> str: - if image_id is None: - return self._get_default_image(region_name=region_name, - instance_type=instance_type) - if None in image_id: - image_id_str = image_id[None] - else: - assert region_name in image_id, image_id - image_id_str = image_id[region_name] + image_id_str = self._get_image_str(image_id=image_id, + instance_type=instance_type, + region=region_name) + if image_id_str.startswith('skypilot:'): image_id_str = service_catalog.get_image_id_from_tag(image_id_str, region_name, clouds='oci') - if image_id_str is None: - logger.critical( - '! Real image_id not found! - {region_name}:{image_id}') - # Raise ResourcesUnavailableError to make sure the failover - # in CloudVMRayBackend will be correctly triggered. - # TODO(zhwu): This is a information leakage to the cloud - # implementor, we need to find a better way to handle this. - raise exceptions.ResourcesUnavailableError( - '! 
ERR: No image found in catalog for region ' - f'{region_name}. Try setting a valid image_id.') + + # Image_id should be impossible be None, except for the case when + # user specify an image tag which does not exist in the image.csv + # catalog file which only possible in "test" / "evaluation" phase. + # Therefore, we use assert here. + assert image_id_str is not None logger.debug(f'Got real image_id {image_id_str}') return image_id_str - def _get_default_image(self, region_name: str, instance_type: str) -> str: + def _get_image_str(self, image_id: Optional[Dict[Optional[str], str]], + instance_type: str, region: str): + if image_id is None: + image_str = self._get_default_image_tag(instance_type) + elif None in image_id: + image_str = image_id[None] + else: + assert region in image_id, image_id + image_str = image_id[region] + return image_str + + def _get_default_image_tag(self, instance_type: str) -> str: acc = self.get_accelerators_from_instance_type(instance_type) if acc is None: image_tag = oci_utils.oci_config.get_default_image_tag() - image_id_str = service_catalog.get_image_id_from_tag(image_tag, - region_name, - clouds='oci') else: assert len(acc) == 1, acc image_tag = oci_utils.oci_config.get_default_gpu_image_tag() - image_id_str = service_catalog.get_image_id_from_tag(image_tag, - region_name, - clouds='oci') - if image_id_str is not None: - logger.debug( - f'Got default image_id {image_id_str} from tag {image_tag}') - return image_id_str - - # Raise ResourcesUnavailableError to make sure the failover in - # CloudVMRayBackend will be correctly triggered. - # TODO(zhwu): This is a information leakage to the cloud implementor, - # we need to find a better way to handle this. - raise exceptions.ResourcesUnavailableError( - 'ERR: No image found in catalog for region ' - f'{region_name}. Try update your default image_id settings.') + return image_tag def get_vpu_from_disktier( self, cpus: Optional[float], diff --git a/sky/clouds/service_catalog/aws_catalog.py b/sky/clouds/service_catalog/aws_catalog.py index a44750c4ec4..d156135047b 100644 --- a/sky/clouds/service_catalog/aws_catalog.py +++ b/sky/clouds/service_catalog/aws_catalog.py @@ -308,7 +308,17 @@ def list_accelerators( def get_image_id_from_tag(tag: str, region: Optional[str]) -> Optional[str]: """Returns the image id from the tag.""" - return common.get_image_id_from_tag_impl(_image_df, tag, region) + global _image_df + + image_id = common.get_image_id_from_tag_impl(_image_df, tag, region) + if image_id is None: + # Refresh the image catalog and try again, if the image tag is not + # found. 
+ logger.debug('Refreshing the image catalog and trying again.') + _image_df = common.read_catalog('aws/images.csv', + pull_frequency_hours=0) + image_id = common.get_image_id_from_tag_impl(_image_df, tag, region) + return image_id def is_image_tag_valid(tag: str, region: Optional[str]) -> bool: diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_aws.py b/sky/clouds/service_catalog/data_fetchers/fetch_aws.py index e0e5ffa21a1..b630123648e 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_aws.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_aws.py @@ -538,11 +538,13 @@ def _check_regions_integrity(df: 'pd.DataFrame', name: str): instance_df.to_csv('aws/vms.csv', index=False) print('AWS Service Catalog saved to aws/vms.csv') - image_df = get_all_regions_images_df(user_regions) - _check_regions_integrity(image_df, 'images') + # Disable refreshing images.csv as we are using skypilot custom AMIs + # See sky/clouds/service_catalog/images/README.md for more details. + # image_df = get_all_regions_images_df(user_regions) + # _check_regions_integrity(image_df, 'images') - image_df.to_csv('aws/images.csv', index=False) - print('AWS Images saved to aws/images.csv') + # image_df.to_csv('aws/images.csv', index=False) + # print('AWS Images saved to aws/images.csv') if args.az_mappings: az_mappings_df = fetch_availability_zone_mappings() diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py index eb69695aa55..097efe74deb 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_gcp.py @@ -419,6 +419,11 @@ def _get_gpus_for_zone(zone: str) -> 'pd.DataFrame': if count != 8: # H100 only has 8 cards. continue + if 'H100-MEGA-80GB' in gpu_name: + gpu_name = 'H100-MEGA' + if count != 8: + # H100-MEGA only has 8 cards. + continue if 'VWS' in gpu_name: continue if gpu_name.startswith('TPU-'): @@ -447,6 +452,7 @@ def _gpu_info_from_name(name: str) -> Optional[Dict[str, List[Dict[str, Any]]]]: 'A100-80GB': 80 * 1024, 'A100': 40 * 1024, 'H100': 80 * 1024, + 'H100-MEGA': 80 * 1024, 'P4': 8 * 1024, 'T4': 16 * 1024, 'V100': 16 * 1024, @@ -491,12 +497,17 @@ def get_gpu_price(row: pd.Series, spot: bool) -> Optional[float]: if sku['category']['usageType'] != ondemand_or_spot: continue - gpu_name = row['AcceleratorName'] - if gpu_name == 'A100-80GB': - gpu_name = 'A100 80GB' - if gpu_name == 'H100': - gpu_name = 'H100 80GB' - if f'{gpu_name} GPU' not in sku['description']: + gpu_names = [row['AcceleratorName']] + if gpu_names[0] == 'A100-80GB': + gpu_names = ['A100 80GB'] + if gpu_names[0] == 'H100': + gpu_names = ['H100 80GB'] + if gpu_names[0] == 'H100-MEGA': + # Seems that H100-MEGA has two different descriptions in SKUs in + # different regions: 'H100 80GB Mega' and 'H100 80GB Plus'. 
+ gpu_names = ['H100 80GB Mega', 'H100 80GB Plus'] + if not any(f'{gpu_name} GPU' in sku['description'] + for gpu_name in gpu_names): continue unit_price = _get_unit_price(sku) diff --git a/sky/clouds/service_catalog/gcp_catalog.py b/sky/clouds/service_catalog/gcp_catalog.py index f861b51920e..c9e15f602dc 100644 --- a/sky/clouds/service_catalog/gcp_catalog.py +++ b/sky/clouds/service_catalog/gcp_catalog.py @@ -98,6 +98,9 @@ }, 'H100': { 8: ['a3-highgpu-8g'], + }, + 'H100-MEGA': { + 8: ['a3-megagpu-8g'], } } diff --git a/sky/clouds/service_catalog/images/README.md b/sky/clouds/service_catalog/images/README.md new file mode 100644 index 00000000000..31ce7c6d9ce --- /dev/null +++ b/sky/clouds/service_catalog/images/README.md @@ -0,0 +1,72 @@ +# SkyPilot OS Image Generation Guide + +## Prerequisites +You only need to do this once. +1. Install [Packer](https://developer.hashicorp.com/packer/tutorials/aws-get-started/get-started-install-cli) +2. Download plugins used by Packer +```bash +packer init plugins.pkr.hcl +``` +3. Setup cloud credentials + +## Generate Images +```bash +export CLOUD=gcp # Update this +export TYPE=gpu # Update this +export IMAGE=skypilot-${CLOUD}-${TYPE}-ubuntu +packer build ${IMAGE}.pkr.hcl +``` +You will see the image ID after the build is complete. + +FYI time to packer build an image: + +| Cloud | Type | Approx. Time | +|-------|------|------------------------| +| AWS | GPU | 15 min | +| AWS | CPU | 10 min | +| GCP | GPU | 16 min | +| GCP | CPU | 5 min | + +### GCP +```bash +export IMAGE_NAME=skypilot-gcp-cpu-ubuntu-20241011003407 # Update this + +# Make image public +export IMAGE_ID=projects/sky-dev-465/global/images/${IMAGE_NAME} +gcloud compute images add-iam-policy-binding ${IMAGE_NAME} --member='allAuthenticatedUsers' --role='roles/compute.imageUser' +``` + +### AWS +1. Generate images for all regions +```bash +export IMAGE_ID=ami-0b31b24524afa8e47 # Update this + +python aws_utils/image_gen.py --image-id ${IMAGE_ID} --processor ${TYPE} +``` +2. Add fallback images if any region failed \ +Look for "NEED_FALLBACK" in the output `images.csv` and edit. (You can use public [ubuntu images](https://cloud-images.ubuntu.com/locator/ec2/) as fallback.) + +## Test Images +1. Minimal GPU test: `sky launch --image ${IMAGE_ID} --gpus=L4:1 --cloud ${CLOUD}` then run `nvidia-smi` in the launched instance. +2. Update the image ID in `sky/clouds/gcp.py` and run the test: +```bash +pytest tests/test_smoke.py::test_minimal --gcp +pytest tests/test_smoke.py::test_huggingface --gcp +pytest tests/test_smoke.py::test_job_queue_with_docker --gcp +pytest tests/test_smoke.py::test_cancel_gcp +``` + +## Ship Images & Cleanup +Submit a PR to update [`SkyPilot Catalog`](https://github.com/skypilot-org/skypilot-catalog/tree/master/catalogs) then clean up the old images to avoid extra iamge storage fees. + +### GCP +1. Example PR: [#86](https://github.com/skypilot-org/skypilot-catalog/pull/86) +2. Go to console and delete old images. + +### AWS +1. Copy the old custom image rows from Catalog's existing `images.csv` to a local `images.csv` in this folder. +2. Update Catalog with new images. Example PR: [#89](https://github.com/skypilot-org/skypilot-catalog/pull/89) +3. 
Delete AMIs across regions by running +```bash +python aws_utils/image_delete.py --tag ${TAG} +``` diff --git a/sky/clouds/service_catalog/images/aws_utils/image_delete.py b/sky/clouds/service_catalog/images/aws_utils/image_delete.py new file mode 100644 index 00000000000..52cbb5b2382 --- /dev/null +++ b/sky/clouds/service_catalog/images/aws_utils/image_delete.py @@ -0,0 +1,63 @@ +"""Delete all images with a given tag and their associated snapshots from images.csv + +Example Usage: put images.csv in the same folder as this script and run + python image_delete.py --tag skypilot:custom-gpu-ubuntu-2204 +""" + +import argparse +import csv +import json +import subprocess + +parser = argparse.ArgumentParser( + description='Delete AWS images and their snapshots across regions.') +parser.add_argument('--tag', + required=True, + help='Tag of the image to delete, see tags in images.csv') +args = parser.parse_args() + + +def get_snapshots(image_id, region): + cmd = f'aws ec2 describe-images --image-ids {image_id} --region {region} --query "Images[*].BlockDeviceMappings[*].Ebs.SnapshotId" --output json' + result = subprocess.run(cmd, + shell=True, + check=True, + capture_output=True, + text=True) + snapshots = json.loads(result.stdout) + return [ + snapshot for sublist in snapshots for snapshot in sublist if snapshot + ] + + +def delete_image_and_snapshots(image_id, region): + # Must get snapshots before deleting the image + snapshots = get_snapshots(image_id, region) + + # Deregister the image + cmd = f'aws ec2 deregister-image --image-id {image_id} --region {region}' + subprocess.run(cmd, shell=True, check=True) + print(f"Deregistered image {image_id} in region {region}") + + # Delete snapshots + for snapshot in snapshots: + cmd = f'aws ec2 delete-snapshot --snapshot-id {snapshot} --region {region}' + subprocess.run(cmd, shell=True, check=True) + print(f'Deleted snapshot {snapshot} in region {region}') + + +def main(): + with open('images.csv', 'r') as csvfile: + reader = csv.DictReader(csvfile) + for row in reader: + if row['Tag'] == args.tag: + try: + delete_image_and_snapshots(row['ImageId'], row['Region']) + except subprocess.CalledProcessError as e: + print( + f'Failed to delete image {row["ImageId"]} or its snapshots in region {row["Region"]}: {e}' + ) + + +if __name__ == "__main__": + main() diff --git a/sky/clouds/service_catalog/images/aws_utils/image_gen.py b/sky/clouds/service_catalog/images/aws_utils/image_gen.py new file mode 100644 index 00000000000..cb39355ad2c --- /dev/null +++ b/sky/clouds/service_catalog/images/aws_utils/image_gen.py @@ -0,0 +1,151 @@ +"""Copy SkyPilot AMI to multiple regions, make them public, and generate images.csv + +Example Usage: + python aws_image_gen.py --source-image-id ami-00000 --processor gpu +""" + +import argparse +import concurrent.futures +import csv +import json +import os +import subprocess +import threading +import time + +parser = argparse.ArgumentParser( + description='Generate AWS images across regions.') +parser.add_argument('--image-id', + required=True, + help='The source AMI ID to copy from') +parser.add_argument('--processor', required=True, help='e.g. 
gpu, cpu, etc.') +parser.add_argument('--region', + default='us-east-1', + help='Region of the source AMI') +parser.add_argument('--base-image-id', + default='ami-005fc0f236362e99f', + help='The base AMI of the source AMI.') +parser.add_argument('--os-type', default='ubuntu', help='The OS type') +parser.add_argument('--os-version', default='22.04', help='The OS version') +parser.add_argument('--output-csv', + default='images.csv', + help='The output CSV file name') +args = parser.parse_args() + +# 25 regions +ALL_REGIONS = [ + # 'us-east-1', # Source AMI is already in this region + 'us-east-2', + 'us-west-1', + 'us-west-2', + 'ca-central-1', + 'eu-central-1', # need for smoke test + 'eu-central-2', + 'eu-west-1', + 'eu-west-2', + 'eu-south-1', + 'eu-south-2', + 'eu-west-3', + 'eu-north-1', + 'me-south-1', + 'me-central-1', + 'af-south-1', + 'ap-east-1', + 'ap-south-1', + 'ap-south-2', + 'ap-northeast-3', + 'ap-northeast-2', + 'ap-southeast-1', + 'ap-southeast-2', + 'ap-southeast-3', + 'ap-northeast-1', +] + + +def make_image_public(image_id, region): + unblock_command = f"aws ec2 disable-image-block-public-access --region {region}" + subprocess.run(unblock_command, shell=True, check=True) + public_command = ( + f'aws ec2 modify-image-attribute --image-id {image_id} ' + f'--launch-permission "{{\\\"Add\\\": [{{\\\"Group\\\":\\\"all\\\"}}]}}" --region {region}' + ) + subprocess.run(public_command, shell=True, check=True) + print(f"Made {image_id} public") + + +def copy_image_and_make_public(target_region): + # Copy the AMI to the target region + copy_command = ( + f"aws ec2 copy-image --source-region {args.region} " + f"--source-image-id {args.image_id} --region {target_region} " + f"--name 'skypilot-aws-{args.processor}-{args.os_type}-{time.time()}' --output json" + ) + print(copy_command) + result = subprocess.run(copy_command, + shell=True, + check=True, + capture_output=True, + text=True) + print(result.stdout) + new_image_id = json.loads(result.stdout)['ImageId'] + print(f"Copied image to {target_region} with new image ID: {new_image_id}") + + # Wait for the image to be available + print(f"Waiting for {new_image_id} to be available...") + wait_command = f"aws ec2 wait image-available --image-ids {new_image_id} --region {target_region}" + subprocess.run(wait_command, shell=True, check=True) + + make_image_public(new_image_id, target_region) + + return new_image_id + + +def write_image_to_csv(image_id, region): + with open(args.output_csv, 'a', newline='', encoding='utf-8') as csvfile: + writer = csv.writer(csvfile) + row = [ + f'skypilot:custom-{args.processor}-{args.os_type}', region, + args.os_type, args.os_version, image_id, + time.strftime('%Y%m%d'), args.base_image_id + ] + writer.writerow(row) + print(f"Wrote to CSV: {row}") + + +def main(): + make_image_public(args.image_id, args.region) + if not os.path.exists(args.output_csv): + with open(args.output_csv, 'w', newline='') as csvfile: + writer = csv.writer(csvfile) + writer.writerow([ + 'Tag', 'Region', 'OS', 'OSVersion', 'ImageId', 'CreationDate', + 'BaseImageId' + ]) # Header + print(f"No existing {args.output_csv} so created it.") + + # Process other regions + image_cache = [(args.image_id, args.region)] + + def process_region(copy_to_region): + print(f"Start copying image to {copy_to_region}...") + try: + new_image_id = copy_image_and_make_public(copy_to_region) + except Exception as e: + print(f"Error generating image to {copy_to_region}: {str(e)}") + new_image_id = 'NEED_FALLBACK' + image_cache.append((new_image_id, 
copy_to_region)) + + with concurrent.futures.ThreadPoolExecutor() as executor: + executor.map(process_region, ALL_REGIONS) + executor.shutdown(wait=True) + + # Sort the images by it's region and write to CSV + sorted_image_cache = sorted(image_cache, key=lambda x: x[1]) + for new_image_id, copy_to_region in sorted_image_cache: + write_image_to_csv(new_image_id, copy_to_region) + + print("All done!") + + +if __name__ == "__main__": + main() diff --git a/sky/clouds/service_catalog/images/plugins.pkr.hcl b/sky/clouds/service_catalog/images/plugins.pkr.hcl new file mode 100644 index 00000000000..e007c1723bf --- /dev/null +++ b/sky/clouds/service_catalog/images/plugins.pkr.hcl @@ -0,0 +1,17 @@ +packer { + required_plugins { + amazon = { + version = ">= 1.2.8" + source = "github.com/hashicorp/amazon" + } + } +} + +packer { + required_plugins { + googlecompute = { + version = ">= 1.1.1" + source = "github.com/hashicorp/googlecompute" + } + } +} diff --git a/sky/clouds/service_catalog/images/provisioners/cloud.sh b/sky/clouds/service_catalog/images/provisioners/cloud.sh new file mode 100644 index 00000000000..b326c9fde51 --- /dev/null +++ b/sky/clouds/service_catalog/images/provisioners/cloud.sh @@ -0,0 +1,50 @@ +#!/bin/bash + +PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python + +# TODO: keep this dependency installation align with utils/controller_utils.py and setup.py +install_azure() { + echo "Install cloud dependencies on controller: Azure" + $PYTHON_EXEC -m pip install "azure-cli>=2.31.0" azure-core "azure-identity>=1.13.0" azure-mgmt-network + $PYTHON_EXEC -m pip install azure-storage-blob msgraph-sdk +} + +install_gcp() { + echo "Install cloud dependencies on controller: GCP" + $PYTHON_EXEC -m pip install "google-api-python-client>=2.69.0" + $PYTHON_EXEC -m pip install google-cloud-storage + if ! gcloud --help > /dev/null 2>&1; then + pushd /tmp &>/dev/null + mkdir -p ~/.sky/logs + wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log + tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log + rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log + mv google-cloud-sdk ~/ + ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 + echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc + source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1 + popd &>/dev/null + fi +} + +install_aws() { + echo "Install cloud dependencies on controller: AWS" + $PYTHON_EXEC -m pip install botocore>=1.29.10 boto3>=1.26.1 + $PYTHON_EXEC -m pip install "urllib3<2" awscli>=1.27.10 "colorama<0.4.5" +} + +if [ "$CLOUD" = "azure" ]; then + install_azure +elif [ "$CLOUD" = "gcp" ]; then + install_gcp +elif [ "$CLOUD" = "aws" ]; then + install_aws +else + echo "Error: Unknown cloud $CLOUD so not installing any cloud dependencies." +fi + +if [ $? -eq 0 ]; then + echo "Successfully installed cloud dependencies on controller: $CLOUD" +else + echo "Error: Failed to install cloud dependencies on controller: $CLOUD" +fi diff --git a/sky/clouds/service_catalog/images/provisioners/cuda.sh b/sky/clouds/service_catalog/images/provisioners/cuda.sh new file mode 100644 index 00000000000..1b2b4ec977e --- /dev/null +++ b/sky/clouds/service_catalog/images/provisioners/cuda.sh @@ -0,0 +1,24 @@ +#!/bin/bash + +# This script installs the latest CUDA driver and toolkit version that is compatible with all GPU types. 
+# For CUDA driver version, choose the latest version that works for ALL GPU types. +# GCP: https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#minimum-driver +# AWS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html +export DEBIAN_FRONTEND=noninteractive + +wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb +sudo dpkg -i cuda-keyring_1.1-1_all.deb +sudo apt-get update + +# Make sure CUDA toolkit and driver versions are compatible: https://docs.nvidia.com/deploy/cuda-compatibility/index.html +# Current State: Driver Version 535.183.06 and CUDA Version 12.2 +sudo apt-get install -y cuda-drivers-535 +sudo apt-get install -y cuda-toolkit-12-4 + +# Install cuDNN +# https://docs.nvidia.com/deeplearning/cudnn/latest/installation/linux.html#installing-on-linux +sudo apt-get install libcudnn8 +sudo apt-get install libcudnn8-dev + +# Cleanup +rm cuda-keyring_1.1-1_all.deb diff --git a/sky/clouds/service_catalog/images/provisioners/docker.sh b/sky/clouds/service_catalog/images/provisioners/docker.sh new file mode 100644 index 00000000000..da2366408ab --- /dev/null +++ b/sky/clouds/service_catalog/images/provisioners/docker.sh @@ -0,0 +1,22 @@ +#!/bin/bash + +# Add Docker's official GPG key: +sudo apt-get update +sudo apt-get install ca-certificates curl +sudo install -m 0755 -d /etc/apt/keyrings +sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc +sudo chmod a+r /etc/apt/keyrings/docker.asc + +# Add the repository to Apt sources: +echo \ + "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \ + $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \ + sudo tee /etc/apt/sources.list.d/docker.list > /dev/null +sudo apt-get update + +# Install Docker +sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin + +# Add user to Docker group so that user does not need to use sudo to run Docker commands +sudo usermod -aG docker $USER +newgrp docker diff --git a/sky/clouds/service_catalog/images/provisioners/nvidia-container-toolkit.sh b/sky/clouds/service_catalog/images/provisioners/nvidia-container-toolkit.sh new file mode 100644 index 00000000000..b6b3625176b --- /dev/null +++ b/sky/clouds/service_catalog/images/provisioners/nvidia-container-toolkit.sh @@ -0,0 +1,26 @@ +#!/bin/bash + +set -e + +curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && + curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | + sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | + sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list + +sudo apt-get update +sudo apt-get install -y nvidia-container-toolkit + +# if there's an empty /etc/docker/daemon.json, `nvidia-ctk runtime configure --runtime=docker` will fail +if [ -f /etc/docker/daemon.json ] && [ ! 
-s /etc/docker/daemon.json ]; then + sudo rm /etc/docker/daemon.json +fi + +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker + +# Validate +if sudo docker info -f "{{.Runtimes}}" | grep "nvidia-container-runtime"; then + echo "Successfully installed NVIDIA container runtime" +else + echo "Failed to install NVIDIA container runtime" +fi diff --git a/sky/clouds/service_catalog/images/provisioners/skypilot.sh b/sky/clouds/service_catalog/images/provisioners/skypilot.sh new file mode 100644 index 00000000000..ff2aa06b2b6 --- /dev/null +++ b/sky/clouds/service_catalog/images/provisioners/skypilot.sh @@ -0,0 +1,69 @@ +#!/bin/bash + +# Stop and disable unattended-upgrades +sudo systemctl stop unattended-upgrades || true +sudo systemctl disable unattended-upgrades || true +sudo sed -i 's/Unattended-Upgrade "1"/Unattended-Upgrade "0"/g' /etc/apt/apt.conf.d/20auto-upgrades || true + +# Configure dpkg +sudo dpkg --configure --force-overwrite -a + +# Apt-get installs +sudo apt-get install jq -y + +# Create necessary directories +mkdir -p ~/sky_workdir +mkdir -p ~/.sky/ +mkdir -p ~/.sky/sky_app +mkdir -p ~/.ssh +touch ~/.ssh/config + +# Install Miniconda +curl -o Miniconda3-Linux-x86_64.sh https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh +bash Miniconda3-Linux-x86_64.sh -b +eval "$(~/miniconda3/bin/conda shell.bash hook)" +rm Miniconda3-Linux-x86_64.sh +conda init +conda config --set auto_activate_base true +conda activate base + +# Conda, Python +echo "Creating conda env with Python 3.10" +conda create -y -n skypilot-runtime python=3.10 +conda activate skypilot-runtime +export PIP_DISABLE_PIP_VERSION_CHECK=1 +echo PATH=$PATH +python3 -m venv ~/skypilot-runtime +PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python + +# Pip installs +$PYTHON_EXEC -m pip install "setuptools<70" +$PYTHON_EXEC -m pip install "grpcio!=1.48.0,<=1.51.3,>=1.42.0" +$PYTHON_EXEC -m pip install "skypilot-nightly" + +# Install ray +RAY_ADDRESS=127.0.0.1:6380 +$PYTHON_EXEC -m pip install --exists-action w -U ray[default]==2.9.3 +export PATH=$PATH:$HOME/.local/bin +source ~/skypilot-runtime/bin/activate +which ray > ~/.sky/ray_path || exit 1 +$PYTHON_EXEC -m pip list | grep "ray " | grep 2.9.3 2>&1 > /dev/null && { + $PYTHON_EXEC -c "from sky.skylet.ray_patches import patch; patch()" || exit 1 +} + +# System configurations +sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf' +sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf' +sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity +sudo systemctl daemon-reload + +# Stop and disable Jupyter service +sudo systemctl stop jupyter > /dev/null 2>&1 || true +sudo systemctl disable jupyter > /dev/null 2>&1 || true + +# Configure fuse +[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf' + +# Cleanup +# Remove SkyPilot in OS image because when user sky launch we will install whatever version of SkyPilot user has on their local machine. 
+$PYTHON_EXEC -m pip uninstall "skypilot-nightly" -y diff --git a/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..c21fbf51b20 --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-aws-cpu-ubuntu.pkr.hcl @@ -0,0 +1,47 @@ +variable "region" { + type = string + default = "us-east-1" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "amazon-ebs" "cpu-ubuntu" { + ami_name = "skypilot-aws-cpu-ubuntu-${local.timestamp}" + instance_type = "t2.micro" + region = var.region + ssh_username = "ubuntu" + source_ami_filter { + filters = { + name = "ubuntu/images/*ubuntu-jammy-22.04-amd64-server-*" + root-device-type = "ebs" + virtualization-type = "hvm" + } + most_recent = true + owners = ["099720109477"] + } + launch_block_device_mappings { + device_name = "/dev/sda1" + volume_size = 8 + volume_type = "gp2" + delete_on_termination = true + } +} + +build { + name = "aws-cpu-ubuntu-build" + sources = ["sources.amazon-ebs.cpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=aws", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..c4a8efac4dc --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-aws-gpu-ubuntu.pkr.hcl @@ -0,0 +1,55 @@ +variable "region" { + type = string + default = "us-east-1" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "amazon-ebs" "gpu-ubuntu" { + ami_name = "skypilot-aws-gpu-ubuntu-${local.timestamp}" + instance_type = "g6.xlarge" + region = var.region + ssh_username = "ubuntu" + source_ami_filter { + filters = { + name = "ubuntu/images/*ubuntu-jammy-22.04-amd64-server-*" + root-device-type = "ebs" + virtualization-type = "hvm" + } + most_recent = true + owners = ["099720109477"] + } + launch_block_device_mappings { + device_name = "/dev/sda1" + volume_size = 30 + volume_type = "gp2" + delete_on_termination = true + } +} + +build { + name = "aws-gpu-ubuntu-build" + sources = [ + "source.amazon-ebs.gpu-ubuntu" + ] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/cuda.sh" + } + provisioner "shell" { + script = "./provisioners/nvidia-container-toolkit.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=aws", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/images/skypilot-gcp-cpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-gcp-cpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..bf3af0519e4 --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-gcp-cpu-ubuntu.pkr.hcl @@ -0,0 +1,33 @@ + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "googlecompute" "cpu-ubuntu" { + project_id = "sky-dev-465" + image_name = "skypilot-gcp-cpu-ubuntu-${local.timestamp}" + source_image_family = "ubuntu-2204-lts" + zone = "us-west1-a" + image_description = "SkyPilot custom image for launching GCP CPU instances." 
+ tags = ["packer"] + disk_size = 10 + machine_type = "e2-medium" + ssh_username = "gcpuser" +} + +build { + name = "gcp-cpu-ubuntu-build" + sources = ["sources.googlecompute.cpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=gcp", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/images/skypilot-gcp-gpu-ubuntu.pkr.hcl b/sky/clouds/service_catalog/images/skypilot-gcp-gpu-ubuntu.pkr.hcl new file mode 100644 index 00000000000..f46d414493b --- /dev/null +++ b/sky/clouds/service_catalog/images/skypilot-gcp-gpu-ubuntu.pkr.hcl @@ -0,0 +1,46 @@ +variable "zone" { + type = string + default = "us-west1-a" +} + +locals { + timestamp = regex_replace(timestamp(), "[- TZ:]", "") +} + +source "googlecompute" "gpu-ubuntu" { + image_name = "skypilot-gcp-gpu-ubuntu-${local.timestamp}" + project_id = "sky-dev-465" + source_image_family = "ubuntu-2204-lts" + zone = var.zone + image_description = "SkyPilot custom image for launching GCP GPU instances." + tags = ["packer", "gpu", "ubuntu"] + disk_size = 50 + machine_type = "g2-standard-4" + accelerator_type = "projects/sky-dev-465/zones/${var.zone}/acceleratorTypes/nvidia-l4" + accelerator_count = 1 + on_host_maintenance = "TERMINATE" + ssh_username = "gcpuser" +} + +build { + name = "gcp-gpu-ubuntu-build" + sources = ["sources.googlecompute.gpu-ubuntu"] + provisioner "shell" { + script = "./provisioners/docker.sh" + } + provisioner "shell" { + script = "./provisioners/cuda.sh" + } + provisioner "shell" { + script = "./provisioners/nvidia-container-toolkit.sh" + } + provisioner "shell" { + script = "./provisioners/skypilot.sh" + } + provisioner "shell" { + environment_vars = [ + "CLOUD=gcp", + ] + script = "./provisioners/cloud.sh" + } +} diff --git a/sky/clouds/service_catalog/oci_catalog.py b/sky/clouds/service_catalog/oci_catalog.py index a18dee79be5..47d0489f6ab 100644 --- a/sky/clouds/service_catalog/oci_catalog.py +++ b/sky/clouds/service_catalog/oci_catalog.py @@ -7,6 +7,8 @@ - Hysun He (hysun.he@oracle.com) @ Apr, 2023: Initial implementation - Hysun He (hysun.he@oracle.com) @ Jun, 2023: Reduce retry times by excluding those unsubscribed regions. + - Hysun He (hysun.he@oracle.com) @ Oct 14, 2024: Bug fix for validation + of the Marketplace images """ import logging @@ -206,4 +208,24 @@ def get_image_id_from_tag(tag: str, region: Optional[str]) -> Optional[str]: def is_image_tag_valid(tag: str, region: Optional[str]) -> bool: """Returns whether the image tag is valid.""" + # Oct.14, 2024 by Hysun He: Marketplace images are region neutral, so don't + # check with region for the Marketplace images. 
+    df = _image_df[_image_df['Tag'].str.fullmatch(tag)]
+    if df.empty:
+        return False
+    app_catalog_listing_id = df['AppCatalogListingId'].iloc[0]
+    if app_catalog_listing_id:
+        return True
     return common.is_image_tag_valid_impl(_image_df, tag, region)
+
+
+def get_image_os_from_tag(tag: str, region: Optional[str]) -> Optional[str]:
+    del region
+    df = _image_df[_image_df['Tag'].str.fullmatch(tag)]
+    if df.empty:
+        os_type = oci_utils.oci_config.get_default_image_os()
+    else:
+        os_type = df['OS'].iloc[0]
+
+    logger.debug(f'Operating system for the image {tag} is {os_type}')
+    return os_type
diff --git a/sky/clouds/utils/oci_utils.py b/sky/clouds/utils/oci_utils.py
index 3d11bab24da..86647071f3e 100644
--- a/sky/clouds/utils/oci_utils.py
+++ b/sky/clouds/utils/oci_utils.py
@@ -1,7 +1,9 @@
 """OCI Configuration.
 History:
- - Zhanghao Wu @ Oct 2023: Formatting and refactoring
  - Hysun He (hysun.he@oracle.com) @ Apr, 2023: Initial implementation
+ - Zhanghao Wu @ Oct 2023: Formatting and refactoring
+ - Hysun He (hysun.he@oracle.com) @ Oct, 2024: Add default image OS
+   configuration.
 """
 import logging
 import os
@@ -121,5 +123,13 @@ def get_profile(cls) -> str:
         return skypilot_config.get_nested(
             ('oci', 'default', 'oci_config_profile'), 'DEFAULT')
 
+    @classmethod
+    def get_default_image_os(cls) -> str:
+        # Get the default image OS. Instead of hardcoding it, we let users set
+        # the default image OS type in SkyPilot's user config file; if it is
+        # not specified there, fall back to the hardcoded default.
+        return skypilot_config.get_nested(('oci', 'default', 'image_os_type'),
+                                          'ubuntu')
+
 oci_config = OCIConfig()
diff --git a/sky/core.py b/sky/core.py
index fa695bda687..496b8b8ad5e 100644
--- a/sky/core.py
+++ b/sky/core.py
@@ -1,7 +1,7 @@
 """SDK functions for cluster/job management."""
 import getpass
 import typing
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Tuple, Union
 
 import colorama
 
@@ -11,10 +11,12 @@
 from sky import data
 from sky import exceptions
 from sky import global_user_state
+from sky import jobs as managed_jobs
 from sky import sky_logging
 from sky import status_lib
 from sky import task
 from sky.backends import backend_utils
+from sky.provision.kubernetes import utils as kubernetes_utils
 from sky.skylet import constants
 from sky.skylet import job_lib
 from sky.usage import usage_lib
@@ -111,6 +113,79 @@ def status(cluster_names: Optional[Union[str, List[str]]] = None,
         cluster_names=cluster_names)
 
 
+def status_kubernetes(
+) -> Tuple[List['kubernetes_utils.KubernetesSkyPilotClusterInfo'],
+           List['kubernetes_utils.KubernetesSkyPilotClusterInfo'], List[Dict[
+               str, Any]], Optional[str]]:
+    """Get all SkyPilot clusters and jobs in the Kubernetes cluster.
+
+    Managed jobs and services are also included in the clusters returned.
+    The caller must parse the controllers to identify which clusters are run
+    as managed jobs or services.
+
+    Returns:
+        A tuple containing:
+        - all_clusters: List of KubernetesSkyPilotClusterInfo with info for
+          all clusters, including managed jobs, services and controllers.
+        - unmanaged_clusters: List of KubernetesSkyPilotClusterInfo with info
+          for all clusters excluding managed jobs and services. Controllers
+          are included.
+        - all_jobs: List of managed jobs from all controllers. Each entry is
+          a dictionary of job info; see jobs.queue_from_kubernetes_pod for
+          details.
+        - context: Kubernetes context used to fetch the cluster information.
+    """
+    context = kubernetes_utils.get_current_kube_config_context_name()
+    try:
+        pods = kubernetes_utils.get_skypilot_pods(context)
+    except exceptions.ResourcesUnavailableError as e:
+        with ux_utils.print_exception_no_traceback():
+            raise ValueError('Failed to get SkyPilot pods from '
+                             f'Kubernetes: {str(e)}') from e
+    all_clusters, jobs_controllers, _ = (kubernetes_utils.process_skypilot_pods(
+        pods, context))
+    all_jobs = []
+    with rich_utils.safe_status(
+            ux_utils.spinner_message(
+                '[bold cyan]Checking in-progress managed jobs[/]')) as spinner:
+        for i, job_controller_info in enumerate(jobs_controllers):
+            user = job_controller_info.user
+            pod = job_controller_info.pods[0]
+            status_message = '[bold cyan]Checking managed jobs controller'
+            if len(jobs_controllers) > 1:
+                status_message += f's ({i + 1}/{len(jobs_controllers)})'
+            spinner.update(f'{status_message}[/]')
+            try:
+                job_list = managed_jobs.queue_from_kubernetes_pod(
+                    pod.metadata.name)
+            except RuntimeError as e:
+                logger.warning('Failed to get managed jobs from controller '
+                               f'{pod.metadata.name}: {str(e)}')
+                job_list = []
+            # Add user field to jobs
+            for job in job_list:
+                job['user'] = user
+            all_jobs.extend(job_list)
+    # Reconcile cluster state between managed jobs and clusters:
+    # To maintain a clear separation between regular SkyPilot clusters
+    # and those from managed jobs, we need to exclude the latter from
+    # the main cluster list.
+    # We do this by reconstructing managed job cluster names from each
+    # job's name and ID. We then use this set to filter out managed
+    # clusters from the main cluster list. This is necessary because there
+    # are no identifiers that distinguish managed-job clusters from regular
+    # clusters.
+    managed_job_cluster_names = set()
+    for job in all_jobs:
+        # Managed job cluster name is <job_name>-<job_id>
+        managed_cluster_name = f'{job["job_name"]}-{job["job_id"]}'
+        managed_job_cluster_names.add(managed_cluster_name)
+    unmanaged_clusters = [
+        c for c in all_clusters
+        if c.cluster_name not in managed_job_cluster_names
+    ]
+    return all_clusters, unmanaged_clusters, all_jobs, context
+
+
 def endpoints(cluster: str,
               port: Optional[Union[int, str]] = None) -> Dict[int, str]:
     """Gets the endpoint for a given cluster and port number (endpoint).
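As a quick illustration of the new API, here is a minimal sketch (not part of the patch) of how the tuple returned by `status_kubernetes()` might be consumed. It assumes only what the hunk above shows: the function lives in `sky/core.py`, a kubeconfig with SkyPilot pods is already set up, and each job dict carries at least `job_id`, `job_name`, and the `user` field attached by the controller loop.

```python
# Minimal sketch: consuming the 4-tuple returned by status_kubernetes().
from sky import core

all_clusters, unmanaged_clusters, all_jobs, context = core.status_kubernetes()

print(f'Kubernetes context: {context}')
for cluster in unmanaged_clusters:
    # Fields come from kubernetes_utils.KubernetesSkyPilotClusterInfo,
    # defined later in this patch.
    print(f'{cluster.cluster_name} ({cluster.resources_str}): {cluster.status}')
for job in all_jobs:
    # Job dicts are produced by jobs.queue_from_kubernetes_pod, with the
    # 'user' field added above.
    print(f"managed job {job['job_id']} ({job['job_name']}) by {job['user']}")
```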
diff --git a/sky/provision/__init__.py b/sky/provision/__init__.py index 41d985ade41..bbe92b68c3a 100644 --- a/sky/provision/__init__.py +++ b/sky/provision/__init__.py @@ -19,6 +19,7 @@ from sky.provision import fluidstack from sky.provision import gcp from sky.provision import kubernetes +from sky.provision import lambda_cloud from sky.provision import runpod from sky.provision import vsphere from sky.utils import command_runner @@ -39,6 +40,8 @@ def _wrapper(*args, **kwargs): provider_name = kwargs.pop('provider_name') module_name = provider_name.lower() + if module_name == 'lambda': + module_name = 'lambda_cloud' module = globals().get(module_name) assert module is not None, f'Unknown provider: {module_name}' diff --git a/sky/provision/docker_utils.py b/sky/provision/docker_utils.py index 7bfa1724b83..3ee5d4dfc0c 100644 --- a/sky/provision/docker_utils.py +++ b/sky/provision/docker_utils.py @@ -253,12 +253,13 @@ def initialize(self) -> str: # issue with nvidia container toolkit: # https://github.com/NVIDIA/nvidia-container-toolkit/issues/48 self._run( - '[ -f /etc/docker/daemon.json ] || ' + '{ which jq || sudo apt update && sudo apt install -y jq; } && ' + '{ [ -f /etc/docker/daemon.json ] || ' 'echo "{}" | sudo tee /etc/docker/daemon.json;' 'sudo jq \'.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]\' ' '/etc/docker/daemon.json > /tmp/daemon.json;' 'sudo mv /tmp/daemon.json /etc/docker/daemon.json;' - 'sudo systemctl restart docker') + 'sudo systemctl restart docker; } || true') user_docker_run_options = self.docker_config.get('run_options', []) start_command = docker_start_cmds( specific_image, @@ -335,7 +336,11 @@ def initialize(self) -> str: def _check_docker_installed(self): no_exist = 'NoExist' + # SkyPilot: Add the current user to the docker group first (if needed), + # before checking if docker is installed to avoid permission issues. cleaned_output = self._run( + 'id -nG $USER | grep -qw docker || ' + 'sudo usermod -aG docker $USER > /dev/null 2>&1;' f'command -v {self.docker_cmd} || echo {no_exist!r}') if no_exist in cleaned_output or 'docker' not in cleaned_output: logger.error( @@ -424,8 +429,8 @@ def _auto_configure_shm(self, run_options: List[str]) -> List[str]: def _check_container_exited(self) -> bool: if self.initialized: return True - output = (self._run(check_docker_running_cmd(self.container_name, - self.docker_cmd), - wait_for_docker_daemon=True)) - return 'false' in output.lower( - ) and 'no such object' not in output.lower() + output = self._run(check_docker_running_cmd(self.container_name, + self.docker_cmd), + wait_for_docker_daemon=True) + return ('false' in output.lower() and + 'no such object' not in output.lower()) diff --git a/sky/provision/gcp/config.py b/sky/provision/gcp/config.py index 416f0c1a694..a8292669a7c 100644 --- a/sky/provision/gcp/config.py +++ b/sky/provision/gcp/config.py @@ -670,8 +670,12 @@ def _configure_subnet(region: str, cluster_name: str, 'accessConfigs': [{ 'name': 'External NAT', 'type': 'ONE_TO_ONE_NAT', - }], + }] }] + # Add gVNIC if specified in config + enable_gvnic = config.provider_config.get('enable_gvnic', False) + if enable_gvnic: + default_interfaces[0]['nicType'] = 'gVNIC' enable_external_ips = _enable_external_ips(config) if not enable_external_ips: # Removing this key means the VM will not be assigned an external IP. 
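The gVNIC hunk above is small but easy to misread, so here is an illustrative sketch of its effective behavior: when `enable_gvnic` is set in the GCP provider config, the default network interface gains `nicType: 'gVNIC'`; otherwise the interface is left untouched. The `provider_config` dict below is made up for the example; the field names mirror the snippet in `sky/provision/gcp/config.py`.

```python
# Illustrative only: effect of the enable_gvnic flag on the default interface.
provider_config = {'enable_gvnic': True}  # hypothetical provider config

default_interfaces = [{
    'accessConfigs': [{
        'name': 'External NAT',
        'type': 'ONE_TO_ONE_NAT',
    }],
}]

# Mirrors the new code path: only add nicType when the flag is set.
if provider_config.get('enable_gvnic', False):
    default_interfaces[0]['nicType'] = 'gVNIC'

assert default_interfaces[0].get('nicType') == 'gVNIC'
```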
diff --git a/sky/provision/kubernetes/utils.py b/sky/provision/kubernetes/utils.py index 3924074838e..0156c4d1091 100644 --- a/sky/provision/kubernetes/utils.py +++ b/sky/provision/kubernetes/utils.py @@ -15,9 +15,11 @@ import yaml import sky +from sky import clouds from sky import exceptions from sky import sky_logging from sky import skypilot_config +from sky import status_lib from sky.adaptors import kubernetes from sky.provision import constants as provision_constants from sky.provision.kubernetes import network_utils @@ -30,6 +32,7 @@ if typing.TYPE_CHECKING: from sky import backends + from sky import resources as resources_lib # TODO(romilb): Move constants to constants.py DEFAULT_NAMESPACE = 'default' @@ -2023,3 +2026,113 @@ def get_skypilot_pods(context: Optional[str] = None) -> List[Any]: 'kubectl get pods --selector=skypilot-cluster --all-namespaces' ) from None return pods + + +@dataclasses.dataclass +class KubernetesSkyPilotClusterInfo: + cluster_name_on_cloud: str + cluster_name: str + user: str + status: status_lib.ClusterStatus + pods: List[Any] + launched_at: float + resources: 'resources_lib.Resources' + resources_str: str + + +def process_skypilot_pods( + pods: List[Any], + context: Optional[str] = None +) -> Tuple[List[KubernetesSkyPilotClusterInfo], + List[KubernetesSkyPilotClusterInfo], + List[KubernetesSkyPilotClusterInfo]]: + """Process SkyPilot pods on k8s to extract cluster and controller info. + + Args: + pods: List of Kubernetes pod objects. + context: Kubernetes context name, used to detect GPU label formatter. + + Returns: + A tuple containing: + - List of KubernetesSkyPilotClusterInfo with all cluster info. + - List of KubernetesSkyPilotClusterInfo with job controller info. + - List of KubernetesSkyPilotClusterInfo with serve controller info. 
+ """ + # pylint: disable=import-outside-toplevel + from sky import resources as resources_lib + clusters: Dict[str, KubernetesSkyPilotClusterInfo] = {} + jobs_controllers: List[KubernetesSkyPilotClusterInfo] = [] + serve_controllers: List[KubernetesSkyPilotClusterInfo] = [] + + for pod in pods: + cluster_name_on_cloud = pod.metadata.labels.get('skypilot-cluster') + cluster_name = cluster_name_on_cloud.rsplit( + '-', 1 + )[0] # Remove the user hash to get cluster name (e.g., mycluster-2ea4) + if cluster_name_on_cloud not in clusters: + # Parse the start time for the cluster + start_time = pod.status.start_time + if start_time is not None: + start_time = pod.status.start_time.timestamp() + + # Parse resources + cpu_request = parse_cpu_or_gpu_resource( + pod.spec.containers[0].resources.requests.get('cpu', '0')) + memory_request = parse_memory_resource( + pod.spec.containers[0].resources.requests.get('memory', '0'), + unit='G') + gpu_count = parse_cpu_or_gpu_resource( + pod.spec.containers[0].resources.requests.get( + 'nvidia.com/gpu', '0')) + gpu_name = None + if gpu_count > 0: + label_formatter, _ = (detect_gpu_label_formatter(context)) + assert label_formatter is not None, ( + 'GPU label formatter cannot be None if there are pods ' + f'requesting GPUs: {pod.metadata.name}') + gpu_label = label_formatter.get_label_key() + # Get GPU name from pod node selector + if pod.spec.node_selector is not None: + gpu_name = label_formatter.get_accelerator_from_label_value( + pod.spec.node_selector.get(gpu_label)) + + resources = resources_lib.Resources( + cloud=clouds.Kubernetes(), + cpus=int(cpu_request), + memory=int(memory_request), + accelerators=(f'{gpu_name}:{gpu_count}' + if gpu_count > 0 else None)) + if pod.status.phase == 'Pending': + # If pod is pending, do not show it in the status + continue + + cluster_info = KubernetesSkyPilotClusterInfo( + cluster_name_on_cloud=cluster_name_on_cloud, + cluster_name=cluster_name, + user=pod.metadata.labels.get('skypilot-user'), + status=status_lib.ClusterStatus.UP, + pods=[], + launched_at=start_time, + resources=resources, + resources_str='') + clusters[cluster_name_on_cloud] = cluster_info + # Check if cluster name is name of a controller + # Can't use controller_utils.Controllers.from_name(cluster_name) + # because hash is different across users + if 'sky-jobs-controller' in cluster_name_on_cloud: + jobs_controllers.append(cluster_info) + elif 'sky-serve-controller' in cluster_name_on_cloud: + serve_controllers.append(cluster_info) + else: + # Update start_time if this pod started earlier + pod_start_time = pod.status.start_time + if pod_start_time is not None: + pod_start_time = pod_start_time.timestamp() + if pod_start_time < clusters[cluster_name_on_cloud].launched_at: + clusters[cluster_name_on_cloud].launched_at = pod_start_time + clusters[cluster_name_on_cloud].pods.append(pod) + # Update resources_str in clusters: + for cluster in clusters.values(): + num_pods = len(cluster.pods) + cluster.resources_str = f'{num_pods}x {cluster.resources}' + return list(clusters.values()), jobs_controllers, serve_controllers diff --git a/sky/provision/lambda_cloud/__init__.py b/sky/provision/lambda_cloud/__init__.py new file mode 100644 index 00000000000..4992df4531b --- /dev/null +++ b/sky/provision/lambda_cloud/__init__.py @@ -0,0 +1,11 @@ +"""Lambda provisioner for SkyPilot.""" + +from sky.provision.lambda_cloud.config import bootstrap_instances +from sky.provision.lambda_cloud.instance import cleanup_ports +from sky.provision.lambda_cloud.instance import 
get_cluster_info +from sky.provision.lambda_cloud.instance import open_ports +from sky.provision.lambda_cloud.instance import query_instances +from sky.provision.lambda_cloud.instance import run_instances +from sky.provision.lambda_cloud.instance import stop_instances +from sky.provision.lambda_cloud.instance import terminate_instances +from sky.provision.lambda_cloud.instance import wait_instances diff --git a/sky/provision/lambda_cloud/config.py b/sky/provision/lambda_cloud/config.py new file mode 100644 index 00000000000..3066e7747fd --- /dev/null +++ b/sky/provision/lambda_cloud/config.py @@ -0,0 +1,10 @@ +"""Lambda Cloud configuration bootstrapping""" + +from sky.provision import common + + +def bootstrap_instances( + region: str, cluster_name: str, + config: common.ProvisionConfig) -> common.ProvisionConfig: + del region, cluster_name # unused + return config diff --git a/sky/provision/lambda_cloud/instance.py b/sky/provision/lambda_cloud/instance.py new file mode 100644 index 00000000000..d10c36496ab --- /dev/null +++ b/sky/provision/lambda_cloud/instance.py @@ -0,0 +1,261 @@ +"""Lambda instance provisioning.""" + +import time +from typing import Any, Dict, List, Optional + +from sky import authentication as auth +from sky import sky_logging +from sky import status_lib +from sky.provision import common +import sky.provision.lambda_cloud.lambda_utils as lambda_utils +from sky.utils import common_utils +from sky.utils import ux_utils + +POLL_INTERVAL = 1 + +logger = sky_logging.init_logger(__name__) +_lambda_client = None + + +def _get_lambda_client(): + global _lambda_client + if _lambda_client is None: + _lambda_client = lambda_utils.LambdaCloudClient() + return _lambda_client + + +def _filter_instances( + cluster_name_on_cloud: str, + status_filters: Optional[List[str]]) -> Dict[str, Dict[str, Any]]: + lambda_client = _get_lambda_client() + instances = lambda_client.list_instances() + possible_names = [ + f'{cluster_name_on_cloud}-head', + f'{cluster_name_on_cloud}-worker', + ] + + filtered_instances = {} + for instance in instances: + if (status_filters is not None and + instance['status'] not in status_filters): + continue + if instance.get('name') in possible_names: + filtered_instances[instance['id']] = instance + return filtered_instances + + +def _get_head_instance_id(instances: Dict[str, Any]) -> Optional[str]: + head_instance_id = None + for instance_id, instance in instances.items(): + if instance['name'].endswith('-head'): + head_instance_id = instance_id + break + return head_instance_id + + +def _get_ssh_key_name(prefix: str = '') -> str: + lambda_client = _get_lambda_client() + _, public_key_path = auth.get_or_generate_keys() + with open(public_key_path, 'r', encoding='utf-8') as f: + public_key = f.read() + name, exists = lambda_client.get_unique_ssh_key_name(prefix, public_key) + if not exists: + raise lambda_utils.LambdaCloudError('SSH key not found') + return name + + +def run_instances(region: str, cluster_name_on_cloud: str, + config: common.ProvisionConfig) -> common.ProvisionRecord: + """Runs instances for the given cluster""" + lambda_client = _get_lambda_client() + pending_status = ['booting'] + while True: + instances = _filter_instances(cluster_name_on_cloud, pending_status) + if not instances: + break + logger.info(f'Waiting for {len(instances)} instances to be ready.') + time.sleep(POLL_INTERVAL) + exist_instances = _filter_instances(cluster_name_on_cloud, ['active']) + head_instance_id = _get_head_instance_id(exist_instances) + + to_start_count = 
config.count - len(exist_instances) + if to_start_count < 0: + raise RuntimeError( + f'Cluster {cluster_name_on_cloud} already has ' + f'{len(exist_instances)} nodes, but {config.count} are required.') + if to_start_count == 0: + if head_instance_id is None: + raise RuntimeError( + f'Cluster {cluster_name_on_cloud} has no head node.') + logger.info(f'Cluster {cluster_name_on_cloud} already has ' + f'{len(exist_instances)} nodes, no need to start more.') + return common.ProvisionRecord( + provider_name='lambda', + cluster_name=cluster_name_on_cloud, + region=region, + zone=None, + head_instance_id=head_instance_id, + resumed_instance_ids=[], + created_instance_ids=[], + ) + + created_instance_ids = [] + ssh_key_name = _get_ssh_key_name() + + def launch_nodes(node_type: str, quantity: int) -> List[str]: + try: + instance_ids = lambda_client.create_instances( + instance_type=config.node_config['InstanceType'], + region=region, + name=f'{cluster_name_on_cloud}-{node_type}', + quantity=quantity, + ssh_key_name=ssh_key_name, + ) + logger.info(f'Launched {len(instance_ids)} {node_type} node(s), ' + f'instance_ids: {instance_ids}') + return instance_ids + except Exception as e: + logger.warning(f'run_instances error: {e}') + raise + + if head_instance_id is None: + instance_ids = launch_nodes('head', 1) + assert len(instance_ids) == 1 + created_instance_ids.append(instance_ids[0]) + head_instance_id = instance_ids[0] + + assert head_instance_id is not None, 'head_instance_id should not be None' + + worker_node_count = to_start_count - 1 + if worker_node_count > 0: + instance_ids = launch_nodes('worker', worker_node_count) + created_instance_ids.extend(instance_ids) + + while True: + instances = _filter_instances(cluster_name_on_cloud, ['active']) + if len(instances) == config.count: + break + + time.sleep(POLL_INTERVAL) + + return common.ProvisionRecord( + provider_name='lambda', + cluster_name=cluster_name_on_cloud, + region=region, + zone=None, + head_instance_id=head_instance_id, + resumed_instance_ids=[], + created_instance_ids=created_instance_ids, + ) + + +def wait_instances(region: str, cluster_name_on_cloud: str, + state: Optional[status_lib.ClusterStatus]) -> None: + del region, cluster_name_on_cloud, state # Unused. 
+ + +def stop_instances( + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, + worker_only: bool = False, +) -> None: + raise NotImplementedError( + 'stop_instances is not supported for Lambda Cloud') + + +def terminate_instances( + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, + worker_only: bool = False, +) -> None: + """See sky/provision/__init__.py""" + del provider_config + lambda_client = _get_lambda_client() + instances = _filter_instances(cluster_name_on_cloud, None) + + instance_ids_to_terminate = [] + for instance_id, instance in instances.items(): + if worker_only and not instance['name'].endswith('-worker'): + continue + instance_ids_to_terminate.append(instance_id) + + try: + logger.debug( + f'Terminating instances {", ".join(instance_ids_to_terminate)}') + lambda_client.remove_instances(instance_ids_to_terminate) + except Exception as e: # pylint: disable=broad-except + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + f'Failed to terminate instances {instance_ids_to_terminate}: ' + f'{common_utils.format_exception(e, use_bracket=False)}') from e + + +def get_cluster_info( + region: str, + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, +) -> common.ClusterInfo: + del region # unused + running_instances = _filter_instances(cluster_name_on_cloud, ['active']) + instances: Dict[str, List[common.InstanceInfo]] = {} + head_instance_id = None + for instance_id, instance_info in running_instances.items(): + instances[instance_id] = [ + common.InstanceInfo( + instance_id=instance_id, + internal_ip=instance_info['private_ip'], + external_ip=instance_info['ip'], + ssh_port=22, + tags={}, + ) + ] + if instance_info['name'].endswith('-head'): + head_instance_id = instance_id + + return common.ClusterInfo( + instances=instances, + head_instance_id=head_instance_id, + provider_name='lambda', + provider_config=provider_config, + ) + + +def query_instances( + cluster_name_on_cloud: str, + provider_config: Optional[Dict[str, Any]] = None, + non_terminated_only: bool = True, +) -> Dict[str, Optional[status_lib.ClusterStatus]]: + """See sky/provision/__init__.py""" + assert provider_config is not None, (cluster_name_on_cloud, provider_config) + instances = _filter_instances(cluster_name_on_cloud, None) + + status_map = { + 'booting': status_lib.ClusterStatus.INIT, + 'active': status_lib.ClusterStatus.UP, + 'unhealthy': status_lib.ClusterStatus.INIT, + 'terminating': status_lib.ClusterStatus.INIT, + } + statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {} + for instance_id, instance in instances.items(): + status = status_map.get(instance['status']) + if non_terminated_only and status is None: + continue + statuses[instance_id] = status + return statuses + + +def open_ports( + cluster_name_on_cloud: str, + ports: List[str], + provider_config: Optional[Dict[str, Any]] = None, +) -> None: + raise NotImplementedError('open_ports is not supported for Lambda Cloud') + + +def cleanup_ports( + cluster_name_on_cloud: str, + ports: List[str], + provider_config: Optional[Dict[str, Any]] = None, +) -> None: + """See sky/provision/__init__.py""" + del cluster_name_on_cloud, ports, provider_config # Unused. 
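To summarize how the new Lambda Cloud provisioner above identifies cluster nodes and reports their state, a small self-contained sketch follows. The instance records are fabricated for illustration; the `-head`/`-worker` naming convention and the status mapping are taken directly from `_filter_instances()` and `query_instances()`.

```python
# Illustrative only: fabricated Lambda Cloud instance records, filtered and
# mapped the same way the provisioner above does.
from sky import status_lib

status_map = {
    'booting': status_lib.ClusterStatus.INIT,
    'active': status_lib.ClusterStatus.UP,
    'unhealthy': status_lib.ClusterStatus.INIT,
    'terminating': status_lib.ClusterStatus.INIT,
}

cluster_name_on_cloud = 'mycluster-2ea4'  # hypothetical
instances = [
    {'id': 'inst-1', 'name': f'{cluster_name_on_cloud}-head', 'status': 'active'},
    {'id': 'inst-2', 'name': f'{cluster_name_on_cloud}-worker', 'status': 'booting'},
    {'id': 'inst-3', 'name': 'unrelated-vm', 'status': 'active'},
]

possible_names = [
    f'{cluster_name_on_cloud}-head',
    f'{cluster_name_on_cloud}-worker',
]
for inst in instances:
    if inst['name'] not in possible_names:
        continue  # Not part of this SkyPilot cluster.
    print(inst['id'], status_map.get(inst['status']))
```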
diff --git a/sky/clouds/utils/lambda_utils.py b/sky/provision/lambda_cloud/lambda_utils.py similarity index 92% rename from sky/clouds/utils/lambda_utils.py rename to sky/provision/lambda_cloud/lambda_utils.py index 61c4b33ebe9..339919e80e7 100644 --- a/sky/clouds/utils/lambda_utils.py +++ b/sky/provision/lambda_cloud/lambda_utils.py @@ -1,4 +1,5 @@ """Lambda Cloud helper functions.""" + import json import os import time @@ -76,7 +77,7 @@ def refresh(self, instance_ids: List[str]) -> None: def raise_lambda_error(response: requests.Response) -> None: - """Raise LambdaCloudError if appropriate. """ + """Raise LambdaCloudError if appropriate.""" status_code = response.status_code if status_code == 200: return @@ -131,20 +132,22 @@ def __init__(self) -> None: self.api_key = self._credentials['api_key'] self.headers = {'Authorization': f'Bearer {self.api_key}'} - def create_instances(self, - instance_type: str = 'gpu_1x_a100_sxm4', - region: str = 'us-east-1', - quantity: int = 1, - name: str = '', - ssh_key_name: str = '') -> List[str]: + def create_instances( + self, + instance_type: str = 'gpu_1x_a100_sxm4', + region: str = 'us-east-1', + quantity: int = 1, + name: str = '', + ssh_key_name: str = '', + ) -> List[str]: """Launch new instances.""" # Optimization: # Most API requests are rate limited at ~1 request every second but # launch requests are rate limited at ~1 request every 10 seconds. # So don't use launch requests to check availability. # See https://docs.lambdalabs.com/cloud/rate-limiting/ for more. - available_regions = self.list_catalog()[instance_type]\ - ['regions_with_capacity_available'] + available_regions = (self.list_catalog()[instance_type] + ['regions_with_capacity_available']) available_regions = [reg['name'] for reg in available_regions] if region not in available_regions: if len(available_regions) > 0: @@ -163,27 +166,25 @@ def create_instances(self, 'instance_type_name': instance_type, 'ssh_key_names': [ssh_key_name], 'quantity': quantity, - 'name': name + 'name': name, }) response = _try_request_with_backoff( 'post', f'{API_ENDPOINT}/instance-operations/launch', data=data, - headers=self.headers) + headers=self.headers, + ) return response.json().get('data', []).get('instance_ids', []) - def remove_instances(self, *instance_ids: str) -> Dict[str, Any]: + def remove_instances(self, instance_ids: List[str]) -> Dict[str, Any]: """Terminate instances.""" - data = json.dumps({ - 'instance_ids': [ - instance_ids[0] # TODO(ewzeng) don't hardcode - ] - }) + data = json.dumps({'instance_ids': instance_ids}) response = _try_request_with_backoff( 'post', f'{API_ENDPOINT}/instance-operations/terminate', data=data, - headers=self.headers) + headers=self.headers, + ) return response.json().get('data', []).get('terminated_instances', []) def list_instances(self) -> List[Dict[str, Any]]: diff --git a/sky/provision/paperspace/utils.py b/sky/provision/paperspace/utils.py index db2da7b4610..d9eceefba19 100644 --- a/sky/provision/paperspace/utils.py +++ b/sky/provision/paperspace/utils.py @@ -132,6 +132,8 @@ def set_sky_key_script(self, public_key: str) -> None: 'apt-get update \n' 'apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin \n' # pylint: disable=line-too-long 'fi \n' + # TODO(tian): Maybe remove this as well since we are now adding + # users to docker group in the DockerInitializer. Need to test. 
'usermod -aG docker paperspace \n' f'echo "{public_key}" >> /home/paperspace/.ssh/authorized_keys \n') try: diff --git a/sky/provision/provisioner.py b/sky/provision/provisioner.py index b2ac6d6660f..7706a3d489b 100644 --- a/sky/provision/provisioner.py +++ b/sky/provision/provisioner.py @@ -571,7 +571,10 @@ def post_provision_runtime_setup( provision_record=provision_record, custom_resource=custom_resource) except Exception: # pylint: disable=broad-except - logger.error('*** Failed setting up cluster. ***') + logger.error( + ux_utils.error_message( + 'Failed to set up SkyPilot runtime on cluster.', + provision_logging.config.log_path)) logger.debug(f'Stacktrace:\n{traceback.format_exc()}') with ux_utils.print_exception_no_traceback(): raise diff --git a/sky/resources.py b/sky/resources.py index e9a522cef48..384f2b6a548 100644 --- a/sky/resources.py +++ b/sky/resources.py @@ -842,12 +842,6 @@ def _try_validate_image_id(self) -> None: if self.extract_docker_image() is not None: # TODO(tian): validate the docker image exists / of reasonable size - if self.accelerators is not None: - for acc in self.accelerators.keys(): - if acc.lower().startswith('tpu'): - with ux_utils.print_exception_no_traceback(): - raise ValueError( - 'Docker image is not supported for TPU VM.') if self.cloud is not None: self.cloud.check_features_are_supported( self, {clouds.CloudImplementationFeatures.DOCKER_IMAGE}) @@ -1032,6 +1026,12 @@ def make_deploy_variables(self, cluster_name: resources_utils.ClusterName, self.accelerators is not None): initial_setup_commands = [constants.DISABLE_GPU_ECC_COMMAND] + docker_image = self.extract_docker_image() + + # Cloud specific variables + cloud_specific_variables = self.cloud.make_deploy_resources_variables( + self, cluster_name, region, zones, dryrun) + # Docker run options docker_run_options = skypilot_config.get_nested( ('docker', 'run_options'), @@ -1039,18 +1039,17 @@ def make_deploy_variables(self, cluster_name: resources_utils.ClusterName, override_configs=self.cluster_config_overrides) if isinstance(docker_run_options, str): docker_run_options = [docker_run_options] + # Special accelerator runtime might require additional docker run + # options. e.g., for TPU, we need --privileged. 
+ if 'docker_run_options' in cloud_specific_variables: + docker_run_options.extend( + cloud_specific_variables['docker_run_options']) if docker_run_options and isinstance(self.cloud, clouds.Kubernetes): logger.warning( f'{colorama.Style.DIM}Docker run options are specified, ' 'but ignored for Kubernetes: ' f'{" ".join(docker_run_options)}' f'{colorama.Style.RESET_ALL}') - - docker_image = self.extract_docker_image() - - # Cloud specific variables - cloud_specific_variables = self.cloud.make_deploy_resources_variables( - self, cluster_name, region, zones, dryrun) return dict( cloud_specific_variables, **{ diff --git a/sky/serve/__init__.py b/sky/serve/__init__.py index d85b6e9311e..f93495809c3 100644 --- a/sky/serve/__init__.py +++ b/sky/serve/__init__.py @@ -8,6 +8,7 @@ from sky.serve.core import down from sky.serve.core import status from sky.serve.core import tail_logs +from sky.serve.core import terminate_replica from sky.serve.core import up from sky.serve.core import update from sky.serve.serve_state import ReplicaStatus @@ -42,6 +43,7 @@ 'SKY_SERVE_CONTROLLER_NAME', 'SKYSERVE_METADATA_DIR', 'status', + 'terminate_replica', 'tail_logs', 'up', 'update', diff --git a/sky/serve/constants.py b/sky/serve/constants.py index 7775c3f8a6e..3974293190e 100644 --- a/sky/serve/constants.py +++ b/sky/serve/constants.py @@ -92,4 +92,11 @@ # change for the serve_utils.ServeCodeGen, we need to bump this version, so that # the user can be notified to update their SkyPilot serve version on the remote # cluster. -SERVE_VERSION = 1 +# Changelog: +# v1.0 - Introduce rolling update. +# v2.0 - Added template-replica feature. +SERVE_VERSION = 2 + +TERMINATE_REPLICA_VERSION_MISMATCH_ERROR = ( + 'The version of service is outdated and does not support manually ' + 'terminating replicas. Please terminate the service and spin up again.') diff --git a/sky/serve/controller.py b/sky/serve/controller.py index 580964273ef..75d14b76079 100644 --- a/sky/serve/controller.py +++ b/sky/serve/controller.py @@ -9,6 +9,7 @@ import traceback from typing import Any, Dict, List +import colorama import fastapi from fastapi import responses import uvicorn @@ -157,6 +158,75 @@ async def update_service(request: fastapi.Request) -> fastapi.Response: return responses.JSONResponse(content={'message': 'Error'}, status_code=500) + @self._app.post('/controller/terminate_replica') + async def terminate_replica( + request: fastapi.Request) -> fastapi.Response: + request_data = await request.json() + replica_id = request_data['replica_id'] + assert isinstance(replica_id, + int), 'Error: replica ID must be an integer.' + purge = request_data['purge'] + assert isinstance(purge, bool), 'Error: purge must be a boolean.' + replica_info = serve_state.get_replica_info_from_id( + self._service_name, replica_id) + assert replica_info is not None, (f'Error: replica ' + f'{replica_id} does not exist.') + replica_status = replica_info.status + + if replica_status == serve_state.ReplicaStatus.SHUTTING_DOWN: + return responses.JSONResponse( + status_code=409, + content={ + 'message': + f'Replica {replica_id} of service ' + f'{self._service_name!r} is already in the process ' + f'of terminating. Skip terminating now.' + }) + + if (replica_status in serve_state.ReplicaStatus.failed_statuses() + and not purge): + return responses.JSONResponse( + status_code=409, + content={ + 'message': f'{colorama.Fore.YELLOW}Replica ' + f'{replica_id} of service ' + f'{self._service_name!r} is in failed ' + f'status ({replica_info.status}). 
' + f'Skipping its termination as it could ' + f'lead to a resource leak. ' + f'(Use `sky serve down ' + f'{self._service_name!r} --replica-id ' + f'{replica_id} --purge` to ' + 'forcefully terminate the replica.)' + f'{colorama.Style.RESET_ALL}' + }) + + self._replica_manager.scale_down(replica_id, purge=purge) + + action = 'terminated' if not purge else 'purged' + message = (f'{colorama.Fore.GREEN}Replica {replica_id} of service ' + f'{self._service_name!r} is scheduled to be ' + f'{action}.{colorama.Style.RESET_ALL}\n' + f'Please use {ux_utils.BOLD}sky serve status ' + f'{self._service_name}{ux_utils.RESET_BOLD} ' + f'to check the latest status.') + return responses.JSONResponse(status_code=200, + content={'message': message}) + + @self._app.exception_handler(Exception) + async def validation_exception_handler( + request: fastapi.Request, exc: Exception) -> fastapi.Response: + with ux_utils.enable_traceback(): + logger.error(f'Error in controller: {exc!r}') + return responses.JSONResponse( + status_code=500, + content={ + 'message': + (f'Failed method {request.method} at URL {request.url}.' + f' Exception message is {exc!r}.') + }, + ) + threading.Thread(target=self._run_autoscaler).start() logger.info('SkyServe Controller started on ' diff --git a/sky/serve/core.py b/sky/serve/core.py index 3ad260213f1..691a3edea0b 100644 --- a/sky/serve/core.py +++ b/sky/serve/core.py @@ -503,6 +503,53 @@ def down( sky_logging.print(stdout) +@usage_lib.entrypoint +def terminate_replica(service_name: str, replica_id: int, purge: bool) -> None: + """Tear down a specific replica for the given service. + + Args: + service_name: Name of the service. + replica_id: ID of replica to terminate. + purge: Whether to terminate replicas in a failed status. These replicas + may lead to resource leaks, so we require the user to explicitly + specify this flag to make sure they are aware of this potential + resource leak. + + Raises: + sky.exceptions.ClusterNotUpError: if the sky sere controller is not up. + RuntimeError: if failed to terminate the replica. + """ + handle = backend_utils.is_controller_accessible( + controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER, + stopped_message= + 'No service is running now. Please spin up a service first.', + non_existent_message='No service is running now. ' + 'Please spin up a service first.', + ) + + backend = backend_utils.get_backend_from_handle(handle) + assert isinstance(backend, backends.CloudVmRayBackend) + + code = serve_utils.ServeCodeGen.terminate_replica(service_name, replica_id, + purge) + returncode, stdout, stderr = backend.run_on_head(handle, + code, + require_outputs=True, + stream_logs=False, + separate_stderr=True) + + try: + subprocess_utils.handle_returncode(returncode, + code, + 'Failed to terminate the replica', + stderr, + stream_logs=True) + except exceptions.CommandError as e: + raise RuntimeError(e.error_msg) from e + + sky_logging.print(stdout) + + @usage_lib.entrypoint def status( service_names: Optional[Union[str, diff --git a/sky/serve/replica_managers.py b/sky/serve/replica_managers.py index 337b28ba61b..c0e5220e779 100644 --- a/sky/serve/replica_managers.py +++ b/sky/serve/replica_managers.py @@ -247,6 +247,8 @@ class ReplicaStatusProperty: is_scale_down: bool = False # The replica's spot instance was preempted. preempted: bool = False + # Whether the replica is purged. + purged: bool = False def remove_terminated_replica(self) -> bool: """Whether to remove the replica record from the replica table. 
@@ -307,6 +309,8 @@ def should_track_service_status(self) -> bool: return False if self.preempted: return False + if self.purged: + return False return True def to_replica_status(self) -> serve_state.ReplicaStatus: @@ -590,7 +594,7 @@ def scale_up(self, """ raise NotImplementedError - def scale_down(self, replica_id: int) -> None: + def scale_down(self, replica_id: int, purge: bool = False) -> None: """Scale down replica with replica_id.""" raise NotImplementedError @@ -679,7 +683,8 @@ def _terminate_replica(self, replica_id: int, sync_down_logs: bool, replica_drain_delay_seconds: int, - is_scale_down: bool = False) -> None: + is_scale_down: bool = False, + purge: bool = False) -> None: if replica_id in self._launch_process_pool: info = serve_state.get_replica_info_from_id(self._service_name, @@ -763,16 +768,18 @@ def _download_and_stream_logs(info: ReplicaInfo): ) info.status_property.sky_down_status = ProcessStatus.RUNNING info.status_property.is_scale_down = is_scale_down + info.status_property.purged = purge serve_state.add_or_update_replica(self._service_name, replica_id, info) p.start() self._down_process_pool[replica_id] = p - def scale_down(self, replica_id: int) -> None: + def scale_down(self, replica_id: int, purge: bool = False) -> None: self._terminate_replica( replica_id, sync_down_logs=False, replica_drain_delay_seconds=_DEFAULT_DRAIN_SECONDS, - is_scale_down=True) + is_scale_down=True, + purge=purge) def _handle_preemption(self, info: ReplicaInfo) -> bool: """Handle preemption of the replica if any error happened. @@ -911,6 +918,8 @@ def _refresh_process_pool(self) -> None: # since user should fixed the error before update. elif info.version != self.latest_version: removal_reason = 'for version outdated' + elif info.status_property.purged: + removal_reason = 'for purge' else: logger.info(f'Termination of replica {replica_id} ' 'finished. Replica info is kept since some ' diff --git a/sky/serve/serve_utils.py b/sky/serve/serve_utils.py index 0ecf34135a7..3a416dd2932 100644 --- a/sky/serve/serve_utils.py +++ b/sky/serve/serve_utils.py @@ -246,9 +246,11 @@ def set_service_status_and_active_versions_from_replica( update_mode: UpdateMode) -> None: record = serve_state.get_service_from_name(service_name) if record is None: - raise ValueError('The service is up-ed in an old version and does not ' - 'support update. Please `sky serve down` ' - 'it first and relaunch the service.') + with ux_utils.print_exception_no_traceback(): + raise ValueError( + 'The service is up-ed in an old version and does not ' + 'support update. 
Please `sky serve down` ' + 'it first and relaunch the service.') if record['status'] == serve_state.ServiceStatus.SHUTTING_DOWN: # When the service is shutting down, there is a period of time which the # controller still responds to the request, and the replica is not @@ -289,7 +291,8 @@ def update_service_status() -> None: def update_service_encoded(service_name: str, version: int, mode: str) -> str: service_status = _get_service_status(service_name) if service_status is None: - raise ValueError(f'Service {service_name!r} does not exist.') + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Service {service_name!r} does not exist.') controller_port = service_status['controller_port'] resp = requests.post( _CONTROLLER_URL.format(CONTROLLER_PORT=controller_port) + @@ -299,20 +302,56 @@ def update_service_encoded(service_name: str, version: int, mode: str) -> str: 'mode': mode, }) if resp.status_code == 404: - raise ValueError('The service is up-ed in an old version and does not ' - 'support update. Please `sky serve down` ' - 'it first and relaunch the service. ') + with ux_utils.print_exception_no_traceback(): + raise ValueError( + 'The service is up-ed in an old version and does not ' + 'support update. Please `sky serve down` ' + 'it first and relaunch the service. ') elif resp.status_code == 400: - raise ValueError(f'Client error during service update: {resp.text}') + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Client error during service update: {resp.text}') elif resp.status_code == 500: - raise RuntimeError(f'Server error during service update: {resp.text}') + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + f'Server error during service update: {resp.text}') elif resp.status_code != 200: - raise ValueError(f'Failed to update service: {resp.text}') + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Failed to update service: {resp.text}') service_msg = resp.json()['message'] return common_utils.encode_payload(service_msg) +def terminate_replica(service_name: str, replica_id: int, purge: bool) -> str: + service_status = _get_service_status(service_name) + if service_status is None: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Service {service_name!r} does not exist.') + replica_info = serve_state.get_replica_info_from_id(service_name, + replica_id) + if replica_info is None: + with ux_utils.print_exception_no_traceback(): + raise ValueError( + f'Replica {replica_id} for service {service_name} does not ' + 'exist.') + + controller_port = service_status['controller_port'] + resp = requests.post( + _CONTROLLER_URL.format(CONTROLLER_PORT=controller_port) + + '/controller/terminate_replica', + json={ + 'replica_id': replica_id, + 'purge': purge, + }) + + message: str = resp.json()['message'] + if resp.status_code != 200: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Failed to terminate replica {replica_id} ' + f'in {service_name}. 
Reason:\n{message}') + return message + + def _get_service_status( service_name: str, with_replica_info: bool = True) -> Optional[Dict[str, Any]]: @@ -644,8 +683,9 @@ def _get_replica_status() -> serve_state.ReplicaStatus: for info in replica_info: if info.replica_id == replica_id: return info.status - raise ValueError( - _FAILED_TO_FIND_REPLICA_MSG.format(replica_id=replica_id)) + with ux_utils.print_exception_no_traceback(): + raise ValueError( + _FAILED_TO_FIND_REPLICA_MSG.format(replica_id=replica_id)) finish_stream = ( lambda: _get_replica_status() != serve_state.ReplicaStatus.PROVISIONING) @@ -735,7 +775,7 @@ def _get_replicas(service_record: Dict[str, Any]) -> str: def get_endpoint(service_record: Dict[str, Any]) -> str: - # Don't use backend_utils.is_controller_up since it is too slow. + # Don't use backend_utils.is_controller_accessible since it is too slow. handle = global_user_state.get_handle_from_cluster_name( SKY_SERVE_CONTROLLER_NAME) assert isinstance(handle, backends.CloudVmRayResourceHandle) @@ -915,6 +955,18 @@ def terminate_services(cls, service_names: Optional[List[str]], ] return cls._build(code) + @classmethod + def terminate_replica(cls, service_name: str, replica_id: int, + purge: bool) -> str: + code = [ + f'(lambda: print(serve_utils.terminate_replica({service_name!r}, ' + f'{replica_id}, {purge}), end="", flush=True) ' + 'if getattr(constants, "SERVE_VERSION", 0) >= 2 else ' + f'exec("raise RuntimeError(' + f'{constants.TERMINATE_REPLICA_VERSION_MISMATCH_ERROR!r})"))()' + ] + return cls._build(code) + @classmethod def wait_service_registration(cls, service_name: str, job_id: int) -> str: code = [ diff --git a/sky/setup_files/MANIFEST.in b/sky/setup_files/MANIFEST.in index 54ab3b55a32..0cd93f485e0 100644 --- a/sky/setup_files/MANIFEST.in +++ b/sky/setup_files/MANIFEST.in @@ -6,7 +6,6 @@ include sky/setup_files/* include sky/skylet/*.sh include sky/skylet/LICENSE include sky/skylet/providers/ibm/* -include sky/skylet/providers/lambda_cloud/* include sky/skylet/providers/oci/* include sky/skylet/providers/scp/* include sky/skylet/providers/*.py diff --git a/sky/skylet/constants.py b/sky/skylet/constants.py index 5729d75c968..032ad5d25b1 100644 --- a/sky/skylet/constants.py +++ b/sky/skylet/constants.py @@ -155,8 +155,8 @@ # We use --system-site-packages to reuse the system site packages to avoid # the overhead of installing the same packages in the new environment. 
f'[ -d {SKY_REMOTE_PYTHON_ENV} ] || ' - f'{{ {SKY_PYTHON_CMD} -m venv {SKY_REMOTE_PYTHON_ENV} --system-site-packages && ' - f'echo "$(echo {SKY_REMOTE_PYTHON_ENV})/bin/python" > {SKY_PYTHON_PATH_FILE}; }};' + f'{SKY_PYTHON_CMD} -m venv {SKY_REMOTE_PYTHON_ENV} --system-site-packages;' + f'echo "$(echo {SKY_REMOTE_PYTHON_ENV})/bin/python" > {SKY_PYTHON_PATH_FILE};' ) _sky_version = str(version.parse(sky.__version__)) diff --git a/sky/skylet/providers/lambda_cloud/__init__.py b/sky/skylet/providers/lambda_cloud/__init__.py deleted file mode 100644 index 64dac295eb5..00000000000 --- a/sky/skylet/providers/lambda_cloud/__init__.py +++ /dev/null @@ -1,2 +0,0 @@ -"""Lambda Cloud node provider""" -from sky.skylet.providers.lambda_cloud.node_provider import LambdaNodeProvider diff --git a/sky/skylet/providers/lambda_cloud/node_provider.py b/sky/skylet/providers/lambda_cloud/node_provider.py deleted file mode 100644 index 557afe75568..00000000000 --- a/sky/skylet/providers/lambda_cloud/node_provider.py +++ /dev/null @@ -1,320 +0,0 @@ -import logging -import os -from threading import RLock -import time -from typing import Any, Dict, List, Optional - -from ray.autoscaler.node_provider import NodeProvider -from ray.autoscaler.tags import NODE_KIND_HEAD -from ray.autoscaler.tags import NODE_KIND_WORKER -from ray.autoscaler.tags import STATUS_UP_TO_DATE -from ray.autoscaler.tags import TAG_RAY_CLUSTER_NAME -from ray.autoscaler.tags import TAG_RAY_NODE_KIND -from ray.autoscaler.tags import TAG_RAY_NODE_NAME -from ray.autoscaler.tags import TAG_RAY_NODE_STATUS -from ray.autoscaler.tags import TAG_RAY_USER_NODE_TYPE - -from sky import authentication as auth -from sky.clouds.utils import lambda_utils -from sky.utils import command_runner -from sky.utils import common_utils -from sky.utils import subprocess_utils -from sky.utils import ux_utils - -_TAG_PATH_PREFIX = '~/.sky/generated/lambda_cloud/metadata' -_REMOTE_SSH_KEY_NAME = '~/.lambda_cloud/ssh_key_name' -_REMOTE_RAY_SSH_KEY = '~/ray_bootstrap_key.pem' -_REMOTE_RAY_YAML = '~/ray_bootstrap_config.yaml' -_GET_INTERNAL_IP_CMD = 's=$(ip -4 -br addr show | grep UP); echo "$s"; echo "$s" | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1])|104\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"' - -logger = logging.getLogger(__name__) - - -def synchronized(f): - - def wrapper(self, *args, **kwargs): - self.lock.acquire() - try: - return f(self, *args, **kwargs) - finally: - self.lock.release() - - return wrapper - - -class LambdaNodeProvider(NodeProvider): - """Node Provider for Lambda Cloud. - - This provider assumes Lambda Cloud credentials are set. 
- """ - - def __init__(self, provider_config: Dict[str, Any], - cluster_name: str) -> None: - NodeProvider.__init__(self, provider_config, cluster_name) - self.lock = RLock() - self.lambda_client = lambda_utils.LambdaCloudClient() - self.cached_nodes: Dict[str, Dict[str, Any]] = {} - self.metadata = lambda_utils.Metadata(_TAG_PATH_PREFIX, cluster_name) - self.ssh_key_path = os.path.expanduser(auth.PRIVATE_SSH_KEY_PATH) - - def _get_ssh_key_name(prefix: str) -> str: - public_key_path = os.path.expanduser(auth.PUBLIC_SSH_KEY_PATH) - with open(public_key_path, 'r') as f: - public_key = f.read() - name, exists = self.lambda_client.get_unique_ssh_key_name( - prefix, public_key) - if not exists: - raise lambda_utils.LambdaCloudError('SSH key not found') - return name - - ray_yaml_path = os.path.expanduser(_REMOTE_RAY_YAML) - self.on_head = (os.path.exists(ray_yaml_path) and - common_utils.read_yaml(ray_yaml_path)['cluster_name'] - == cluster_name) - - if self.on_head: - self.ssh_key_path = os.path.expanduser(_REMOTE_RAY_SSH_KEY) - ssh_key_name_path = os.path.expanduser(_REMOTE_SSH_KEY_NAME) - if os.path.exists(ssh_key_name_path): - with open(ssh_key_name_path, 'r') as f: - self.ssh_key_name = f.read() - else: - # At this point, `~/.ssh/sky-key.pub` contains the public - # key used to launch this cluster. Use it to determine - # ssh key name and store the name in _REMOTE_SSH_KEY_NAME. - # Note: this case only runs during cluster launch, so it is - # not possible for ~/.ssh/sky-key.pub to already be regenerated - # by the user. - self.ssh_key_name = _get_ssh_key_name('') - with open(ssh_key_name_path, 'w', encoding='utf-8') as f: - f.write(self.ssh_key_name) - else: - # On local - self.ssh_key_name = _get_ssh_key_name( - f'sky-key-{common_utils.get_user_hash()}') - - def _guess_and_add_missing_tags(self, vms: List[Dict[str, Any]]) -> None: - """Adds missing vms to local tag file and guesses their tags.""" - for node in vms: - if self.metadata.get(node['id']) is not None: - pass - elif node['name'] == f'{self.cluster_name}-head': - self.metadata.set( - node['id'], { - 'tags': { - TAG_RAY_CLUSTER_NAME: self.cluster_name, - TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE, - TAG_RAY_NODE_KIND: NODE_KIND_HEAD, - TAG_RAY_USER_NODE_TYPE: 'ray_head_default', - TAG_RAY_NODE_NAME: f'ray-{self.cluster_name}-head', - } - }) - elif node['name'] == f'{self.cluster_name}-worker': - self.metadata.set( - node['id'], { - 'tags': { - TAG_RAY_CLUSTER_NAME: self.cluster_name, - TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE, - TAG_RAY_NODE_KIND: NODE_KIND_WORKER, - TAG_RAY_USER_NODE_TYPE: 'ray_worker_default', - TAG_RAY_NODE_NAME: f'ray-{self.cluster_name}-worker', - } - }) - - def _list_instances_in_cluster(self) -> List[Dict[str, Any]]: - """List running instances in cluster.""" - vms = self.lambda_client.list_instances() - possible_names = [ - f'{self.cluster_name}-head', f'{self.cluster_name}-worker' - ] - return [node for node in vms if node.get('name') in possible_names] - - @synchronized - def _get_filtered_nodes(self, tag_filters: Dict[str, - str]) -> Dict[str, Any]: - - def _extract_metadata(vm: Dict[str, Any]) -> Dict[str, Any]: - metadata = {'id': vm['id'], 'status': vm['status'], 'tags': {}} - instance_info = self.metadata.get(vm['id']) - if instance_info is not None: - metadata['tags'] = instance_info['tags'] - metadata['external_ip'] = vm.get('ip') - return metadata - - def _match_tags(vm: Dict[str, Any]): - vm_info = self.metadata.get(vm['id']) - tags = {} if vm_info is None else vm_info['tags'] - for k, v in 
tag_filters.items(): - if tags.get(k) != v: - return False - return True - - def _get_internal_ip(node: Dict[str, Any]): - # TODO(ewzeng): cache internal ips in metadata file to reduce - # ssh overhead. - if node['external_ip'] is None or node['status'] != 'active': - node['internal_ip'] = None - return - runner = command_runner.SSHCommandRunner( - node=(node['external_ip'], 22), - ssh_user='ubuntu', - ssh_private_key=self.ssh_key_path) - rc, stdout, stderr = runner.run(_GET_INTERNAL_IP_CMD, - require_outputs=True, - stream_logs=False) - subprocess_utils.handle_returncode( - rc, - _GET_INTERNAL_IP_CMD, - 'Failed get obtain private IP from node', - stderr=stdout + stderr) - node['internal_ip'] = stdout.strip() - - vms = self._list_instances_in_cluster() - self.metadata.refresh([node['id'] for node in vms]) - self._guess_and_add_missing_tags(vms) - nodes = [_extract_metadata(vm) for vm in filter(_match_tags, vms)] - nodes = [ - node for node in nodes - if node['status'] not in ['terminating', 'terminated'] - ] - subprocess_utils.run_in_parallel(_get_internal_ip, nodes) - self.cached_nodes = {node['id']: node for node in nodes} - return self.cached_nodes - - def non_terminated_nodes(self, tag_filters: Dict[str, str]) -> List[str]: - """Return a list of node ids filtered by the specified tags dict. - - This list must not include terminated nodes. For performance reasons, - providers are allowed to cache the result of a call to - non_terminated_nodes() to serve single-node queries - (e.g. is_running(node_id)). This means that non_terminated_nodes() must - be called again to refresh results. - - Examples: - >>> provider.non_terminated_nodes({TAG_RAY_NODE_KIND: "worker"}) - ["node-1", "node-2"] - """ - nodes = self._get_filtered_nodes(tag_filters=tag_filters) - return [k for k, _ in nodes.items()] - - def is_running(self, node_id: str) -> bool: - """Return whether the specified node is running.""" - return self._get_cached_node(node_id=node_id) is not None - - def is_terminated(self, node_id: str) -> bool: - """Return whether the specified node is terminated.""" - return self._get_cached_node(node_id=node_id) is None - - def node_tags(self, node_id: str) -> Dict[str, str]: - """Returns the tags of the given node (string dict).""" - node = self._get_cached_node(node_id=node_id) - if node is None: - return {} - return node['tags'] - - def external_ip(self, node_id: str) -> Optional[str]: - """Returns the external ip of the given node.""" - node = self._get_cached_node(node_id=node_id) - if node is None: - return None - ip = node.get('external_ip') - with ux_utils.print_exception_no_traceback(): - if ip is None: - raise lambda_utils.LambdaCloudError( - 'A node ip address was not found. Either ' - '(1) Lambda Cloud has internally errored, or ' - '(2) the cluster is still booting. ' - 'You can manually terminate the cluster on the ' - 'Lambda Cloud console or (in case 2) wait for ' - 'booting to finish (~2 minutes).') - return ip - - def internal_ip(self, node_id: str) -> Optional[str]: - """Returns the internal ip (Ray ip) of the given node.""" - node = self._get_cached_node(node_id=node_id) - if node is None: - return None - ip = node.get('internal_ip') - with ux_utils.print_exception_no_traceback(): - if ip is None: - raise lambda_utils.LambdaCloudError( - 'A node ip address was not found. Either ' - '(1) Lambda Cloud has internally errored, or ' - '(2) the cluster is still booting. 
' - 'You can manually terminate the cluster on the ' - 'Lambda Cloud console or (in case 2) wait for ' - 'booting to finish (~2 minutes).') - return ip - - def create_node(self, node_config: Dict[str, Any], tags: Dict[str, str], - count: int) -> None: - """Creates a number of nodes within the namespace.""" - # Get tags - config_tags = node_config.get('tags', {}).copy() - config_tags.update(tags) - config_tags[TAG_RAY_CLUSTER_NAME] = self.cluster_name - - # Create nodes - instance_type = node_config['InstanceType'] - region = self.provider_config['region'] - - if config_tags[TAG_RAY_NODE_KIND] == NODE_KIND_HEAD: - name = f'{self.cluster_name}-head' - # Occasionally, the head node will continue running for a short - # period after termination. This can lead to the following bug: - # 1. Head node autodowns but continues running. - # 2. The next autodown event is triggered, which executes ray up. - # 3. Head node stops running. - # In this case, a new head node is created after the cluster has - # terminated. We avoid this with the following check: - if self.on_head: - raise lambda_utils.LambdaCloudError('Head already exists.') - else: - name = f'{self.cluster_name}-worker' - - # Lambda launch api only supports launching one node at a time, - # so we do a loop. Remove loop when launch api allows quantity > 1 - booting_list = [] - for _ in range(count): - vm_id = self.lambda_client.create_instances( - instance_type=instance_type, - region=region, - quantity=1, - name=name, - ssh_key_name=self.ssh_key_name)[0] - self.metadata.set(vm_id, {'tags': config_tags}) - booting_list.append(vm_id) - time.sleep(10) # Avoid api rate limits - - # Wait for nodes to finish booting - while True: - vms = self._list_instances_in_cluster() - for vm_id in booting_list.copy(): - for vm in vms: - if vm['id'] == vm_id and vm['status'] == 'active': - booting_list.remove(vm_id) - if len(booting_list) == 0: - return - time.sleep(10) - - @synchronized - def set_node_tags(self, node_id: str, tags: Dict[str, str]) -> None: - """Sets the tag values (string dict) for the specified node.""" - node = self._get_node(node_id) - assert node is not None, node_id - node['tags'].update(tags) - self.metadata.set(node_id, {'tags': node['tags']}) - - def terminate_node(self, node_id: str) -> None: - """Terminates the specified node.""" - self.lambda_client.remove_instances(node_id) - self.metadata.set(node_id, None) - - def _get_node(self, node_id: str) -> Optional[Dict[str, Any]]: - self._get_filtered_nodes({}) # Side effect: updates cache - return self.cached_nodes.get(node_id, None) - - def _get_cached_node(self, node_id: str) -> Optional[Dict[str, Any]]: - if node_id in self.cached_nodes: - return self.cached_nodes[node_id] - return self._get_node(node_id=node_id) diff --git a/sky/templates/gcp-ray.yml.j2 b/sky/templates/gcp-ray.yml.j2 index 5f06eef05c7..f3e6232d5d8 100644 --- a/sky/templates/gcp-ray.yml.j2 +++ b/sky/templates/gcp-ray.yml.j2 @@ -64,6 +64,9 @@ provider: # leakage. 
disable_launch_config_check: true use_managed_instance_group: {{ gcp_use_managed_instance_group }} +{%- if enable_gvnic %} + enable_gvnic: {{ enable_gvnic }} +{%- endif %} auth: ssh_user: gcpuser diff --git a/sky/templates/lambda-ray.yml.j2 b/sky/templates/lambda-ray.yml.j2 index 6b6d94cfb3c..5df3655c566 100644 --- a/sky/templates/lambda-ray.yml.j2 +++ b/sky/templates/lambda-ray.yml.j2 @@ -5,9 +5,29 @@ max_workers: {{num_nodes - 1}} upscaling_speed: {{num_nodes - 1}} idle_timeout_minutes: 60 +{%- if docker_image is not none %} +docker: + image: {{docker_image}} + container_name: {{docker_container_name}} + run_options: + - --ulimit nofile=1048576:1048576 + {%- for run_option in docker_run_options %} + - {{run_option}} + {%- endfor %} + {%- if docker_login_config is not none %} + docker_login_config: + username: |- + {{docker_login_config.username}} + password: |- + {{docker_login_config.password}} + server: |- + {{docker_login_config.server}} + {%- endif %} +{%- endif %} + provider: type: external - module: sky.skylet.providers.lambda_cloud.LambdaNodeProvider + module: sky.provision.lambda region: {{region}} # Disable launch config check for worker nodes as it can cause resource # leakage. @@ -25,14 +45,6 @@ available_node_types: resources: {} node_config: InstanceType: {{instance_type}} -{% if num_nodes > 1 %} - ray_worker_default: - min_workers: {{num_nodes - 1}} - max_workers: {{num_nodes - 1}} - resources: {} - node_config: - InstanceType: {{instance_type}} -{%- endif %} head_node_type: ray_head_default @@ -64,7 +76,10 @@ setup_commands: # Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase. # Line 'mkdir -p ..': disable host key check # Line 'python3 -c ..': patch the buggy ray files and enable `-o allow_other` option for `goofys` - - sudo systemctl stop unattended-upgrades || true; + - {%- for initial_setup_command in initial_setup_commands %} + {{ initial_setup_command }} + {%- endfor %} + sudo systemctl stop unattended-upgrades || true; sudo systemctl disable unattended-upgrades || true; sudo sed -i 's/Unattended-Upgrade "1"/Unattended-Upgrade "0"/g' /etc/apt/apt.conf.d/20auto-upgrades || true; sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1` || true; @@ -81,31 +96,5 @@ setup_commands: mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config; [ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf'); -# Command to start ray on the head node. You don't need to change this. -# NOTE: these are very performance-sensitive. Each new item opens/closes an SSH -# connection, which is expensive. Try your best to co-locate commands into fewer -# items! The same comment applies for worker_start_ray_commands. 
-# -# Increment the following for catching performance bugs easier: -# current num items (num SSH connections): 2 -head_start_ray_commands: - - {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --head --port={{ray_port}} --min-worker-port 11002 --dashboard-port={{ray_dashboard_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1; - which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; - {{dump_port_command}}; {{ray_head_wait_initialized_command}} - -{%- if num_nodes > 1 %} -worker_start_ray_commands: - - {{ sky_activate_python_env }}; {{ sky_ray_cmd }} stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 {{ sky_ray_cmd }} start --disable-usage-stats --address=$RAY_HEAD_IP:{{ray_port}} --min-worker-port 11002 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1; - which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; -{%- else %} -worker_start_ray_commands: [] -{%- endif %} - -head_node: {} -worker_nodes: {} - -# These fields are required for external cloud providers. -head_setup_commands: [] -worker_setup_commands: [] -cluster_synced_files: [] -file_mounts_sync_continuously: False +# Commands to start ray clusters are now placed in `sky.provision.instance_setup`. +# We do not need to list them here anymore. diff --git a/sky/templates/oci-ray.yml.j2 b/sky/templates/oci-ray.yml.j2 index 32bd6326ee2..64fa4e745c7 100644 --- a/sky/templates/oci-ray.yml.j2 +++ b/sky/templates/oci-ray.yml.j2 @@ -16,7 +16,11 @@ provider: disable_launch_config_check: true auth: +{% if os_type == "ubuntu" %} ssh_user: ubuntu +{% else %} + ssh_user: opc +{% endif %} ssh_private_key: {{ssh_private_key}} available_node_types: @@ -85,14 +89,20 @@ setup_commands: # Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase. 
# Line 'mkdir -p ..': disable host key check # Line 'python3 -c ..': patch the buggy ray files and enable `-o allow_other` option for `goofys` - - sudo systemctl stop unattended-upgrades || true; + - echo "setup commands runs at $(date)" > /tmp/provision.tmp.out || true; + {%- if os_type == "ubuntu" %} + sudo systemctl stop unattended-upgrades || true; sudo systemctl disable unattended-upgrades || true; sudo sed -i 's/Unattended-Upgrade "1"/Unattended-Upgrade "0"/g' /etc/apt/apt.conf.d/20auto-upgrades || true; sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1` || true; sudo pkill -9 apt-get; sudo pkill -9 dpkg; sudo dpkg --configure -a; - ([ `sudo lshw -class display | grep "NVIDIA Corporation" | wc -l` -gt 0 ]) && (sudo which nvidia-smi > /dev/null || ( sudo apt-get install nvidia-driver-530-open -y && sudo apt-get install nvidia-driver-525-server -y ) || true); + {%- else %} + sudo /usr/libexec/oci-growfs -y || true; + sudo systemctl stop firewalld || true; + sudo systemctl disable firewalld || true; + {%- endif %} mkdir -p ~/.ssh; touch ~/.ssh/config; {{ conda_installation_commands }} {{ ray_skypilot_installation_commands }} diff --git a/sky/utils/cli_utils/status_utils.py b/sky/utils/cli_utils/status_utils.py index 09172f24814..96f9b5e9946 100644 --- a/sky/utils/cli_utils/status_utils.py +++ b/sky/utils/cli_utils/status_utils.py @@ -1,19 +1,20 @@ """Utilities for sky status.""" -from typing import Any, Callable, Dict, List, Optional, Tuple +import typing +from typing import Any, Callable, Dict, List, Optional import click import colorama from sky import backends -from sky import clouds as sky_clouds -from sky import resources as resources_lib from sky import status_lib -from sky.provision.kubernetes import utils as kubernetes_utils from sky.skylet import constants from sky.utils import common_utils from sky.utils import log_utils from sky.utils import resources_utils +if typing.TYPE_CHECKING: + from sky.provision.kubernetes import utils as kubernetes_utils + COMMAND_TRUNC_LENGTH = 25 NUM_COST_REPORT_LINES = 5 @@ -303,19 +304,19 @@ def _get_estimated_cost_for_cost_report( return f'$ {cost:.2f}' -def show_kubernetes_cluster_status_table(clusters: List[Any], - show_all: bool) -> None: +def show_kubernetes_cluster_status_table( + clusters: List['kubernetes_utils.KubernetesSkyPilotClusterInfo'], + show_all: bool) -> None: """Compute cluster table values and display for Kubernetes clusters.""" status_columns = [ - StatusColumn('USER', lambda c: c['user']), - StatusColumn('NAME', lambda c: c['cluster_name']), - StatusColumn( - 'LAUNCHED', - lambda c: log_utils.readable_time_duration(c['launched_at'])), + StatusColumn('USER', lambda c: c.user), + StatusColumn('NAME', lambda c: c.cluster_name), + StatusColumn('LAUNCHED', + lambda c: log_utils.readable_time_duration(c.launched_at)), StatusColumn('RESOURCES', - lambda c: c['resources_str'], + lambda c: c.resources_str, trunc_length=70 if not show_all else 0), - StatusColumn('STATUS', lambda c: c['status'].colored_str()), + StatusColumn('STATUS', lambda c: c.status.colored_str()), # TODO(romilb): We should consider adding POD_NAME field here when --all # is passed to help users fetch pod name programmatically. 
] @@ -326,8 +327,7 @@ def show_kubernetes_cluster_status_table(clusters: List[Any], cluster_table = log_utils.create_table(columns) # Sort table by user, then by cluster name - sorted_clusters = sorted(clusters, - key=lambda c: (c['user'], c['cluster_name'])) + sorted_clusters = sorted(clusters, key=lambda c: (c.user, c.cluster_name)) for cluster in sorted_clusters: row = [] @@ -344,122 +344,3 @@ def show_kubernetes_cluster_status_table(clusters: List[Any], else: click.echo('No SkyPilot resources found in the ' 'active Kubernetes context.') - - -def process_skypilot_pods( - pods: List[Any], - context: Optional[str] = None -) -> Tuple[List[Dict[Any, Any]], Dict[str, Any], Dict[str, Any]]: - """Process SkyPilot pods on k8s to extract cluster and controller info. - - Args: - pods: List of Kubernetes pod objects. - context: Kubernetes context name, used to detect GPU label formatter. - - Returns: - A tuple containing: - - List of dictionaries with cluster information. - - Dictionary of job controller information. - - Dictionary of serve controller information. - - Each dictionary contains the following keys: - 'cluster_name_on_cloud': The cluster_name_on_cloud used by SkyPilot - 'cluster_name': The cluster name without the user hash - 'user': The user who created the cluster. Fetched from pod label - 'status': The cluster status (assumed UP if pod exists) - 'pods': List of pod objects in the cluster - 'launched_at': Timestamp of when the cluster was launched - 'resources': sky.Resources object for the cluster - """ - clusters: Dict[str, Dict] = {} - jobs_controllers: Dict[str, Dict] = {} - serve_controllers: Dict[str, Dict] = {} - - for pod in pods: - cluster_name_on_cloud = pod.metadata.labels.get('skypilot-cluster') - cluster_name = cluster_name_on_cloud.rsplit( - '-', 1 - )[0] # Remove the user hash to get cluster name (e.g., mycluster-2ea4) - - # Check if cluster name is name of a controller - # Can't use controller_utils.Controllers.from_name(cluster_name) - # because hash is different across users - if 'controller' in cluster_name_on_cloud: - start_time = pod.status.start_time.timestamp() - controller_info = { - 'cluster_name_on_cloud': cluster_name_on_cloud, - 'cluster_name': cluster_name, - 'user': pod.metadata.labels.get('skypilot-user'), - 'status': status_lib.ClusterStatus.UP, - # Assuming UP if pod exists - 'pods': [pod], - 'launched_at': start_time - } - if 'sky-jobs-controller' in cluster_name_on_cloud: - jobs_controllers[cluster_name_on_cloud] = controller_info - elif 'sky-serve-controller' in cluster_name_on_cloud: - serve_controllers[cluster_name_on_cloud] = controller_info - - if cluster_name_on_cloud not in clusters: - # Parse the start time for the cluster - start_time = pod.status.start_time - if start_time is not None: - start_time = pod.status.start_time.timestamp() - - # Parse resources - cpu_request = kubernetes_utils.parse_cpu_or_gpu_resource( - pod.spec.containers[0].resources.requests.get('cpu', '0')) - memory_request = kubernetes_utils.parse_memory_resource( - pod.spec.containers[0].resources.requests.get('memory', '0'), - unit='G') - gpu_count = kubernetes_utils.parse_cpu_or_gpu_resource( - pod.spec.containers[0].resources.requests.get( - 'nvidia.com/gpu', '0')) - if gpu_count > 0: - label_formatter, _ = ( - kubernetes_utils.detect_gpu_label_formatter(context)) - assert label_formatter is not None, ( - 'GPU label formatter cannot be None if there are pods ' - f'requesting GPUs: {pod.metadata.name}') - gpu_label = label_formatter.get_label_key() - # Get GPU name 
from pod node selector - if pod.spec.node_selector is not None: - gpu_name = label_formatter.get_accelerator_from_label_value( - pod.spec.node_selector.get(gpu_label)) - - resources = resources_lib.Resources( - cloud=sky_clouds.Kubernetes(), - cpus=int(cpu_request), - memory=int(memory_request), - accelerators=(f'{gpu_name}:{gpu_count}' - if gpu_count > 0 else None)) - if pod.status.phase == 'Pending': - # If pod is pending, do not show it in the status - continue - - clusters[cluster_name_on_cloud] = { - 'cluster_name_on_cloud': cluster_name_on_cloud, - 'cluster_name': cluster_name, - 'user': pod.metadata.labels.get('skypilot-user'), - 'status': status_lib.ClusterStatus.UP, - 'pods': [], - 'launched_at': start_time, - 'resources': resources, - } - else: - # Update start_time if this pod started earlier - pod_start_time = pod.status.start_time - if pod_start_time is not None: - pod_start_time = pod_start_time.timestamp() - if pod_start_time < clusters[cluster_name_on_cloud][ - 'launched_at']: - clusters[cluster_name_on_cloud][ - 'launched_at'] = pod_start_time - clusters[cluster_name_on_cloud]['pods'].append(pod) - # Update resources_str in clusters: - for cluster_name, cluster in clusters.items(): - resources = cluster['resources'] - num_pods = len(cluster['pods']) - resources_str = f'{num_pods}x {resources}' - cluster['resources_str'] = resources_str - return list(clusters.values()), jobs_controllers, serve_controllers diff --git a/sky/utils/command_runner.py b/sky/utils/command_runner.py index be6e8346e3d..bbe287d9f79 100644 --- a/sky/utils/command_runner.py +++ b/sky/utils/command_runner.py @@ -502,8 +502,10 @@ def close_cached_connection(self) -> None: if self.ssh_control_name is not None: control_path = _ssh_control_path(self.ssh_control_name) if control_path is not None: + # Suppress the `Exit request sent.` output for this command + # which would interrupt the CLI spinner. cmd = (f'ssh -O exit -S {control_path}/%C ' - f'{self.ssh_user}@{self.ip}') + f'{self.ssh_user}@{self.ip} > /dev/null 2>&1') logger.debug(f'Closing cached connection {control_path!r} with ' f'cmd: {cmd}') log_lib.run_with_log(cmd, diff --git a/sky/utils/schemas.py b/sky/utils/schemas.py index 6e752f73ebc..94a6ed690e1 100644 --- a/sky/utils/schemas.py +++ b/sky/utils/schemas.py @@ -755,6 +755,9 @@ def get_config_schema(): 'force_enable_external_ips': { 'type': 'boolean' }, + 'enable_gvnic': { + 'type': 'boolean' + }, **_LABELS_SCHEMA, **_NETWORK_CONFIG_SCHEMA, }, diff --git a/sky/utils/ux_utils.py b/sky/utils/ux_utils.py index f6699f355f8..2fffa8a9df9 100644 --- a/sky/utils/ux_utils.py +++ b/sky/utils/ux_utils.py @@ -121,11 +121,6 @@ def run(self, *args, **kwargs): raise -def starting_message(message: str) -> str: - """Gets the starting message for the given message.""" - return f'⚙︎ {message}' - - def log_path_hint(log_path: Union[str, 'pathlib.Path']) -> str: """Gets the log path hint for the given log path.""" log_path = str(log_path) @@ -135,21 +130,50 @@ def log_path_hint(log_path: Union[str, 'pathlib.Path']) -> str: return _LOG_PATH_HINT.format(log_path=log_path) +def starting_message(message: str) -> str: + """Gets the starting message for the given message.""" + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. 
+ return f'{colorama.Style.RESET_ALL}⚙︎ {message}' + + def finishing_message( message: str, log_path: Optional[Union[str, 'pathlib.Path']] = None) -> str: """Gets the finishing message for the given message.""" - success_prefix = (f'{colorama.Fore.GREEN}✓ {message}' - f'{colorama.Style.RESET_ALL}') + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. + success_prefix = (f'{colorama.Style.RESET_ALL}{colorama.Fore.GREEN}✓ ' + f'{message}{colorama.Style.RESET_ALL}') if log_path is None: return success_prefix path_hint = log_path_hint(log_path) return f'{success_prefix} {path_hint}' +def error_message(message: str, + log_path: Optional[Union[str, 'pathlib.Path']] = None) -> str: + """Gets the error message for the given message.""" + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. + error_prefix = (f'{colorama.Style.RESET_ALL}{colorama.Fore.RED}⨯' + f'{colorama.Style.RESET_ALL} {message}') + if log_path is None: + return error_prefix + path_hint = log_path_hint(log_path) + return f'{error_prefix} {path_hint}' + + def retry_message(message: str) -> str: """Gets the retry message for the given message.""" - return f'{colorama.Fore.YELLOW}↺{colorama.Style.RESET_ALL} {message}' + # We have to reset the color before the message, because sometimes if a + # previous spinner with dimmed color overflows in a narrow terminal, the + # color might be messed up. + return (f'{colorama.Style.RESET_ALL}{colorama.Fore.YELLOW}↺' + f'{colorama.Style.RESET_ALL} {message}') def spinner_message( diff --git a/tests/test_smoke.py b/tests/test_smoke.py index cb14b82eb99..8b1737db121 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -383,7 +383,7 @@ def test_aws_region(): f'sky exec {name} \'echo $SKYPILOT_CLUSTER_INFO | jq .region | grep us-east-2\'', f'sky logs {name} 2 --status', # Ensure the job succeeded. # A user program should not access SkyPilot runtime env python by default. - f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} || exit 1\'', + f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} && exit 1 || true\'', f'sky logs {name} 3 --status', # Ensure the job succeeded. ], f'sky down -y {name}', @@ -406,7 +406,7 @@ def test_gcp_region_and_service_account(): f'sky exec {name} \'echo $SKYPILOT_CLUSTER_INFO | jq .region | grep us-central1\'', f'sky logs {name} 3 --status', # Ensure the job succeeded. # A user program should not access SkyPilot runtime env python by default. - f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} || exit 1\'', + f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} && exit 1 || true\'', f'sky logs {name} 4 --status', # Ensure the job succeeded. ], f'sky down -y {name}', @@ -446,7 +446,7 @@ def test_azure_region(): f'sky exec {name} \'echo $SKYPILOT_CLUSTER_INFO | jq .zone | grep null\'', f'sky logs {name} 3 --status', # Ensure the job succeeded. # A user program should not access SkyPilot runtime env python by default. - f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} || exit 1\'', + f'sky exec {name} \'which python | grep {constants.SKY_REMOTE_PYTHON_ENV_NAME} && exit 1 || true\'', f'sky logs {name} 4 --status', # Ensure the job succeeded. 
], f'sky down -y {name}', @@ -864,14 +864,14 @@ def test_custom_default_conda_env(generic_cloud: str): f'sky launch -c {name} -y --cloud {generic_cloud} tests/test_yamls/test_custom_default_conda_env.yaml', f'sky status -r {name} | grep "UP"', f'sky logs {name} 1 --status', - f'sky logs {name} 1 --no-follow | grep -P "myenv\\s+\\*"', + f'sky logs {name} 1 --no-follow | grep -E "myenv\\s+\\*"', f'sky exec {name} tests/test_yamls/test_custom_default_conda_env.yaml', f'sky logs {name} 2 --status', f'sky autostop -y -i 0 {name}', 'sleep 60', f'sky status -r {name} | grep "STOPPED"', f'sky start -y {name}', - f'sky logs {name} 2 --no-follow | grep -P "myenv\\s+\\*"', + f'sky logs {name} 2 --no-follow | grep -E "myenv\\s+\\*"', f'sky exec {name} tests/test_yamls/test_custom_default_conda_env.yaml', f'sky logs {name} 3 --status', ], f'sky down -y {name}')
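A note on the flipped checks in the last hunks: `which python | grep <env-name> || exit 1` fails only when the SkyPilot runtime env is not on the user's PATH, which asserts the opposite of the stated intent (a user program should not pick up the runtime env Python by default); the `&& exit 1 || true` form asserts absence instead. A minimal sketch of the two patterns, with the illustrative name `skypilot-runtime-env` standing in for the real `SKY_REMOTE_PYTHON_ENV_NAME` value:

    # Fails (exit 1) only if the pattern IS found, i.e. "assert the runtime env is NOT used".
    which python | grep skypilot-runtime-env && exit 1 || true

    # Fails only if the pattern is NOT found, i.e. "assert the runtime env IS used".
    which python | grep skypilot-runtime-env || exit 1

Likewise, swapping `grep -P` for `grep -E` keeps the same `\s+` pattern; presumably `-E` is preferred because `-P` (PCRE) is not available in every grep build, while GNU grep still honors `\s` under `-E`.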