Merge remote-tracking branch 'origin/master' into support-fractional-a10
cblmemo committed Sep 3, 2024
2 parents 07e47d6 + f8c62cb commit 639c686
Showing 38 changed files with 832 additions and 107 deletions.
2 changes: 1 addition & 1 deletion docs/source/_static/custom.css
@@ -115,7 +115,7 @@ html[data-theme="dark"] {
padding: 2px 5px; /* Reduced padding for a more compact label */
margin-left: 6px; /* Space between the text and the label */

vertical-align: middle;
vertical-align: text-bottom;
line-height: 1; /* Adjust line height to ensure vertical alignment */
}

3 changes: 2 additions & 1 deletion docs/source/_static/custom.js
@@ -27,8 +27,9 @@ document.addEventListener('DOMContentLoaded', () => {
const newItems = [
{ selector: '.caption-text', text: 'SkyServe: Model Serving' },
{ selector: '.toctree-l1 > a', text: 'Managed Jobs' },
{ selector: '.toctree-l1 > a', text: 'Running on Kubernetes' },
{ selector: '.toctree-l1 > a', text: 'Llama-3.1 (Meta)' },
{ selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' },
{ selector: '.toctree-l1 > a', text: 'Reserved, Capacity Blocks, DWS' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
8 changes: 8 additions & 0 deletions docs/source/developers/index.rst
@@ -0,0 +1,8 @@
Developer Guides
=================

.. toctree::
:maxdepth: 1

../developers/CONTRIBUTING
Guide: Adding a New Cloud <https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?usp=sharing>
21 changes: 11 additions & 10 deletions docs/source/docs/index.rst
@@ -129,8 +129,8 @@ Read the research:

../getting-started/installation
../getting-started/quickstart
../getting-started/tutorial
../examples/interactive-development
../getting-started/tutorial


.. toctree::
@@ -141,8 +141,16 @@ Read the research:
../examples/managed-jobs
../reference/job-queue
../examples/auto-failover
../reference/kubernetes/index
../running-jobs/distributed-jobs
../running-jobs/many-jobs

.. toctree::
:hidden:
:maxdepth: 1
:caption: Reserved & Existing Clusters

../reservations/reservations
../reference/kubernetes/index

.. toctree::
:hidden:
@@ -184,14 +192,6 @@ Read the research:
SkyPilot vs. Other Systems <../reference/comparison>


.. toctree::
:hidden:
:maxdepth: 1
:caption: Developer Guides

../developers/CONTRIBUTING
Guide: Adding a New Cloud <https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?usp=sharing>

.. toctree::
:hidden:
:maxdepth: 1
@@ -210,4 +210,5 @@ Read the research:
../reference/cli
../reference/api
../reference/config
../developers/index

9 changes: 9 additions & 0 deletions docs/source/examples/docker-containers.rst
@@ -161,6 +161,15 @@ Any GPUs assigned to the task will be automatically mapped to your Docker contai

2. The container image must grant sudo permissions without requiring password authentication for the user. Having a root user is also acceptable.

.. note::

   Using a container with a customized entrypoint as the runtime environment is
   supported; the container's entrypoint will be overridden with :code:`/bin/bash`.
   Specific commands can then be executed in the :code:`setup` and :code:`run`
   sections of the task YAML file. However, this approach is not compatible with
   RunPod due to limitations in the RunPod API, so on RunPod make sure to choose
   a container whose default entrypoint is :code:`/bin/bash`.
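
As a sketch, a task using an image with a customized entrypoint might look like the following (the image name and commands are illustrative):

.. code-block:: yaml

   resources:
     # Hypothetical image with a custom ENTRYPOINT; SkyPilot overrides it with /bin/bash.
     image_id: docker:myuser/my-image-with-entrypoint:latest

   setup: |
     echo "Setup commands run inside the container."

   run: |
     echo "Run commands run inside the container."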

Private Registries
^^^^^^^^^^^^^^^^^^

2 changes: 1 addition & 1 deletion docs/source/getting-started/quickstart.rst
@@ -219,7 +219,7 @@ Congratulations! In this quickstart, you have launched a cluster, run a task, a

Next steps:

- Adapt :ref:`Tutorial: DNN Training <dnn-training>` to start running your own project on SkyPilot!
- Adapt :ref:`Tutorial: AI Training <ai-training>` to start running your own project on SkyPilot!
- See the :ref:`Task YAML reference <yaml-spec>`, :ref:`CLI reference <cli>`, and `more examples <https://github.com/skypilot-org/skypilot/tree/master/examples>`_
- To learn more, try out `SkyPilot Tutorials <https://github.com/skypilot-org/skypilot-tutorial>`_ in Jupyter notebooks

4 changes: 2 additions & 2 deletions docs/source/getting-started/tutorial.rst
@@ -1,6 +1,6 @@
.. _dnn-training:
.. _ai-training:

Tutorial: DNN Training
Tutorial: AI Training
======================
This example uses SkyPilot to train a Transformer-based language model from HuggingFace.

17 changes: 0 additions & 17 deletions docs/source/reference/faq.rst
@@ -213,20 +213,3 @@ To launch a VS Code tunnel using a SkyPilot task definition, you can use the fol
Note that you'll be prompted to authenticate with your GitHub account to launch a VS Code tunnel.

PyTorch 2.2.0 failed on SkyPilot clusters. What should I do?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The latest PyTorch release (2.2.0) has a version conflict with the default cuDNN version on SkyPilot clusters, which may raise a segmentation fault when you run the job.

To fix this, you can choose one of the following solutions:

1. Use an older version of PyTorch (e.g., 2.1.0) instead of 2.2.0, i.e. :code:`pip install "torch<2.2"`;
2. Remove cuDNN from the cluster's :code:`LD_LIBRARY_PATH` by adding the following to your task:

.. code-block:: yaml

   run: |
     export LD_LIBRARY_PATH=$(echo $LD_LIBRARY_PATH | sed 's|:/usr/local/cuda/lib64||g; s|/usr/local/cuda/lib64:||g; s|/usr/local/cuda/lib64||g')
     # Other commands using PyTorch 2.2.0
     ...
2 changes: 1 addition & 1 deletion docs/source/reference/job-queue.rst
@@ -160,7 +160,7 @@ SkyPilot's scheduler serves two goals:
2. **Minimizing resource idleness**: If a resource is idle, SkyPilot will schedule a
queued job that can utilize that resource.

We illustrate the scheduling behavior by revisiting :ref:`Tutorial: DNN Training <dnn-training>`.
We illustrate the scheduling behavior by revisiting :ref:`Tutorial: AI Training <ai-training>`.
In that tutorial, we have a task YAML that specifies these resource requirements:

.. code-block:: yaml
6 changes: 3 additions & 3 deletions docs/source/reference/kubernetes/index.rst
@@ -1,7 +1,7 @@
.. _kubernetes-overview:

Running on Kubernetes
=============================
Using Kubernetes
================

SkyPilot tasks can be run on your private on-prem or cloud Kubernetes clusters.
The Kubernetes cluster gets added to the list of "clouds" in SkyPilot and SkyPilot
@@ -116,4 +116,4 @@ Kubernetes support is under active development. Some features are in progress an
* Multi-node tasks - ✅ Available
* Custom images - ✅ Available
* Opening ports and exposing services - ✅ Available
* Multiple Kubernetes Clusters - 🚧 In progress
14 changes: 8 additions & 6 deletions docs/source/reference/yaml-spec.rst
@@ -113,12 +113,14 @@ Available fields:
disk_size: 256
# Disk tier to use for OS (optional).
# Could be one of 'low', 'medium', 'high' or 'best' (default: 'medium').
# Could be one of 'low', 'medium', 'high', 'ultra' or 'best' (default: 'medium').
# if 'best' is specified, use the best disk tier enabled.
# Rough performance estimate:
# low: 500 IOPS; read 20MB/s; write 40 MB/s
# medium: 3000 IOPS; read 220 MB/s; write 200 MB/s
# high: 6000 IOPS; 340 MB/s; write 250 MB/s
# low: 1000 IOPS; read 90 MB/s; write 90 MB/s
# medium: 3000 IOPS; read 220 MB/s; write 220 MB/s
# high: 6000 IOPS; read 400 MB/s; write 400 MB/s
# ultra: 60000 IOPS; read 4000 MB/s; write 3000 MB/s
# Measured by examples/perf/storage_rawperf.yaml
disk_tier: medium
# Ports to expose (optional).
@@ -335,8 +337,8 @@ Available fields:
.. _task-yaml-experimental:

Experimental
------------
Experimental Configurations
---------------------------

.. note::

208 changes: 208 additions & 0 deletions docs/source/reservations/reservations.rst
@@ -0,0 +1,208 @@

.. _reservation:

Reserved, Capacity Blocks, DWS
===================================


With the recent GPU shortage, reservations from cloud providers have become a common way to ensure GPU availability for a specific duration. These reservations can be short-term (e.g., 1-30 days) capacity guarantees, or long-term (e.g., 1-3 years) contracts.

This guide shows how to use SkyPilot to request resources from reservations and even combine them with on-demand/spot resources to fully
utilize the capacity in your cloud accounts.

.. image:: https://i.imgur.com/FA0BT0E.png
:width: 95%
:align: center


AWS Capacity Reservations & Capacity Blocks
--------------------------------------------

AWS **capacity reservations** and **capacity blocks** are ways to reserve a certain amount of compute capacity for a period of time. Capacity blocks are for high-end GPUs, such as A100s (P4d instances) and H100s (P5 instances), while capacity reservations are for all other instance types.
Instead of committing to a 1-3 year contract, you can get a capacity reservation for as short as 1 second, or a capacity block for as short as 1 day.


To request capacity reservations/blocks, see the official docs:

* `AWS Capacity Reservations <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html>`_
* `AWS Capacity Blocks <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html>`_

Once you have successfully created a reservation/block, you will get an ID of the reservation/block, such as ``cr-012345678``.

To use the reservation/block, you can specify two fields in ``~/.sky/config.yaml``:

* ``aws.prioritize_reservations``: whether to prioritize launching clusters from capacity reservations in any region/zone over on-demand/spot clusters. This is useful to fully utilize your reserved capacity created with ``Instance eligibility: open``.
* ``aws.specific_reservations``: a list of reservation IDs that can be used by SkyPilot. This is useful if you have multiple capacity reservations or blocks with ``Instance eligibility: targeted`` for different instance types in multiple regions/zones.


Example:

.. code-block:: yaml

   aws:
     prioritize_reservations: true
     specific_reservations:
       # 1x H100 capacity block in us-east-1
       - "cr-0123456789"
       # 2x A100 reservation in us-east-2
       - "cr-123456789a"
       # 2x A100 reservation in us-west-2
       - "cr-23456789ab"
       # 2x M5a.16xlarge reservation in us-east-1
       - "cr-3456789abc"

For more details on these fields, see :ref:`config-yaml`.

.. note::

   If any of these fields are specified, the SkyPilot optimizer may take around 30 seconds to retrieve the latest reservation/block status across all regions and zones of your AWS account.


.. _utilizing-reservations:

Utilizing Reservations
~~~~~~~~~~~~~~~~~~~~~~

With the configurations above, SkyPilot will prioritize using any available capacity in reservations/blocks (i.e., consider it as zero cost) whenever you launch a cluster/job.

Specifically, SkyPilot's behavior is as follows:

1. Query reservations/blocks across AWS regions and zones to find all available capacity. (If the task specifies specific regions or zones to use, only those are queried.)
2. For each zone, calculate its cost: any available reserved capacity is considered as zero cost, and if any on-demand/spot resource is needed to supplement the available reserved capacity to fully satisfy the request, their on-demand/spot price is included.
3. :ref:`Automatically failover <auto-failover>` through these zones in increasing per-zone cost order until the requested resources are provisioned.


For example, if you are launching a cluster with the following SkyPilot YAML:

.. code-block:: yaml

   resources:
     cloud: aws
     accelerators: A100:8

   num_nodes: 2

SkyPilot will utilize the capacity reservation/block as follows:

1. Query reservations/blocks ``cr-123456789a`` in ``us-east-2`` and ``cr-23456789ab`` in ``us-west-2``. Assume the results are:

   - 1 A100 instance of capacity is available in ``us-east-2``;
   - no capacity is available in ``us-west-2``.

2. SkyPilot calculates the pricing for all zones as described above. ``us-east-2`` zones come out cheaper than all other zones, because launching the 2 nodes there costs only 1 on-demand node (the other node is satisfied by the reserved capacity).
3. SkyPilot will thus try to provision the cluster in ``us-east-2`` (1 node from the reservation plus 1 on-demand node). On unavailability, SkyPilot will continue to :ref:`automatically failover <auto-failover>` to other clouds/regions/zones for normal on-demand/spot instances.


.. hint::

   If you have a capacity block with a starting time in the future, you can run ``sky jobs launch --region us-east-1 --gpus H100:8 task.yaml`` to let SkyPilot automatically wait until the starting time is reached. Namely, you don't have to wake up at 4:30am PDT to launch your job on a newly available capacity block.


GCP reservations
-----------------

GCP reservations are similar to AWS capacity reservations, where you can reserve a certain amount of compute capacity for any period of time.

To get a reservation, see the `GCP official docs <https://cloud.google.com/compute/docs/instances/reservations-single-project>`__.

Like AWS, you can specify two fields in ``~/.sky/config.yaml``:

* ``gcp.prioritize_reservations``: whether to prioritize launching clusters from reservations in any region/zone over on-demand/spot clusters. This is useful to fully utilize your `automatically consumed reservations <https://cloud.google.com/compute/docs/instances/reservations-consume#consuming_instances_from_any_matching_reservation>`__.
* ``gcp.specific_reservations``: a list of reservation IDs that can be used by SkyPilot. This is useful if you have multiple `specific reservations <https://cloud.google.com/compute/docs/instances/reservations-consume#consuming_instances_from_a_specific_reservation>`__ for different instance types in multiple regions/zones.

Example:

.. code-block:: yaml

   gcp:
     prioritize_reservations: true
     specific_reservations:
       - projects/my-project/reservations/my-reservation1
       - projects/my-project/reservations/my-reservation2

SkyPilot will utilize the reservations similarly to AWS reservations, as described in :ref:`utilizing-reservations`.


GCP Dynamic Workload Scheduler (DWS)
-------------------------------------

GCP `Dynamic Workload Scheduler (DWS) <https://cloud.google.com/blog/products/compute/introducing-dynamic-workload-scheduler>`__ is a resource management service that (1) receives a GPU capacity request, (2) automatically provisions the requested resources when they become available, and (3) keeps the resources running for a specified duration.

.. tip::

   It has been observed that using DWS can significantly increase the chance of getting a high-end GPU resource, such as A100s and H100s, compared to using on-demand or spot instances.


Using DWS for VMs
~~~~~~~~~~~~~~~~~

SkyPilot allows you to launch resources via DWS by specifying the ``gcp.managed_instance_group`` field in ``~/.sky/config.yaml``:

.. code-block:: yaml

   gcp:
     managed_instance_group:
       run_duration: 3600
       provision_timeout: 900

1. ``run_duration``: duration for a created instance to be kept alive (in seconds, required).
2. ``provision_timeout``: timeout for provisioning an instance with DWS (in seconds, optional). If the timeout is reached without requested resources being provisioned, SkyPilot will automatically :ref:`failover <auto-failover>` to other clouds/regions/zones to get the resources.

See :ref:`config-yaml` for more details.

In case you want to specify the DWS configuration for each job/cluster, you can also specify the configuration in the SkyPilot task YAML (see :ref:`here <task-yaml-experimental>`):

.. code-block:: yaml

   experimental:
     config_overrides:
       gcp:
         managed_instance_group:
           run_duration: 3600
           provision_timeout: 900

   resources:
     cloud: gcp
     accelerators: A100:8

   num_nodes: 4

Using DWS on GKE with Kueue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DWS is also supported on Google Kubernetes Engine (GKE) with Kueue. To enable DWS on GKE, you need to set up your GKE cluster with Kueue and DWS; see the `GCP official docs <https://cloud.google.com/kubernetes-engine/docs/how-to/provisioningrequest>`__.

To launch a SkyPilot cluster or job on GKE with DWS, you can specify the DWS configuration in the SkyPilot task YAML:

.. code-block:: yaml

   experimental:
     config_overrides:
       kubernetes:
         pod_config:
           metadata:
             annotations:
               provreq.kueue.x-k8s.io/maxRunDurationSeconds: "3600"
         provision_timeout: 900

   resources:
     cloud: kubernetes
     accelerators: A100:8
     labels:
       kueue.x-k8s.io/queue-name: dws-local-queue

1. ``kueue.x-k8s.io/queue-name``: name of the Kueue queue to submit your resource request to.
2. ``provreq.kueue.x-k8s.io/maxRunDurationSeconds``: maximum duration for a created instance to be kept alive (in seconds, required).
3. ``provision_timeout``: timeout for provisioning an instance with DWS (in seconds, optional). If the timeout is reached without getting the requested resources, SkyPilot will automatically :ref:`failover <auto-failover>` to other clouds/regions/zones to get the resources.
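
With the task YAML above saved as, say, ``task.yaml`` (the queue name ``dws-local-queue`` and the cluster name below are illustrative), the task is launched as usual:

.. code-block:: console

   $ sky launch -c my-dws-cluster task.yaml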

Long-term reservations
----------------------

Unlike short-term reservations above, long-term reservations are typically more than one month long and can be viewed as a type of *on-prem cluster*.

SkyPilot supports long-term reservations and on-premise clusters through Kubernetes, i.e., you can set up a Kubernetes cluster on top of your reserved resources and interact with them through SkyPilot.

See the simple steps to set up a Kubernetes cluster on existing machines in :ref:`kubernetes-overview`.
