[Docs] Deployment on existing infra #3926

Merged
merged 24 commits into from
Sep 27, 2024
Changes from 21 commits
1 change: 1 addition & 0 deletions docs/source/_static/custom.js
@@ -32,6 +32,7 @@ document.addEventListener('DOMContentLoaded', () => {
{ selector: '.toctree-l1 > a', text: 'Reserved, Capacity Blocks, DWS' },
{ selector: '.toctree-l1 > a', text: 'Llama 3.2 (Meta)' },
{ selector: '.toctree-l1 > a', text: 'Admin Policy Enforcement' },
{ selector: '.toctree-l1 > a', text: 'Using Existing Machines' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
1 change: 1 addition & 0 deletions docs/source/docs/index.rst
@@ -149,6 +149,7 @@ Read the research:
:caption: Reserved & Existing Clusters

../reservations/reservations
Using Existing Machines <../reservations/existing-machines>
../reference/kubernetes/index

.. toctree::
Member

  • "dev pods"?
  • How about showing one arrow (step)? Can move "User provides" to left, and skypilot logo + text on right. This feels more like it's a tool, rather than a vendor.

Collaborator Author

  • Good point, node conflicts with physical nodes. Changed to pods.
  • I think putting a single arrow de-emphasizes the need for having something like SkyPilot and makes it seem like the users are having to deal with low-level infra themselves.

5 changes: 4 additions & 1 deletion docs/source/reference/kubernetes/kubernetes-deployment.rst
@@ -203,7 +203,10 @@ Deploying on Amazon EKS
Deploying on on-prem clusters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools,
If you have a list of IP addresses and the SSH credentials for your on-prem cluster, you can follow our
:ref:`Using Existing Machines <existing-machines>` guide to set up SkyPilot on your on-prem cluster.

Alternatively, you can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools,
such as `kubeadm <https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/>`_,
`k3s <https://docs.k3s.io/quick-start>`_ or
`Rancher <https://ranchermanager.docs.rancher.com/v2.5/pages-for-subheaders/kubernetes-clusters-in-rancher-setup>`_.
153 changes: 153 additions & 0 deletions docs/source/reservations/existing-machines.rst
@@ -0,0 +1,153 @@
.. _existing-machines:

Deploy SkyPilot on existing machines
====================================

This guide will help you deploy SkyPilot on your existing machines, whether they are on-premises or reserved instances on a cloud provider.

**Given a list of IP addresses and SSH credentials,**
SkyPilot will install necessary dependencies on the remote machines and configure itself to run jobs and services on the cluster.

..
Figure v1 (for deploy.sh): https://docs.google.com/drawings/d/1Jp1tTu1kxF-bIrS6LRMqoJ1dnxlFvn-iobVsXElXfAg/edit?usp=sharing
Figure v2: https://docs.google.com/drawings/d/1hMvOe1HX0ESoUbCvUowla2zO5YBacsdruo0dFqML9vo/edit?usp=sharing
Figure v2 Dark: https://docs.google.com/drawings/d/1AEdf9i3SO6MVnD7d-hwRumIfVndzNDqQmrFvRwwVEiU/edit

.. figure:: ../images/sky-existing-infra-workflow-light.png
:width: 85%
:align: center
:alt: Deploying SkyPilot on existing machines
:class: no-scaled-link, only-light

Given a list of IP addresses and SSH keys, ``sky local up`` will install necessary dependencies on the remote machines and configure SkyPilot to run jobs and services on the cluster.

.. figure:: ../images/sky-existing-infra-workflow-dark.png
:width: 85%
:align: center
:alt: Deploying SkyPilot on existing machines
:class: no-scaled-link, only-dark

Given a list of IP addresses and SSH keys, ``sky local up`` will install necessary dependencies on the remote machines and configure SkyPilot to run jobs and services on the cluster.


.. note::

Behind the scenes, SkyPilot deploys a lightweight Kubernetes cluster on the remote machines using `k3s <https://k3s.io/>`_.

**Note that no Kubernetes knowledge is required to follow this guide.** SkyPilot abstracts away the complexity of Kubernetes and provides a simple interface to run your jobs and services.

Prerequisites
-------------

**Local machine (typically your laptop):**

* `kubectl <https://kubernetes.io/docs/tasks/tools/install-kubectl/>`_
* `SkyPilot <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`_

**Remote machines (your cluster, optionally with GPUs):**

* Debian-based OS (tested on Debian 11)
* SSH access from local machine to all remote machines with key-based authentication and passwordless sudo
* All machines must use the same SSH key and username
* All machines must have network access to each other
* Port 6443 must be accessible on at least one node from your local machine

Deploying SkyPilot
------------------

1. Create a file ``ips.txt`` with the IP addresses of your machines with one IP per line.
The first node will be used as the head node - this node must have port 6443 accessible from your local machine.

Here is an example ``ips.txt`` file:

.. code-block:: text

192.168.1.1
192.168.1.2
192.168.1.3

In this example, the first node (``192.168.1.1``) has port 6443 open and will be used as the head node.
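The file layout described above (one IP per line, first entry is the head node) can be sanity-checked before deployment. The following is a minimal sketch, not part of SkyPilot: ``parse_ip_lines`` is a hypothetical helper, and it assumes that every non-blank line is a literal IPv4 or IPv6 address. Whether the actual deployment script tolerates blank lines or hostnames is not stated in this guide.

```python
import ipaddress
from typing import List, Tuple


def parse_ip_lines(text: str) -> Tuple[str, List[str]]:
    """Split ips.txt-style text into (head_ip, worker_ips).

    Hypothetical helper for illustration. Blank lines are skipped; every
    other line must parse as an IP address, otherwise ValueError is raised.
    """
    ips = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        entry = raw.strip()
        if not entry:
            continue
        try:
            # Raises ValueError for anything that is not an IPv4/IPv6 address.
            ipaddress.ip_address(entry)
        except ValueError as e:
            raise ValueError(
                f'Line {line_no}: {entry!r} is not an IP address') from e
        ips.append(entry)
    if not ips:
        raise ValueError('No IP addresses found')
    # The first address is treated as the head node, as in the guide.
    return ips[0], ips[1:]
```

Calling ``parse_ip_lines(open('ips.txt').read())`` on the example file would treat ``192.168.1.1`` as the head node and the other two addresses as workers.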

2. Run ``sky local up`` and pass the ``ips.txt`` file, SSH username, and SSH key as arguments:

.. code-block:: bash

IP_FILE=ips.txt
SSH_USER=username
SSH_KEY=path/to/ssh/key
sky local up --ips $IP_FILE --ssh-user $SSH_USER --ssh-key-path $SSH_KEY
Member

Is it --ip or --ips (L86)? Latter is more intuitive. (Or --ipfile like DeepSpeed's --hostfile.)

Collaborator Author

ah good catch - it is indeed --ips, this typo was a remnant from the old deploy.sh.


SkyPilot will deploy a Kubernetes cluster on the remote machines, set up GPU support, configure Kubernetes credentials on your local machine, and set up SkyPilot to operate with the new cluster.

Example output of ``sky local up``:

.. code-block:: console

$ sky local up --ips ips.txt --ssh-user gcpuser --ssh-key-path ~/.ssh/id_rsa
Found existing kube config. It will be backed up to ~/.kube/config.bak.
To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-09-23-18-53-14-165534/local_up.log
✔ K3s successfully deployed on head node.
✔ K3s successfully deployed on worker node.
✔ kubectl configured for the remote cluster.
✔ Remote k3s is running.
✔ Nvidia GPU Operator installed successfully.
Cluster deployment done. You can now run tasks on this cluster.
E.g., run a task with: sky launch --cloud kubernetes -- echo hello world.
🎉 Remote cluster deployed successfully.


3. To verify that the cluster is running, run:

.. code-block:: bash

sky check kubernetes

You can now use SkyPilot to launch your :ref:`development clusters <dev-cluster>` and :ref:`training jobs <ai-training>` on your own infrastructure.

.. code-block:: console

$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
L4 1, 2, 4 12 12
H100 1, 2, 4, 8 16 16

Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
my-cluster-0 L4 4 4
my-cluster-1 L4 4 4
my-cluster-2 L4 2 2
my-cluster-3 L4 2 2
my-cluster-4 H100 8 8
my-cluster-5 H100 8 8

$ sky launch --cloud kubernetes --gpus H100:1 -- nvidia-smi

.. tip::

You can also use ``kubectl`` to interact with the cluster and perform administrative operations on it.

What happens behind the scenes?
-------------------------------

When you run ``sky local up``, SkyPilot runs the following operations:

1. Install and run `k3s <https://k3s.io/>`_ Kubernetes distribution as a systemd service on the remote machines.
2. [If GPUs are present] Install `Nvidia GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html>`_ on the newly provisioned k3s cluster. Note that this step does not modify your local NVIDIA driver or CUDA installation, and only runs inside the cluster.
3. Expose the Kubernetes API server on the head node over port 6443. API calls on this port are secured with a key pair generated by the cluster.
4. Configure ``kubectl`` on your local machine to connect to the remote cluster.
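Step 3 above implies that, once deployment has finished, the head node should accept TCP connections on port 6443 from your local machine. A minimal sketch for checking this follows. ``is_port_open`` is a hypothetical helper and not part of SkyPilot; it only tests TCP reachability and does not validate the Kubernetes credentials. Before deployment nothing is expected to listen on this port, so a failed check at that point does not by itself indicate a firewall problem.

```python
import socket


def is_port_open(host: str, port: int = 6443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the example ``ips.txt`` in this guide, ``is_port_open('192.168.1.1')`` would test the head node.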


Cleanup
-------

To clean up all state created by SkyPilot on your machines, use the ``--cleanup`` flag:

.. code-block:: bash

IP_FILE=ips.txt
SSH_USER=username
SSH_KEY=path/to/ssh/key
sky local up --ips $IP_FILE --ssh-user $SSH_USER --ssh-key-path $SSH_KEY --cleanup

This will stop all Kubernetes services on the remote machines.
2 changes: 1 addition & 1 deletion docs/source/reservations/reservations.rst
@@ -204,5 +204,5 @@ Unlike short-term reservations above, long-term reservations are typically more

SkyPilot supports long-term reservations and on-premise clusters through Kubernetes, i.e., you can set up a Kubernetes cluster on top of your reserved resources and interact with them through SkyPilot.

See the simple steps to set up a Kubernetes cluster on existing machines in :ref:`kubernetes-overview`.
See the simple steps to set up a Kubernetes cluster on existing machines in :ref:`Using Existing Machines <existing-machines>` or :ref:`bring your existing Kubernetes cluster <kubernetes-overview>`.

127 changes: 118 additions & 9 deletions sky/cli.py
@@ -5072,15 +5072,7 @@ def local():
pass


@click.option('--gpus/--no-gpus',
default=True,
is_flag=True,
help='Launch cluster without GPU support even '
'if GPUs are detected on the host.')
@local.command('up', cls=_DocumentedCodeCommand)
@usage_lib.entrypoint
def local_up(gpus: bool):
"""Creates a local cluster."""
def deploy_local_cluster(gpus: bool):
cluster_created = False

# Check if GPUs are available on the host
@@ -5206,6 +5198,123 @@ def local_up(gpus: bool):
f'{gpu_hint}')


def deploy_remote_cluster(ip_file, ssh_user, ssh_key_path, cleanup):
    success = False
    path_to_package = os.path.dirname(os.path.dirname(__file__))
    up_script_path = os.path.join(path_to_package, 'sky/utils/kubernetes',
                                  'deploy_remote_cluster.sh')
    # Get directory of script and run it from there
    cwd = os.path.dirname(os.path.abspath(up_script_path))

    deploy_command = f'{up_script_path} {ip_file} {ssh_user} {ssh_key_path}'
    if cleanup:
        deploy_command += ' --cleanup'

    # Convert the command to a format suitable for subprocess
    deploy_command = shlex.split(deploy_command)

    # Setup logging paths
    run_timestamp = backend_utils.get_run_timestamp()
    log_path = os.path.join(constants.SKY_LOGS_DIRECTORY, run_timestamp,
                            'local_up.log')
    tail_cmd = 'tail -n100 -f ' + log_path

    # Check if ~/.kube/config exists:
    if os.path.exists(os.path.expanduser('~/.kube/config')):
        click.echo('Found existing kube config. '
                   'It will be backed up to ~/.kube/config.bak.')
    style = colorama.Style
    click.echo('To view detailed progress: '
               f'{style.BRIGHT}{tail_cmd}{style.RESET_ALL}')
    if cleanup:
        msg_str = 'Cleaning up remote cluster...'
    else:
        msg_str = 'Deploying remote cluster...'
    with rich_utils.safe_status(f'[bold cyan]{msg_str}'):
        returncode, _, stderr = log_lib.run_with_log(
            cmd=deploy_command,
            log_path=log_path,
            require_outputs=True,
            stream_logs=False,
            line_processor=log_utils.SkyRemoteUpLineProcessor(),
            cwd=cwd)
    if returncode == 0:
        success = True
    else:
        with ux_utils.print_exception_no_traceback():
            raise RuntimeError(
                'Failed to deploy remote cluster. '
                f'Full log: {log_path}'
                f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}')

    if success:
        if cleanup:
            click.echo(f'{colorama.Fore.GREEN}'
                       '🎉 Remote cluster cleaned up successfully.'
                       f'{style.RESET_ALL}')
        else:
            click.echo('Cluster deployment done. You can now run tasks on '
                       'this cluster.\nE.g., run a task with: '
                       'sky launch --cloud kubernetes -- echo hello world.'
                       f'\n{colorama.Fore.GREEN}🎉 Remote cluster deployed '
                       f'successfully. {style.RESET_ALL}')
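One property of the command construction in ``deploy_remote_cluster``: the arguments are interpolated into a single string and then passed through ``shlex.split``, so a path that contains whitespace would be split into several arguments. The following is a minimal sketch of an alternative that builds the argument list directly. ``build_deploy_command`` is a hypothetical name used only for illustration and does not appear in this PR.

```python
import shlex
from typing import List


def build_deploy_command(script_path: str,
                         ip_file: str,
                         ssh_user: str,
                         ssh_key_path: str,
                         cleanup: bool = False) -> List[str]:
    """Build the argv list directly so each argument stays one token."""
    command = [script_path, ip_file, ssh_user, ssh_key_path]
    if cleanup:
        command.append('--cleanup')
    return command


# Contrast: splitting an interpolated string breaks a path with a space.
naive = shlex.split('deploy.sh /tmp/my ips.txt user key')
direct = build_deploy_command('deploy.sh', '/tmp/my ips.txt', 'user', 'key')
```

Here ``naive`` contains five tokens because ``/tmp/my ips.txt`` is split in two, while ``direct`` keeps the path as a single argument.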


@click.option('--gpus/--no-gpus',
              default=True,
              is_flag=True,
              help='Launch cluster without GPU support even '
              'if GPUs are detected on the host.')
@click.option(
    '--ips',
    type=str,
    required=False,
    help='Path to the file containing IP addresses of remote machines.')
@click.option('--ssh-user',
              type=str,
              required=False,
              help='SSH username for accessing remote machines.')
@click.option('--ssh-key-path',
              type=str,
              required=False,
              help='Path to the SSH private key.')
@click.option('--cleanup',
              is_flag=True,
              help='Clean up the remote cluster instead of deploying it.')
@local.command('up', cls=_DocumentedCodeCommand)
@usage_lib.entrypoint
def local_up(gpus: bool, ips: str, ssh_user: str, ssh_key_path: str,
             cleanup: bool):
    """Creates a local or remote cluster."""

    def _validate_args(ips, ssh_user, ssh_key_path, cleanup):
        # If any of --ips, --ssh-user, or --ssh-key-path is specified,
        # all must be specified
        if bool(ips) or bool(ssh_user) or bool(ssh_key_path):
            if not (ips and ssh_user and ssh_key_path):
                raise click.BadParameter(
                    'All --ips, --ssh-user, and --ssh-key-path '
                    'must be specified together.')

        # --cleanup can only be used if --ips, --ssh-user and --ssh-key-path
        # are all provided
        if cleanup and not (ips and ssh_user and ssh_key_path):
            raise click.BadParameter('--cleanup can only be used with '
                                     '--ips, --ssh-user and --ssh-key-path.')

    _validate_args(ips, ssh_user, ssh_key_path, cleanup)

    # If remote deployment arguments are specified, run remote up script
    if ips and ssh_user and ssh_key_path:
        # Convert ips and ssh_key_path to absolute paths
        ips = os.path.abspath(ips)
        ssh_key_path = os.path.abspath(ssh_key_path)
        deploy_remote_cluster(ips, ssh_user, ssh_key_path, cleanup)
    else:
        # Run local deployment (kind) if no remote args are specified
        deploy_local_cluster(gpus)
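The argument rules enforced by ``local_up`` can be restated as a small pure function, which makes them easy to check in isolation. This is a sketch for illustration only: ``resolve_deploy_mode`` is a hypothetical name, and it raises ``ValueError`` where the PR raises ``click.BadParameter`` so that it runs without click installed.

```python
from typing import Optional


def resolve_deploy_mode(ips: Optional[str], ssh_user: Optional[str],
                        ssh_key_path: Optional[str], cleanup: bool) -> str:
    """Return 'remote' or 'local'; raise ValueError on invalid combinations."""
    remote_args = (ips, ssh_user, ssh_key_path)
    # The three remote arguments must be given together or not at all.
    if any(remote_args) and not all(remote_args):
        raise ValueError('All --ips, --ssh-user, and --ssh-key-path '
                         'must be specified together.')
    # --cleanup is only meaningful for a remote deployment.
    if cleanup and not all(remote_args):
        raise ValueError('--cleanup can only be used with '
                         '--ips, --ssh-user and --ssh-key-path.')
    return 'remote' if all(remote_args) else 'local'
```

With no remote arguments the function selects the local path, with all three it selects the remote path, and any partial combination or a bare ``--cleanup`` is rejected.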


@local.command('down', cls=_DocumentedCodeCommand)
@usage_lib.entrypoint
def local_down():