[DigitalOcean] droplet integration #3832

Open
wants to merge 92 commits into
base: master
Changes from 83 commits
96c5e81
init digital ocean droplet integration
asaiacai Aug 14, 2024
fa8a6bb
abbreviate cloud name
asaiacai Aug 20, 2024
cc8384e
switch to pydo
asaiacai Aug 20, 2024
80b5941
adjust polling logic and mount block storage to instance
asaiacai Aug 31, 2024
49c411b
merge
asaiacai Aug 31, 2024
a741f11
filter by paginated
asaiacai Aug 31, 2024
8702acd
lint
asaiacai Sep 2, 2024
1de819c
sky launch, start, stop functional
asaiacai Sep 2, 2024
fafc71d
fix credential file mounts, autodown works now
asaiacai Sep 2, 2024
e50126c
set gpu droplet image
asaiacai Sep 3, 2024
8532fcf
cleanup
asaiacai Sep 3, 2024
13628ad
remove more tests
asaiacai Sep 3, 2024
34d1916
atomically destroy instance and block storage simulatenously
asaiacai Sep 3, 2024
5eab8f9
install docker
asaiacai Sep 3, 2024
c992161
disable spot test
asaiacai Sep 3, 2024
a868b1a
fix ip address bug for multinode
asaiacai Sep 6, 2024
d4f7794
lint
asaiacai Sep 6, 2024
30ead7b
patch ssh from job/serve controller
asaiacai Sep 6, 2024
6791a7d
switch to EA slugs
asaiacai Sep 6, 2024
af8e5e9
do adaptor
asaiacai Sep 10, 2024
3a31a0a
lint
asaiacai Sep 11, 2024
ce900ed
Update sky/clouds/do.py
asaiacai Sep 17, 2024
391fea1
Update sky/clouds/do.py
asaiacai Sep 17, 2024
1703b40
comment template
asaiacai Sep 17, 2024
66f0314
comment patch
asaiacai Sep 17, 2024
817f3b3
add h100 test case
asaiacai Sep 17, 2024
5d8368c
comment on instance name length
asaiacai Sep 17, 2024
74856df
Update sky/clouds/do.py
asaiacai Sep 18, 2024
cbbb36b
Update sky/clouds/service_catalog/do_catalog.py
asaiacai Sep 18, 2024
ee98000
comment on max node char len
asaiacai Sep 23, 2024
d6da5e8
comment on weird azure import
asaiacai Sep 23, 2024
79aac0a
comment acc price is included in instance price
asaiacai Sep 23, 2024
71c9f9a
fix return type
asaiacai Sep 23, 2024
4fc8fe8
switch with do_utils
asaiacai Sep 23, 2024
113d24d
remove broad except
asaiacai Sep 23, 2024
1e0f9ec
Update sky/provision/do/instance.py
asaiacai Sep 23, 2024
4ab385b
Update sky/provision/do/instance.py
asaiacai Sep 23, 2024
daa7446
remove azure
asaiacai Sep 23, 2024
0d71031
comment on non_terminated_only
asaiacai Sep 23, 2024
dd8c238
add open port debug message
asaiacai Sep 29, 2024
cf7947b
wrap start instance api
asaiacai Sep 29, 2024
56163c0
use f-string
asaiacai Sep 29, 2024
0d39425
wrap stop
asaiacai Sep 29, 2024
0f8a53b
wrap instance down
asaiacai Sep 29, 2024
2881508
assert credentials and check against all contexts
asaiacai Sep 29, 2024
ae76a80
assert client is None
asaiacai Sep 29, 2024
8056bc8
remove pending instances during instance restart
asaiacai Sep 29, 2024
9bdf9df
wrap rename
asaiacai Sep 29, 2024
6cccf6a
rename ssh key var
asaiacai Oct 4, 2024
901ed4e
fix tags
asaiacai Oct 4, 2024
7d57980
add tags for block device
asaiacai Oct 4, 2024
e8d1782
f strings for errors
asaiacai Oct 4, 2024
2e51c59
support image ids
asaiacai Oct 21, 2024
b5fe945
update do tests
Oct 24, 2024
6565fff
only store head instance id
Oct 24, 2024
c6a4583
Merge branch 'skypilot-org:master' into droplet
asaiacai Oct 25, 2024
fde2bc2
rename image slugs
Oct 25, 2024
baf5b48
Merge branch 'droplet' of https://github.com/asaiacai/skypilot into d…
Oct 25, 2024
ff87fe7
add digital ocean alias
Oct 25, 2024
c49c330
wait for docker to be available
Oct 25, 2024
c857fe9
Merge branch 'skypilot-org:master' into droplet
asaiacai Oct 25, 2024
40b2134
update requirements and tests
Oct 25, 2024
65bfc03
increase docker timeout
Oct 25, 2024
812f747
lint
Oct 26, 2024
031777a
Merge branch 'skypilot-org:master' into droplet
asaiacai Nov 4, 2024
cefe4a7
search in default context
asaiacai Nov 6, 2024
3e5bd6b
revert docker wait and use ai-ml base image
asaiacai Nov 6, 2024
fc38c26
match _check_docker_installed
asaiacai Nov 6, 2024
e1628b7
Merge branch 'master' into droplet
asaiacai Nov 7, 2024
b92a99b
comment cred file mount init
asaiacai Nov 12, 2024
f8b6d04
Update sky/clouds/do.py
asaiacai Nov 12, 2024
1065ab8
add egress facts
asaiacai Nov 12, 2024
99cdf7d
Update sky/clouds/do.py
asaiacai Nov 12, 2024
e7e237b
add name for TODO
asaiacai Nov 12, 2024
808d1bf
Update sky/templates/do-ray.yml.j2
asaiacai Nov 12, 2024
006b08e
add TODO for fetchers
asaiacai Nov 12, 2024
49470bb
Update sky/clouds/do.py
asaiacai Nov 12, 2024
d24c044
Merge branch 'skypilot-org:master' into droplet
asaiacai Nov 12, 2024
69c882e
use make_ray_custom_resources_str
asaiacai Nov 12, 2024
de931f0
parametrize pytest
asaiacai Nov 21, 2024
147eb95
raise error instead of assert and suggest auth path
asaiacai Nov 21, 2024
0d76083
fix error message spacing
asaiacai Nov 21, 2024
a92d846
err handle missing credentials to avoid interrupting test_job CI
asaiacai Nov 21, 2024
682e49c
Update sky/clouds/service_catalog/do_catalog.py
asaiacai Nov 27, 2024
a1587e0
Update sky/provision/do/utils.py
asaiacai Nov 27, 2024
3d63531
fix typo
asaiacai Nov 27, 2024
0925f45
have debug messages instead of throwing error for missing do errors f…
asaiacai Nov 27, 2024
21e4f4c
Merge branch 'skypilot-org:master' into droplet
asaiacai Nov 27, 2024
8658ffa
fix mypy
asaiacai Nov 27, 2024
2c7d530
fix missing do credential for CI
asaiacai Nov 27, 2024
af2f0dc
lint
asaiacai Nov 27, 2024
e247d0d
Merge branch 'master' into droplet
asaiacai Dec 10, 2024
1 change: 1 addition & 0 deletions docs/source/getting-started/installation.rst
@@ -59,6 +59,7 @@ Install SkyPilot using pip:
pip install "skypilot-nightly[runpod]"
pip install "skypilot-nightly[fluidstack]"
pip install "skypilot-nightly[paperspace]"
pip install "skypilot-nightly[do]"
pip install "skypilot-nightly[cudo]"
pip install "skypilot-nightly[ibm]"
pip install "skypilot-nightly[scp]"
20 changes: 20 additions & 0 deletions sky/adaptors/do.py
@@ -0,0 +1,20 @@
"""Digital Ocean cloud adaptors"""

# pylint: disable=import-outside-toplevel

from sky.adaptors import common

_IMPORT_ERROR_MESSAGE = ('Failed to import dependencies for DO. '
                         'Try pip install "skypilot[do]"')
pydo = common.LazyImport('pydo', import_error_message=_IMPORT_ERROR_MESSAGE)
azure = common.LazyImport('azure', import_error_message=_IMPORT_ERROR_MESSAGE)
_LAZY_MODULES = (pydo, azure)


# `pydo` inherits Azure exceptions. See:
# https://github.com/digitalocean/pydo/blob/7b01498d99eb0d3a772366b642e5fab3d6fc6aa2/examples/poc_droplets_volumes_sshkeys.py#L6
@common.load_lazy_modules(modules=_LAZY_MODULES)
def exceptions():
    """Azure exceptions."""
    from azure.core import exceptions as azure_exceptions
    return azure_exceptions
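
The adaptor above defers importing `pydo` and `azure` until they are first used, so that SkyPilot can be installed without the DO extras. A minimal sketch of that lazy-import pattern follows. `LazyModule` is a hypothetical stand-in written for illustration; it is not SkyPilot's `common.LazyImport`, whose actual implementation is not shown in this diff.

```python
import importlib
import types
from typing import Any, Optional


class LazyModule:
    """Hypothetical lazy-import wrapper (illustration only)."""

    def __init__(self,
                 module_name: str,
                 import_error_message: Optional[str] = None):
        self._module_name = module_name
        self._import_error_message = import_error_message
        self._module: Optional[types.ModuleType] = None

    def load(self) -> types.ModuleType:
        # Import on first use and cache the module object afterwards.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as e:
                if self._import_error_message is not None:
                    # Replace the default error with an install hint.
                    raise ImportError(self._import_error_message) from e
                raise
        return self._module

    def __getattr__(self, name: str) -> Any:
        # Only invoked for attributes that are not set on the wrapper itself,
        # so every module attribute access goes through load().
        return getattr(self.load(), name)


# Nothing is imported on this line; the import happens at `.dumps`.
lazy_json = LazyModule('json')
print(lazy_json.dumps({'a': 1}))
```

With this shape, a missing optional dependency only surfaces when the cloud is actually used, and the error can carry an install hint such as the `pip install "skypilot[do]"` message above.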
1 change: 1 addition & 0 deletions sky/backends/backend_utils.py
@@ -913,6 +913,7 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str):
clouds.Cudo,
clouds.Paperspace,
clouds.Azure,
clouds.DO,
)):
config = auth.configure_ssh_info(config)
elif isinstance(cloud, clouds.GCP):
1 change: 1 addition & 0 deletions sky/backends/cloud_vm_ray_backend.py
@@ -173,6 +173,7 @@ def _get_cluster_config_template(cloud):
clouds.SCP: 'scp-ray.yml.j2',
clouds.OCI: 'oci-ray.yml.j2',
clouds.Paperspace: 'paperspace-ray.yml.j2',
clouds.DO: 'do-ray.yml.j2',
clouds.RunPod: 'runpod-ray.yml.j2',
clouds.Kubernetes: 'kubernetes-ray.yml.j2',
clouds.Vsphere: 'vsphere-ray.yml.j2',
2 changes: 2 additions & 0 deletions sky/clouds/__init__.py
@@ -15,6 +15,7 @@
from sky.clouds.aws import AWS
from sky.clouds.azure import Azure
from sky.clouds.cudo import Cudo
from sky.clouds.do import DO
from sky.clouds.fluidstack import Fluidstack
from sky.clouds.gcp import GCP
from sky.clouds.ibm import IBM
@@ -34,6 +35,7 @@
'Cudo',
'GCP',
'Lambda',
'DO',
'Paperspace',
'SCP',
'RunPod',
300 changes: 300 additions & 0 deletions sky/clouds/do.py
@@ -0,0 +1,300 @@
""" Digital Ocean Cloud. """

import typing
from typing import Dict, Iterator, List, Optional, Tuple, Union

from sky import clouds
from sky import sky_logging
from sky.adaptors import do
from sky.clouds import service_catalog
from sky.provision.do import utils as do_utils
from sky.utils import resources_utils

if typing.TYPE_CHECKING:
    from sky import resources as resources_lib

_CREDENTIAL_FILE = 'config.yaml'

logger = sky_logging.init_logger(__name__)


@clouds.CLOUD_REGISTRY.register(aliases=['digitalocean'])
class DO(clouds.Cloud):
    """Digital Ocean Cloud"""

    _REPR = 'DO'
    _CLOUD_UNSUPPORTED_FEATURES = {
        clouds.CloudImplementationFeatures.CLONE_DISK_FROM_CLUSTER:
            'Migrating '
            f'disk is not supported in {_REPR}.',
        clouds.CloudImplementationFeatures.SPOT_INSTANCE:
            'Spot instances are '
            f'not supported in {_REPR}.',
        clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER:
            'Custom disk tiers '
            f'are not supported in {_REPR}.',
    }
    # DO maximum node name length defined as <= 255
    # https://docs.digitalocean.com/reference/api/api-reference/#operation/droplets_create
    # 255 - 8 = 247 characters since
    # our provisioner adds additional `-worker`.
    _MAX_CLUSTER_NAME_LEN_LIMIT = 247
    _regions: List[clouds.Region] = []

    # Using the latest SkyPilot provisioner API to provision and check status.
    PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT
    STATUS_VERSION = clouds.StatusVersion.SKYPILOT

    @classmethod
    def _unsupported_features_for_resources(
        cls, resources: 'resources_lib.Resources'
    ) -> Dict[clouds.CloudImplementationFeatures, str]:
        """The features not supported based on the resources provided.

        This method is used by check_features_are_supported() to check if the
        cloud implementation supports all the requested features.

        Returns:
            A dict of {feature: reason} for the features not supported by the
            cloud implementation.
        """
        del resources  # unused
        return cls._CLOUD_UNSUPPORTED_FEATURES

    @classmethod
    def _max_cluster_name_length(cls) -> Optional[int]:
        return cls._MAX_CLUSTER_NAME_LEN_LIMIT

    @classmethod
    def regions_with_offering(
        cls,
        instance_type: str,
        accelerators: Optional[Dict[str, int]],
        use_spot: bool,
        region: Optional[str],
        zone: Optional[str],
    ) -> List[clouds.Region]:
        assert zone is None, 'DO does not support zones.'
        del accelerators, zone  # unused
        if use_spot:
            return []
        regions = service_catalog.get_region_zones_for_instance_type(
            instance_type, use_spot, 'DO')
        if region is not None:
            regions = [r for r in regions if r.name == region]
        return regions

    @classmethod
    def get_vcpus_mem_from_instance_type(
        cls,
        instance_type: str,
    ) -> Tuple[Optional[float], Optional[float]]:
        return service_catalog.get_vcpus_mem_from_instance_type(instance_type,
                                                                clouds='DO')

    @classmethod
    def zones_provision_loop(
        cls,
        *,
        region: str,
        num_nodes: int,
        instance_type: str,
        accelerators: Optional[Dict[str, int]] = None,
        use_spot: bool = False,
    ) -> Iterator[None]:
        del num_nodes  # unused
        regions = cls.regions_with_offering(instance_type,
                                            accelerators,
                                            use_spot,
                                            region=region,
                                            zone=None)
        for r in regions:
            assert r.zones is None, r
            yield r.zones

    def instance_type_to_hourly_cost(
        self,
        instance_type: str,
        use_spot: bool,
        region: Optional[str] = None,
        zone: Optional[str] = None,
    ) -> float:
        return service_catalog.get_hourly_cost(
            instance_type,
            use_spot=use_spot,
            region=region,
            zone=zone,
            clouds='DO',
        )

    def accelerators_to_hourly_cost(
        self,
        accelerators: Dict[str, int],
        use_spot: bool,
        region: Optional[str] = None,
        zone: Optional[str] = None,
    ) -> float:
        """Returns the hourly cost of the accelerators, in dollars/hour."""
        # The accelerator price is included in the instance price.
        del accelerators, use_spot, region, zone  # unused
        return 0.0

    def get_egress_cost(self, num_gigabytes: float) -> float:
        # The first 500 GB per month are free; excess is $0.01/GB.
        # https://docs.digitalocean.com/platform/billing/bandwidth/
        # Egress is not modeled here, so the estimate is always 0.
        return 0.0

    @classmethod
    def get_default_instance_type(
        cls,
        cpus: Optional[str] = None,
        memory: Optional[str] = None,
        disk_tier: Optional[resources_utils.DiskTier] = None,
    ) -> Optional[str]:
        """Returns the default instance type for DO."""
        return service_catalog.get_default_instance_type(cpus=cpus,
                                                         memory=memory,
                                                         disk_tier=disk_tier,
                                                         clouds='DO')

    @classmethod
    def get_accelerators_from_instance_type(
            cls, instance_type: str) -> Optional[Dict[str, Union[int, float]]]:
        return service_catalog.get_accelerators_from_instance_type(
            instance_type, clouds='DO')

    @classmethod
    def get_zone_shell_cmd(cls) -> Optional[str]:
        return None

    def make_deploy_resources_variables(
            self,
            resources: 'resources_lib.Resources',
            cluster_name: resources_utils.ClusterName,
            region: 'clouds.Region',
            zones: Optional[List['clouds.Zone']],
            dryrun: bool = False) -> Dict[str, Optional[str]]:
        del zones, dryrun, cluster_name

        r = resources
        acc_dict = self.get_accelerators_from_instance_type(r.instance_type)
        custom_resources = resources_utils.make_ray_custom_resources_str(
            acc_dict)
        image_id = None
        if (resources.image_id is not None and
                resources.extract_docker_image() is None):
            if None in resources.image_id:
                image_id = resources.image_id[None]
            else:
                assert region.name in resources.image_id
                image_id = resources.image_id[region.name]
        resources_vars = {
            'instance_type': resources.instance_type,
            'custom_resources': custom_resources,
            'region': region.name,
        }
        if image_id is not None:
            resources_vars['image_id'] = image_id
        return resources_vars

    def _get_feasible_launchable_resources(
        self, resources: 'resources_lib.Resources'
    ) -> resources_utils.FeasibleResources:
        """Returns a list of feasible resources for the given resources."""
        if resources.use_spot:
            # TODO(asaiacai): Add hints to all return values in this method
            # to help users understand why the resources are not launchable.
            return resources_utils.FeasibleResources([], [], None)
        if resources.instance_type is not None:
            assert resources.is_launchable(), resources
            resources = resources.copy(accelerators=None)
            return resources_utils.FeasibleResources([resources], [], None)

        def _make(instance_list):
            resource_list = []
            for instance_type in instance_list:
                r = resources.copy(
                    cloud=DO(),
                    instance_type=instance_type,
                    accelerators=None,
                    cpus=None,
                )
                resource_list.append(r)
            return resource_list

        # Currently, handle a filter on accelerators only.
        accelerators = resources.accelerators
        if accelerators is None:
            # Return a default instance type
            default_instance_type = DO.get_default_instance_type(
                cpus=resources.cpus,
                memory=resources.memory,
                disk_tier=resources.disk_tier)
            return resources_utils.FeasibleResources(
                _make([default_instance_type]), [], None)

        assert len(accelerators) == 1, resources
        acc, acc_count = list(accelerators.items())[0]
        (instance_list, fuzzy_candidate_list) = (
            service_catalog.get_instance_type_for_accelerator(
                acc,
                acc_count,
                use_spot=resources.use_spot,
                cpus=resources.cpus,
                memory=resources.memory,
                region=resources.region,
                zone=resources.zone,
                clouds='DO',
            ))
        if instance_list is None:
            return resources_utils.FeasibleResources([], fuzzy_candidate_list,
                                                     None)
        return resources_utils.FeasibleResources(_make(instance_list),
                                                 fuzzy_candidate_list, None)

    @classmethod
    def check_credentials(cls) -> Tuple[bool, Optional[str]]:
        """Verify that the user has valid credentials for DO."""
        try:
            # Attempt an API request that lists droplets.
            do_utils.client().droplets.list()
        except do.exceptions().HttpResponseError as err:
            return False, str(err)

        return True, None

    def get_credential_file_mounts(self) -> Dict[str, str]:
        try:
            do_utils.client()  # to initialize `do_utils.CREDENTIALS_PATH`
            return {
                f'~/.config/doctl/{_CREDENTIAL_FILE}': do_utils.CREDENTIALS_PATH
            }
        except do_utils.DigitalOceanError as err:
            logger.debug(err)
            return {}

    @classmethod
    def get_current_user_identity(cls) -> Optional[List[str]]:
        # NOTE: used for very advanced SkyPilot functionality
        # Can implement later if desired
        return None

    @classmethod
    def get_image_size(cls, image_id: str, region: Optional[str]) -> float:
        del region
        try:
            response = do_utils.client().images.get(image_id=image_id)
            if not response:
                raise do_utils.DigitalOceanError(
                    f'No image_id `{image_id}` found')
            return response['image']['size_gigabytes']
        except do.exceptions().HttpResponseError as err:
            # Report `image_id`, not `response`: `response` is unbound if the
            # request itself raised.
            raise do_utils.DigitalOceanError(
                'HTTP error while retrieving size of '
                f'image_id {image_id}: {err.error.message}') from err

    def instance_type_exists(self, instance_type: str) -> bool:
        return service_catalog.instance_type_exists(instance_type, 'DO')

    def validate_region_zone(self, region: Optional[str], zone: Optional[str]):
        return service_catalog.validate_region_zone(region, zone, clouds='DO')
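
`get_egress_cost` in this file returns 0.0, while its comment cites a free allowance of 500 GB per month and $0.01 per GB beyond that. The sketch below shows how those two figures could be turned into an estimate. It is not part of this PR: the function name and the `already_used_gigabytes` parameter are hypothetical, and it assumes the caller knows how much of the month's allowance has already been consumed, which the `get_egress_cost(num_gigabytes)` signature does not provide.

```python
# Figures taken from the comment in DO.get_egress_cost; actual billing may
# differ by account and plan.
FREE_GIGABYTES_PER_MONTH = 500.0
COST_PER_EXCESS_GIGABYTE = 0.01  # dollars


def estimate_egress_cost(num_gigabytes: float,
                         already_used_gigabytes: float = 0.0) -> float:
    """Hypothetical estimate of the cost of one transfer, in dollars."""
    if num_gigabytes < 0 or already_used_gigabytes < 0:
        raise ValueError('Gigabyte amounts must be non-negative.')
    total = already_used_gigabytes + num_gigabytes
    # Only the part of this transfer that lands above the free allowance
    # is billed.
    billable_before = max(0.0, already_used_gigabytes - FREE_GIGABYTES_PER_MONTH)
    billable_after = max(0.0, total - FREE_GIGABYTES_PER_MONTH)
    return (billable_after - billable_before) * COST_PER_EXCESS_GIGABYTE


print(estimate_egress_cost(100))   # fully inside the free allowance
print(estimate_egress_cost(600))   # 100 GB above the allowance
print(estimate_egress_cost(50, already_used_gigabytes=480))  # 30 GB above
```

Returning 0.0, as the PR does, amounts to treating every transfer as falling inside the free allowance.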
2 changes: 1 addition & 1 deletion sky/clouds/service_catalog/constants.py
@@ -4,4 +4,4 @@
CATALOG_DIR = '~/.sky/catalogs'
ALL_CLOUDS = ('aws', 'azure', 'gcp', 'ibm', 'lambda', 'scp', 'oci',
'kubernetes', 'runpod', 'vsphere', 'cudo', 'fluidstack',
- 'paperspace')
+ 'paperspace', 'do')