Skip to content

Commit

Permalink
[perf] optimizations for sky jobs launch (#4341)
Browse files Browse the repository at this point in the history
* cache AWS get_user_identities

With SSO enabled (and maybe without?) this takes about a second. We already use
an lru_cache for Azure, do the same here.

* skip optimization for sky jobs launch --yes

The only reason we call optimize for jobs_launch is to give a preview of the
resources we expect to use, and give the user an opportunity to back out if it's
not what they expect. If you use --yes or -y, you don't have a chance to back
out and you're probably running from a script, where you don't care.
Optimization can take ~2 seconds, so just skip it.

* update logging

* address PR comments
  • Loading branch information
cg505 authored Nov 15, 2024
1 parent 334b268 commit c8ea12b
Show file tree
Hide file tree
Showing 3 changed files with 22 additions and 4 deletions.
19 changes: 15 additions & 4 deletions sky/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -3699,13 +3699,24 @@ def jobs_launch(
dag_utils.maybe_infer_and_fill_dag_and_task_names(dag)
dag_utils.fill_default_config_in_dag_for_job_launch(dag)

click.secho(f'Managed job {dag.name!r} will be launched on (estimated):',
fg='cyan')
dag, _ = admin_policy_utils.apply(
dag, use_mutated_config_in_current_request=False)
dag = sky.optimize(dag)

if not yes:
if yes:
# Skip resource preview if -y is set, since we are probably running in
# a script and the user won't have a chance to review it anyway.
# This can save a couple of seconds.
click.secho(
f'Resources for managed job {dag.name!r} will be computed on the '
'managed jobs controller, since --yes is set.',
fg='cyan')

else:
click.secho(
f'Managed job {dag.name!r} will be launched on (estimated):',
fg='cyan')
dag = sky.optimize(dag)

prompt = f'Launching a managed job {dag.name!r}. Proceed?'
if prompt is not None:
click.confirm(prompt, default=True, abort=True, show_default=True)
Expand Down
1 change: 1 addition & 0 deletions sky/clouds/aws.py
Original file line number Diff line number Diff line change
Expand Up @@ -663,6 +663,7 @@ def _is_access_key_of_type(type_str: str) -> bool:
return AWSIdentityType.SHARED_CREDENTIALS_FILE

@classmethod
@functools.lru_cache(maxsize=1) # Cache since getting identity is slow.
def get_user_identities(cls) -> Optional[List[List[str]]]:
"""Returns a [UserId, Account] list that uniquely identifies the user.
Expand Down
6 changes: 6 additions & 0 deletions sky/execution.py
Original file line number Diff line number Diff line change
Expand Up @@ -267,6 +267,12 @@ def _execute(
# no-credential machine should not enter optimize(), which
# would directly error out ('No cloud is enabled...'). Fix
# by moving `sky check` checks out of optimize()?

controller = controller_utils.Controllers.from_name(
cluster_name)
if controller is not None:
logger.info(
f'Choosing resources for {controller.name}...')
dag = sky.optimize(dag, minimize=optimize_target)
task = dag.tasks[0] # Keep: dag may have been deep-copied.
assert task.best_resources is not None, task
Expand Down

0 comments on commit c8ea12b

Please sign in to comment.