
[Jobs] Allowing to specify intermediate bucket for file upload #4257

Open · wants to merge 51 commits into base: master

Conversation

@zpoint (Collaborator) commented Nov 4, 2024

Feature for #3978

  • Add a config jobs: bucket in ~/.sky/config.yaml.
    • Supports s3, gcs, azure, r2, and ibm.
    • If this bucket is specified, its bucket name and storage type are used. Otherwise, one is generated, named skypilot-filemounts-{username}-{run_id} (see the sketch after this list).
    • Add an attribute to the storage YAML config: _bucket_sub_path (both COPY and MOUNT are supported based on the _bucket_sub_path; if a sub-path is specified, the user only sees the sub-path instead of the whole bucket).
      • The leading _ means internal use only; remove the _ to expose it in the future.
    • Smoke tests:
      • Add a smoke test case: pytest -s tests/test_smoke.py::test_managed_jobs_intermediate_storage
        • Verifies that the new config file is loaded and the intermediate bucket names from the config are used.
      • Add a test case: pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --gcp
        • Verifies that file sync to and deletion of the bucket subdirectory behave as expected; mainly, that _bucket_sub_path works as expected.

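A minimal sketch of the bucket-selection logic described above (the function and return shape are illustrative, not the PR's actual code; only the naming scheme and URL format come from this description):

```python
from typing import Optional, Tuple

def pick_intermediate_bucket(configured_bucket: Optional[str], username: str,
                             run_id: str) -> Tuple[str, Optional[str]]:
    """Return (bucket_name, store_type) for managed-job file uploads."""
    if configured_bucket:
        # e.g. 's3://my-bucket/' -> store type 's3', bucket 'my-bucket'.
        store_type, bucket_name = configured_bucket.split('://', maxsplit=1)
        return bucket_name.strip('/'), store_type
    # No jobs.bucket in ~/.sky/config.yaml: fall back to an auto-generated,
    # sky-managed bucket; the store type is chosen by the existing defaults.
    return f'skypilot-filemounts-{username}-{run_id}', None
```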
For managed jobs:

  • All of these file sync operations now use the intermediate bucket. Previously, each file sync created a new temporary bucket:
    • Previous:
      • workdir specified in the task YAML: workdir -> {tmp-bucket-A}/workdir
      • Local file mount specified in the task YAML: file_mount -> {tmp-bucket-B}{i}
      • Temporary files: tmp_files -> {tmp-bucket-C}/tmp-files
    • Now:
      • workdir specified in the task YAML: workdir -> {intermediate-bucket}/job-{run_id}/workdir
      • Local file mount specified in the task YAML: file_mount -> {intermediate-bucket}/job-{run_id}/local-file-mounts/{i}
      • Temporary files: tmp_files -> {intermediate-bucket}/job-{run_id}/tmp-files
  • Cleanup (see the sketch after this list):
    • After the managed job finishes and cleanup is triggered:
      • If the user doesn't specify a bucket in the config, the sky-generated {intermediate-bucket}/job-{run_id} bucket is deleted automatically, as before.
      • If the user specifies a bucket name in the config, only the directory created above is deleted; the bucket itself remains, and the user has to delete it manually as usual (due to a potential race problem between jobs sharing the bucket).
    • Currently AWS, GCS, and R2 use their command-line tools to remove the directory, while IBM COS and Azure (https) use the SDK to list all files under the directory and then delete them.

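A hedged sketch of the cleanup rule above (the helper names are hypothetical stand-ins for the per-cloud logic in storage.py):

```python
def delete_prefix(bucket_name: str, prefix: str) -> None:
    # Stand-in for the per-cloud removal (CLI for AWS/GCS/R2; SDK listing
    # plus per-object delete for IBM COS and Azure, as described above).
    print(f'deleting {bucket_name}/{prefix}/** (bucket kept)')

def delete_bucket(bucket_name: str) -> None:
    print(f'deleting bucket {bucket_name} entirely')

def cleanup_after_job(bucket_name: str, run_id: str,
                      user_specified_bucket: bool) -> None:
    """Illustrative cleanup dispatch, not the PR's actual code."""
    sub_path = f'job-{run_id}'
    if user_specified_bucket:
        # User-provided bucket: remove only this job's subdirectory; the
        # bucket stays, since other jobs may be sharing it.
        delete_prefix(bucket_name, sub_path)
    else:
        # Sky-generated bucket: safe to delete the whole bucket, as before.
        delete_bucket(bucket_name)
```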
Even when using the same bucket, each job creates its own subdirectory based on the {run_id}, and the subdirectories of auto-created files are cleaned up after the job finishes.

This allows users who don't have permission to create buckets to manually specify a bucket name in ~/.sky/config.yaml.

Test plan

Smoke tests

pytest -s tests/test_smoke.py::test_managed_jobs_storage
pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --aws

Custom test

(sky) ➜  cat ~/.sky/config.yaml
jobs:
  bucket: s3://zpoint-bucket-s3/
(sky) ➜ cat ~/Desktop/hello-sky/work_dir_1.yaml
name: test_workdir_bucket_name_1

workdir: .

resources:
  cloud: aws
  instance_type: t3.small

file_mounts:
  # this will use the user config
  /checkpoint:
    name: zpoint-filemounts-bucket
    source: ~/Desktop/dir1
    mode: MOUNT
    store: azure

  # these will all use the same bucket configured under jobs -> bucket in ~/.sky/config.yaml
  /dir1: ~/Desktop/dir1
  /dir2: ~/Desktop/dir2
  /dir3/dir3.py: ~/Desktop/dir1/dir1.py

run: |
  sleep 10
  ls /checkpoint
  ls .

Launch

(sky) ➜ sky jobs launch ~/Desktop/hello-sky/work_dir_1.yaml
⠋ Syncing ~/Desktop/dir1 -> https://xxx/zpoint-filemounts-bucket/
⠹ Syncing . -> s3://zpoint-bucket-s3/
⠧ Syncing ~/Desktop/dir1 -> s3://zpoint-bucket-s3/
⠦ Syncing ~/Desktop/dir2 -> s3://zpoint-bucket-s3/
⠴ Syncing /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/skypilot-filemounts-files-aca97801 -> s3://zpoint-bucket-s3/

Looks good

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@romilbhardwaj romilbhardwaj self-requested a review November 6, 2024 18:42
@romilbhardwaj (Collaborator) left a comment:

Thanks @zpoint! Left some comments about the design, correctness and lifecycle management for the bucket contents.

@zpoint zpoint requested a review from romilbhardwaj November 8, 2024 09:13
@zpoint zpoint requested a review from romilbhardwaj November 9, 2024 15:39
Comment on lines 715 to 719
jobs_configs['properties']['bucket'] = {
    'type': 'string',
    'pattern': '^(https|s3|gs|r2|cos)://.+',
    'required': []
}
@romilbhardwaj (Collaborator):

Maybe we should include it in controller_resources_schema and enable this feature for SkyServe as well?
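For reference, a quick illustrative check of the URL forms the quoted pattern accepts (the Azure URL below is a hypothetical example; Azure buckets surface as https:// URLs in this PR):

```python
import re

BUCKET_PATTERN = re.compile(r'^(https|s3|gs|r2|cos)://.+')

assert BUCKET_PATTERN.match('s3://my-bucket')
assert BUCKET_PATTERN.match('gs://my-bucket/sub/dir')
assert BUCKET_PATTERN.match('https://myaccount.blob.core.windows.net/container')
assert not BUCKET_PATTERN.match('my-bucket')  # a scheme prefix is required
```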

@zpoint (Collaborator, Author):

Added to controller_resources_schema.

Could you point me to the code I would need to change to enable this for SkyServe?

@romilbhardwaj (Collaborator):

Thanks @zpoint! I'm trying to run this yaml:

resources:
  cloud: aws

file_mounts:
  ~/aws: ~/aws

workdir: ~/tmp-workdir

num_nodes: 1

run: |
  echo "Hello, world!"
  ls ~/aws
  ls .

The task output should show me the contents of my workdir and file_mount. Instead, I get:

├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-3ba1-romilb, pid=2100) Hello, world!
(sky-3ba1-romilb, pid=2100) job-89ab894f
(sky-3ba1-romilb, pid=2100) job-89ab894f
✓ Managed job finished: 1 (status: SUCCEEDED).

Related to this: #4257 (comment)?

@zpoint zpoint requested a review from romilbhardwaj November 12, 2024 10:01
@romilbhardwaj (Collaborator) left a comment:

Thanks @zpoint. I think we should revert to the previous config format (see comment):

jobs:
  bucket: s3://mybucket # or gs://mybucket ...

@zpoint zpoint requested a review from romilbhardwaj November 26, 2024 11:59
@romilbhardwaj (Collaborator) left a comment:

Thanks @zpoint - I tested the functionality with an external S3 bucket, and it works well. Left some comments.

Comment on lines 192 to 193
def from_store_url(cls,
                   store_url: str) -> Tuple[str, str, str, Dict[str, Any]]:
@romilbhardwaj (Collaborator):

MyClass.from_xyz() implies the method will return a MyClass object. Can we move this out from the class into a helper in data_utils or outside in storage.py?

Please reuse data_utils.split_xyz_path to get the bucket name and subpath.
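(For context, a standalone sketch of what such a split helper does; this is illustrative, not the actual data_utils code:)

```python
from typing import Tuple

def split_store_url(store_url: str) -> Tuple[str, str, str]:
    """Split e.g. 's3://bucket/sub/dir' into ('s3', 'bucket', 'sub/dir')."""
    store_type, rest = store_url.split('://', maxsplit=1)
    bucket_name, _, sub_path = rest.partition('/')
    return store_type, bucket_name, sub_path.rstrip('/')

assert split_store_url('s3://my-bucket/job-123/workdir') == (
    's3', 'my-bucket', 'job-123/workdir')
```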

@zpoint (Collaborator, Author) commented Dec 2, 2024:

It has to be here; otherwise, either data_utils or storage_utils will cause a circular import.

Renamed.

@zpoint zpoint requested a review from romilbhardwaj December 2, 2024 15:27
@romilbhardwaj (Collaborator) left a comment:

Thanks @zpoint. Left some comments on the new interfaces added in the latest commits.

Comment on lines 409 to 410
def delete_sub_path(self) -> None:
    """Removes objects from the sub path in the bucket."""
@romilbhardwaj (Collaborator):

I feel delete_sub_path should be an internal method that is called by delete when a _bucket_sub_path is set.

In summary:

  • If a Storage object is initialized with _bucket_sub_path and the bucket already existed, is_sky_managed should be false and Storage.delete() should delete only the sub-path.
  • If a Storage object is initialized with _bucket_sub_path and the bucket did not already exist, is_sky_managed should be true and Storage.delete() should delete the entire bucket.
  • If a Storage object is not initialized with any _bucket_sub_path, we follow the current behavior.

Then delete_sub_path() will not be needed as a public method of this class.
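A runnable model of those three cases (the class and method names mirror the discussion, but the code is an illustrative sketch, not the PR's implementation):

```python
from typing import Optional

class StorageSketch:
    """Models the proposed Storage.delete() semantics."""

    def __init__(self, bucket_pre_existed: bool,
                 bucket_sub_path: Optional[str] = None):
        self._bucket_sub_path = bucket_sub_path
        # A bucket that already existed was not created by sky.
        self.is_sky_managed = not bucket_pre_existed

    def delete(self) -> None:
        if self._bucket_sub_path is not None and not self.is_sky_managed:
            # Case 1: pre-existing bucket + sub-path -> remove sub-path only.
            self._delete_sub_path()
        elif self.is_sky_managed:
            # Case 2: bucket was created by sky -> remove it entirely.
            self._delete_bucket()
        # Case 3: no sub-path and externally managed -> current behavior
        # (leave the external bucket alone).

    def _delete_sub_path(self) -> None:
        print(f'deleting objects under {self._bucket_sub_path!r}, keeping bucket')

    def _delete_bucket(self) -> None:
        print('deleting the entire bucket')
```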

@zpoint (Collaborator, Author):

Yeah, sounds good

Comment on lines 758 to 759
if store_type is not None:
    storage_obj.construct_store(store_type, region,
@romilbhardwaj (Collaborator):

See the other comment on construct_store. Instead of this, can we initialize the store above and pass it through the stores arg: storage_lib.Storage(stores=..., ...)?


# Check the bucket is empty: all files under the sub-path should be deleted
store = tmp_local_storage_obj_with_sub_path.stores[store_type]
store._delete_sub_path()
@zpoint (Collaborator, Author):

When I test the delete path for Azure, this exception occurs; however, AWS and GCP pass. Azure worked before. Could it be an account permission problem? @Michaelvll @romilbhardwaj

[HttpResponseError] This request is not authorized to perform this operation using this permission.
RequestId:0aa67d01-401e-0048-3416-4ef84d000000
Time:2024-12-14T10:55:17.3755800Z
ErrorCode:AuthorizationPermissionMismatch
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>AuthorizationPermissionMismatch</Code><Message>This request is not authorized to perform this operation using this permission.
RequestId:0aa67d01-401e-0048-3416-4ef84d000000
Time:2024-12-14T10:55:17.3755800Z</Message></Error>

@romilbhardwaj (Collaborator):

Hmm, I just checked the Azure portal and you should have full permissions...

@zpoint zpoint requested a review from romilbhardwaj December 14, 2024 11:05