Add support for direct S3 access on SageMaker tasks #1081

thvasilo · 2024-10-30T18:41:31Z

Because DistDGL, and by extension GraphStorm, assumes a shared filesystem to function properly, our SageMaker implementations need various downloads and uploads to "fake" the existence of a shared filesystem, by downloading data locally to specific locations on each instance.

This introduces a maintenance burden, as we can't make the same environment assumptions for SageMaker as for EC2 with EFS execution, and it requires a lot of glue code to make the two systems compatible.
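To make the glue code concrete, here is a minimal sketch of the S3-to-local path mapping each instance needs before it can pretend a shared filesystem exists. The helper name `plan_local_copies`, the example keys, and the local root are hypothetical and not taken from the GraphStorm codebase; the actual transfer (e.g. with boto3) is deliberately left out.

```python
from pathlib import PurePosixPath


def plan_local_copies(s3_keys, s3_prefix, local_root):
    """Map each S3 key under s3_prefix to the local path it would be copied to.

    Hypothetical helper for illustration only; it performs no network calls.
    """
    prefix = s3_prefix.rstrip("/") + "/"
    plan = []
    for key in s3_keys:
        # Skip keys outside the prefix and "directory" placeholder objects.
        if not key.startswith(prefix) or key.endswith("/"):
            continue
        relative = key[len(prefix):]
        plan.append((key, str(PurePosixPath(local_root) / relative)))
    return plan


# Example inputs are made up; the local root is only an illustrative path.
example_plan = plan_local_copies(
    ["graph/part0/graph.dgl", "graph/metadata.json", "other/file.txt"],
    "graph",
    "/opt/ml/processing/input",
)
for s3_key, local_path in example_plan:
    print(s3_key, "->", local_path)
```

Every task that needs the data has to repeat a mapping like this on every instance, and the reverse mapping for uploads, which is the duplication a real shared filesystem would remove.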

Mountpoint for S3 is an AWS project that allows entire S3 buckets to be mounted onto EC2 instances and treated as a (mostly) regular filesystem. If we are able to use S3 buckets as virtual shared filesystems for SageMaker, we should be able to simplify and align the codebase. We note that the use cases suggested by the mountpoint-s3 project align with ours:

Mountpoint for Amazon S3 is optimized for applications that need high read throughput to large objects, potentially from many clients at once, and to write new objects sequentially from a single client at a time. This means it's a great fit for applications that use a file interface to:
* read large objects from S3, potentially from many instances concurrently, without downloading them to local storage first
* access only some S3 objects out of a larger data set, but can't predict which objects in advance
* upload their output to S3 directly, or upload files from local storage with tools like cp

but probably not the right fit for applications that:
* use file operations that S3 doesn't natively support, like directory renaming or symlinks
* make edits to existing files (don't work on your Git repository or run vim in Mountpoint 😄)

We propose starting with a POC that modifies our SageMaker images and entry points to use mountpoint-s3, but does not affect the user-facing launch scripts, providing a backwards-compatible solution for our users.
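As a rough sketch of what the entry point side of such a POC could look like, the snippet below only builds the `mount-s3` command line that an entry point would need to run before starting the task. The helper name `build_mount_command`, the bucket name, and the mount directory are hypothetical; only the `mount-s3 <bucket> <directory>` form and the `--prefix` and `--read-only` options are taken from the Mountpoint CLI, and nothing is executed here.

```python
import shlex


def build_mount_command(bucket, mount_dir, prefix=None, read_only=False):
    """Build the argv for a mount-s3 invocation (hypothetical helper)."""
    cmd = ["mount-s3", bucket, mount_dir]
    if prefix:
        # Mountpoint expects the prefix to end with a slash.
        cmd += ["--prefix", prefix.rstrip("/") + "/"]
    if read_only:
        cmd.append("--read-only")
    return cmd


# Placeholder bucket and mount directory, for illustration only.
mount_cmd = build_mount_command("example-bucket", "/mnt/graph", prefix="partitions")
print(shlex.join(mount_cmd))
```

An entry point would pass a list like this to `subprocess.run` and then point the task at the mount directory, leaving the user-facing launch scripts unchanged.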

Our first target will be adding GraphBolt support to SageMaker DistPartition, which is currently not possible, because DistDGL to GraphBolt partition conversion assumes that the leader instance has access to the entire distributed graph on disk. Following that, we can migrate our other SageMaker tasks to mountpoint-s3, where shared filesystems are normally required:

* DistPartition: remove download/upload of data from S3
* DistTraining
* DistInference

thvasilo · 2024-10-30T21:26:26Z

After some investigation, it seems that mountpoint-s3 might not be a viable solution, because it requires containers to be launched in a specific way that SageMaker does not support. I will instead look into other SageMaker file modes, although for GraphBolt we require access to files that are created by the job and are not pre-existing.

https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

EDIT: The file modes available on SageMaker do not allow reading files that are created on S3 during the training/processing job, which makes them hard to use for our purposes. In addition, streaming file modes create read-only filesystems on SM containers, which does not allow e.g. DGL to convert DistDGL files to GraphBolt in-place.
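The read-only restriction can be checked up front. The snippet below is a minimal sketch, assuming only that an in-place conversion needs to create new files next to its input; the helper name `can_write_in_place` is hypothetical and not part of GraphStorm or DGL.

```python
import tempfile


def can_write_in_place(directory):
    """Return True if a new file can be created inside `directory`."""
    try:
        # The temporary file is deleted again as soon as the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers read-only filesystems, missing directories and permission errors.
        return False


print(can_write_in_place(tempfile.gettempdir()))
```

A job could run a probe like this on its input directory and fail early with a clear message, rather than partway through a conversion.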

…ageMaker (#1083)

*Issue #, if available:*

*Description of changes:*

* Add a new SageMaker job to convert DistPart data to GraphBolt. This is our only option currently, as there's no way to directly use S3 as a writable, shared file system in SageMaker; see #1081 for details.
* The `sagemaker/launch_graphbolt_convert.py` script will launch the SageMaker job, which downloads the entire partitioned graph to one instance, then runs the GB conversion, one partition at a time. Because DGL writes the new fused CSC graph representation in the same directory as the input data, we can't use one of SageMaker's FastFile modes to stream the data, as that creates read-only filesystems.
* [Optional] We also include an example of how one could use a SageMaker Pipeline to run the GSPartition and GBConvert jobs in sequence, but this can be removed (because SageMaker Pipelines are persistent once created).
* Added unit test mechanism to test sagemaker scripts; we start with testing our parsing logic. To make the scripts available to the runner's python runtime we add the `graphstorm/sagemaker/launch` directory to the runner's `PYTHONPATH`.

EDIT: One note about the PR: The changes to the partition launch that use a SageMaker Pipeline are for demonstration purposes. I think I'll remove them altogether and just have separate partition/gbconvert jobs. But we might want to have an example of how to programmatically build an SM pipeline, e.g. from gsprocessing to training (as SM jobs).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>
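The "one partition at a time" step can be sketched as a simple loop over the partition metadata. This is not the code from the PR: the function name `convert_partitions` and the `convert_fn` callback are hypothetical stand-ins for the real DGL conversion call, and the sketch assumes the partition metadata JSON carries a `num_parts` entry.

```python
import json
import tempfile
from pathlib import Path


def convert_partitions(metadata_path, convert_fn):
    """Call convert_fn(metadata_path, part_id) once per partition, in order.

    Assumes the metadata JSON has a "num_parts" entry; convert_fn is a
    placeholder for whatever performs the actual GraphBolt conversion.
    """
    with open(metadata_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    handled = []
    for part_id in range(metadata["num_parts"]):
        convert_fn(metadata_path, part_id)
        handled.append(part_id)
    return handled


# Demo with a made-up two-partition metadata file and a callback that only records calls.
calls = []
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_metadata = Path(tmp_dir) / "metadata.json"
    demo_metadata.write_text(json.dumps({"num_parts": 2}), encoding="utf-8")
    handled_parts = convert_partitions(
        demo_metadata, lambda path, part_id: calls.append(part_id)
    )
print(handled_parts)
```

Processing partitions sequentially on a single instance keeps only one partition's conversion in flight at a time, at the cost of first downloading the whole partitioned graph to that instance.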

thvasilo added the sagemaker label Oct 30, 2024

thvasilo changed the title ~~Add support for mountpoint-s3 on SageMaker tasks~~ Add support for direct S3 access on SageMaker tasks Oct 30, 2024

thvasilo mentioned this issue Nov 1, 2024

[SageMaker] [GraphBolt] Add support for launching GraphBolt jobs on SageMaker #1083

Merged
