
Add support for direct S3 access on SageMaker tasks #1081

Open
3 tasks
thvasilo opened this issue Oct 30, 2024 · 1 comment

Comments


thvasilo commented Oct 30, 2024

Because DistDGL, and by extension GraphStorm, assumes a shared filesystem to function properly, our SageMaker implementations need to perform various downloads and uploads to "fake" the existence of a shared filesystem, by downloading data locally to specific locations on each instance.

This introduces a maintenance burden, as we can't make the same environment assumptions for SageMaker execution as for EC2 with EFS, and it requires a lot of glue code to make the two systems compatible.
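To make the kind of glue code in question concrete, here is a minimal sketch of the path mapping such a download step needs: each S3 object under a prefix is assigned a local path so that every instance sees the same directory layout. The function name, the bucket, and the `/opt/ml/data` root are all hypothetical and are not taken from the GraphStorm codebase.

```python
from pathlib import PurePosixPath


def local_path_for_key(s3_prefix_uri: str, object_uri: str, local_root: str) -> str:
    """Map an S3 object URI under a prefix to a local path (hypothetical helper).

    The relative layout below the prefix is preserved, so code that expects
    a shared filesystem finds files at the same relative locations.
    """
    prefix = s3_prefix_uri.rstrip("/") + "/"
    if not object_uri.startswith(prefix):
        raise ValueError(f"{object_uri} is not under {prefix}")
    relative = object_uri[len(prefix):]
    return str(PurePosixPath(local_root) / relative)


# Hypothetical bucket and paths, for illustration only.
path = local_path_for_key(
    "s3://example-bucket/graph",
    "s3://example-bucket/graph/part0/graph.dgl",
    "/opt/ml/data",
)
print(path)
```

A real implementation would also have to perform the transfer itself and the reverse upload of outputs, which is the part a mounted or streamed filesystem would make unnecessary.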

Mountpoint for S3 is an AWS project that allows entire S3 buckets to be mounted onto EC2 instances and treated as a (mostly) regular filesystem. If we are able to use S3 buckets as virtual shared filesystems for SageMaker, we should be able to simplify and align the codebase. We note that the use cases suggested by the mountpoint-s3 project align with ours:

Mountpoint for Amazon S3 is optimized for applications that need high read throughput to large objects, potentially from many clients at once, and to write new objects sequentially from a single client at a time. This means it's a great fit for applications that use a file interface to:
* read large objects from S3, potentially from many instances concurrently, without downloading them to local storage first
* access only some S3 objects out of a larger data set, but can't predict which objects in advance
* upload their output to S3 directly, or upload files from local storage with tools like cp

but probably not the right fit for applications that:
* use file operations that S3 doesn't natively support, like directory renaming or symlinks
* make edits to existing files (don't work on your Git repository or run vim in Mountpoint 😄)

We propose starting with a POC that modifies our SageMaker images and entry points to use mountpoint-s3, but does not affect the user-facing launch scripts, providing a backwards-compatible solution for our users.

Our first target will be adding GraphBolt support to SageMaker DistPartition, which is currently not possible, because DistDGL to GraphBolt partition conversion assumes that the leader instance has access to the entire distributed graph on disk. Following that, we can migrate our other SageMaker tasks to mountpoint-s3, where shared filesystems are normally required:

  • DistPartition, remove download/upload of data from S3
  • DistTraining
  • DistInference
thvasilo changed the title from "Add support for mountpoint-s3 on SageMaker tasks" to "Add support for direct S3 access on SageMaker tasks" on Oct 30, 2024

thvasilo commented Oct 30, 2024

After some investigation, it seems that mountpoint-s3 might not be a viable solution, because it requires containers to be launched in a specific way that SageMaker does not support. I will instead look into other SageMaker file modes, although for GraphBolt we require access to files that are created by the job rather than pre-existing ones.

https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

EDIT: The file modes available on SageMaker do not allow reading files that are created on S3 during the training/processing job, which makes them hard to use for our purposes. In addition, streaming file modes create read-only filesystems on SageMaker containers, which does not allow e.g. DGL to convert DistDGL files to GraphBolt in-place.
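For reference, the input mode is chosen per channel in the `InputDataConfig` of a training job request. The sketch below only builds that dictionary in plain Python, following the shape used by the `CreateTrainingJob` API. The helper name, channel name, and S3 URI are hypothetical, and no job is actually submitted.

```python
VALID_INPUT_MODES = {"File", "FastFile", "Pipe"}


def make_channel(name: str, s3_uri: str, input_mode: str = "FastFile") -> dict:
    """Build one InputDataConfig entry for a training job (hypothetical helper)."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"Unsupported input mode: {input_mode}")
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        # "File" copies the data to local disk before the job starts;
        # "FastFile" and "Pipe" stream it from S3 instead.
        "InputMode": input_mode,
    }


# Hypothetical channel name and bucket, for illustration only.
channel = make_channel("graph", "s3://example-bucket/graph/")
print(channel["InputMode"])
```

Whichever mode is selected, the limitations described above still apply: files written to S3 while the job runs are not visible through the channel, and the streamed filesystem is read-only.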

thvasilo added a commit that referenced this issue Nov 11, 2024
…ageMaker (#1083)

*Issue #, if available:*


*Description of changes:*

* Add a new SageMaker job to convert DistPart data to GraphBolt. This is
our only option currently as there's no way to directly use S3 as a
writable, shared file system in SageMaker, see
#1081 for details.
* The `sagemaker/launch_graphbolt_convert.py` script launches the SageMaker
job, which downloads the entire partitioned graph to one instance and then
runs the GB conversion, one partition at a time. Because DGL writes the
new fused CSC graph representation in the same directory as the input
data, we can't use SageMaker's FastFile mode to stream the data,
as that creates a read-only filesystem.
* [Optional] We also include an example of how one could use a SageMaker
Pipeline to run the GSPartition and GBConvert jobs in sequence, but this
can be removed (because SageMaker Pipelines are persistent once
created).
* Added a unit test mechanism for the SageMaker scripts, starting with
tests of our parsing logic. To make the scripts available to the runner's
Python runtime, we add the `graphstorm/sagemaker/launch` directory to the
runner's `PYTHONPATH`.
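
As a rough illustration of what testing launch-script parsing logic can look like, here is a minimal sketch. The parser, its argument names, and the default instance type are hypothetical and do not reflect the actual arguments of `launch_graphbolt_convert.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical argument parser for a SageMaker launch script."""
    parser = argparse.ArgumentParser(description="Hypothetical launcher arguments")
    parser.add_argument("--graph-data-s3", required=True,
                        help="S3 prefix holding the partitioned graph")
    parser.add_argument("--instance-type", default="ml.m5.4xlarge",
                        help="SageMaker instance type for the job")
    return parser


def parse_args(argv: list) -> argparse.Namespace:
    """Parse and validate arguments, rejecting non-S3 input locations."""
    args = build_parser().parse_args(argv)
    if not args.graph_data_s3.startswith("s3://"):
        raise ValueError("--graph-data-s3 must be an S3 URI")
    return args


def test_parse_args_defaults():
    """Example unit test: defaults are applied and the S3 URI is kept."""
    args = parse_args(["--graph-data-s3", "s3://example-bucket/graph"])
    assert args.graph_data_s3 == "s3://example-bucket/graph"
    assert args.instance_type == "ml.m5.4xlarge"
```

Keeping parsing in a function that takes an explicit `argv` list lets a test call it directly without patching `sys.argv`, which is one reason the scripts only need to be importable via `PYTHONPATH`.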

EDIT: One note about the PR: the changes to the partition launch that
use a SageMaker Pipeline are for demonstration purposes. I think I'll
remove them altogether and just have separate partition/gbconvert jobs.
But we might want to have an example of how to programmatically build an
SM pipeline, e.g. from gsprocessing to training (as SM jobs).

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>