Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the way we save and load ID mappings for GConstruct. #641

Merged
merged 20 commits into from
Jan 17, 2024

Conversation

thvasilo
Copy link
Contributor

@thvasilo thvasilo commented Nov 14, 2023

Issue #, if available:

Description of changes:

  • This PR aligns the way node id mapping files are produced and loaded by GSF to that of GSProcessing
  • This is needed in order to ensure that downstream tasks can load mappings that were created either by GSProcessing or GConstruct.

Specifically:

  • Mappings are created with one directory per-node-type, under a raw_mappings_id directory.
  • Each node mappings consists of one or more Parquet files. We try to split the files s.t. no individual file is larger than 1GB.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Breaking changes:

  • Previously, GConstruct created the ID mappings as a single Parquet file, with its filename prefixed by the node type. With this change we create partitioned Parquet files, each node type under its own directory. The difference in behavior is exemplified below:

image

@thvasilo thvasilo added the ready able to trigger the CI label Nov 14, 2023
@thvasilo thvasilo force-pushed the id-remapping-alignment branch 5 times, most recently from 5f7ee71 to f84ed5b Compare November 15, 2023 23:58
@classicsong
Copy link
Contributor

Can you write down the details of how node id remapping files are stored in the PR description? And why we need this modification?

python/graphstorm/gconstruct/file_io.py Show resolved Hide resolved
python/graphstorm/gconstruct/file_io.py Outdated Show resolved Hide resolved
python/graphstorm/gconstruct/construct_graph.py Outdated Show resolved Hide resolved
python/graphstorm/gconstruct/remap_result.py Outdated Show resolved Hide resolved
python/graphstorm/sagemaker/utils.py Outdated Show resolved Hide resolved
python/graphstorm/sagemaker/utils.py Outdated Show resolved Hide resolved
python/graphstorm/sagemaker/utils.py Outdated Show resolved Hide resolved
python/graphstorm/sagemaker/utils.py Outdated Show resolved Hide resolved
tests/end2end-tests/graphstorm-ec/mgpu_test.sh Outdated Show resolved Hide resolved
@classicsong classicsong requested a review from zheng-da November 16, 2023 06:24
@thvasilo thvasilo force-pushed the id-remapping-alignment branch from d3076f9 to e9b58b6 Compare November 17, 2023 23:15
@thvasilo thvasilo self-assigned this Nov 20, 2023
@thvasilo thvasilo force-pushed the id-remapping-alignment branch from 9c7139b to 877a9a4 Compare January 8, 2024 21:39
@classicsong classicsong mentioned this pull request Jan 16, 2024
8 tasks
@thvasilo thvasilo merged commit e14c653 into awslabs:main Jan 17, 2024
15 checks passed
jalencato added a commit that referenced this pull request Feb 15, 2024
*Issue #, if available:*

*Description of changes:*

Align the doc about raw_id_mappings change in this PR:
#641

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
break back compatibility ready able to trigger the CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants