[GSProcessing] Move to Spark 3.5, bump version to 0.3.0, add homogeneous edge mapping optimization #791

Merged: 2 commits into awslabs:main from the gsp-spark-3.5 branch on Apr 3, 2024

thvasilo (Contributor) commented Apr 2, 2024

Issue #, if available:

Description of changes:

  • Optimizations for edge re-mapping:
    • When an edge type is homogeneous (same src and dst node type), we cache and re-use the node ID mapping DF instead of loading it from storage twice.
    • Remove enforced re-partitions where possible, because they trigger full DF shuffles.
    • Combined, these changes reduced total job runtime by up to 33% on a random graph with 10B edges and 1B nodes.
  • Bumps GSProcessing version to 0.3.0.
  • Allows Spark up to 3.5 in project dependencies.
  • Adds 0.3.0 Dockerfiles with Spark 3.5 for emr-s and sagemaker.
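
The homogeneous edge optimization described above can be sketched roughly as follows. This is a minimal illustration and not the GSProcessing implementation: plain dicts stand in for the Spark node ID mapping DataFrames, and `load_edge_mappings`, `remap_edges`, `counting_loader`, and `STORAGE` are hypothetical names invented for this example. It only shows the idea of loading the mapping once when src and dst node types match.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a node ID mapping: original string ID -> integer ID.
# In a Spark job this would be a DataFrame read from storage, not a dict.
Mapping = Dict[str, int]


def load_edge_mappings(
    src_type: str,
    dst_type: str,
    load_mapping: Callable[[str], Mapping],
) -> Tuple[Mapping, Mapping]:
    """Return (src_mapping, dst_mapping), loading only once for homogeneous edges."""
    src_mapping = load_mapping(src_type)
    if src_type == dst_type:
        # Homogeneous edge: re-use the mapping that is already loaded.
        # With Spark, one would typically also cache the DataFrame at this point.
        dst_mapping = src_mapping
    else:
        dst_mapping = load_mapping(dst_type)
    return src_mapping, dst_mapping


def remap_edges(
    edges: List[Tuple[str, str]], src_mapping: Mapping, dst_mapping: Mapping
) -> List[Tuple[int, int]]:
    """Replace original node IDs with integer IDs for every (src, dst) pair."""
    return [(src_mapping[s], dst_mapping[d]) for s, d in edges]


# Toy storage plus a loader that counts how often each node type is read.
STORAGE = {"user": {"u1": 0, "u2": 1}, "movie": {"m1": 0}}
load_counts: Dict[str, int] = {}


def counting_loader(node_type: str) -> Mapping:
    load_counts[node_type] = load_counts.get(node_type, 0) + 1
    return STORAGE[node_type]


# Homogeneous edge type (user, follows, user): the mapping is loaded once.
src_map, dst_map = load_edge_mappings("user", "user", counting_loader)
follows = remap_edges([("u1", "u2"), ("u2", "u1")], src_map, dst_map)

# Heterogeneous edge type (user, rates, movie): one load per node type.
src_map, dst_map = load_edge_mappings("user", "movie", counting_loader)
rates = remap_edges([("u2", "m1")], src_map, dst_map)
```

The counting loader is there only to make the saving visible: the homogeneous case triggers a single read, where a naive version would read the same mapping twice.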

Testing

Ran pytest, test jobs on SageMaker and EMR-S using the latest image versions with ml-100k, and a large-scale test on SageMaker with a 1B-edges-100M-nodes-1024feat graph.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

thvasilo added the 0.3, ready (able to trigger the CI), and gsprocessing (For issues and PRs related to the GSProcessing library) labels Apr 2, 2024
thvasilo added this to the 0.3 release milestone Apr 2, 2024
thvasilo requested a review from jalencato April 2, 2024 21:49
thvasilo self-assigned this Apr 2, 2024

jalencato (Collaborator) left a comment

Some questions and a little comment.

jalencato (Collaborator) left a comment

LGTM

thvasilo changed the title from "[GSProcessing] Move to Spark 3.5, bump version to 0.3.0, add homogeneous egde mapping optimization" to "[GSProcessing] Move to Spark 3.5, bump version to 0.3.0, add homogeneous edge mapping optimization" Apr 3, 2024
thvasilo merged commit 842b6f5 into awslabs:main Apr 3, 2024
3 checks passed
thvasilo deleted the gsp-spark-3.5 branch April 3, 2024 01:42