Fix a bug in inference code #696

zheng-da · 2023-12-27T05:15:54Z

Description of changes:
This PR reorganizes the code that performs inference for GNN and LM models.
Specifically, we split the nodes for inference based on locality. In this case, the embeddings are saved to local partitions via shared memory, and we only need to run a barrier before returning from the inference function.
However, when computing LM embeddings, we split the nodes evenly so that all processes take roughly the same amount of time to compute them. Otherwise, some processes time out in the barrier. Because the nodes are now split evenly, some embeddings must be written to remote memory. Before returning from the inference function, we need to call flush_data to ensure that all data written to distributed memory can be read correctly.
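
To illustrate the trade-off described above, here is a minimal sketch in plain Python. It does not use DGL or GraphStorm, and the helper names (`split_by_locality`, `split_evenly`, `remote_nodes`) and the toy ownership map are hypothetical, chosen only for illustration. It shows that a locality-based split can be unbalanced but writes only to local partitions, while an even split balances the work but leaves some ranks writing embeddings for nodes they do not own.

```python
# Illustrative sketch only: plain Python, no DGL or GraphStorm calls.
# All helper names below are hypothetical.

def split_by_locality(node_ids, owner_of, rank):
    """Return the nodes whose partition is owned by `rank`."""
    return [n for n in node_ids if owner_of[n] == rank]

def split_evenly(node_ids, rank, world_size):
    """Return a contiguous, near-equal share of the nodes for `rank`."""
    base, extra = divmod(len(node_ids), world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return node_ids[start:end]

def remote_nodes(assigned, owner_of, rank):
    """Nodes assigned to `rank` whose embeddings live on another partition."""
    return [n for n in assigned if owner_of[n] != rank]

node_ids = list(range(10))
# Assumed skewed ownership: rank 0 owns 8 nodes, rank 1 owns 2.
owner_of = {n: 0 if n < 8 else 1 for n in node_ids}

local_split = [split_by_locality(node_ids, owner_of, r) for r in range(2)]
even_split = [split_evenly(node_ids, r, 2) for r in range(2)]
remote_writes = [remote_nodes(even_split[r], owner_of, r) for r in range(2)]

print([len(s) for s in local_split])  # workload per rank under the locality split
print([len(s) for s in even_split])   # workload per rank under the even split
print(remote_writes)                  # nodes each rank must write remotely
```

In this toy setup the even split gives both ranks the same number of nodes, but rank 1 ends up with nodes owned by rank 0. Those are the remote writes that, per the description above, need flush_data before other processes read them.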

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

python/graphstorm/model/gnn_encoder_base.py

This reverts commit 0375ace.

Ubuntu and others added 2 commits October 2, 2023 15:49

fix inference nodes.

0375ace

fix.

7860173

zheng-da requested a review from classicsong December 27, 2023 05:18

zheng-da added the "ready" (able to trigger the CI) label Dec 27, 2023

classicsong reviewed Dec 27, 2023

python/graphstorm/model/gnn_encoder_base.py (outdated, resolved)

zheng-da added 4 commits December 28, 2023 16:06

Merge branch 'main' into fix_infer

9cd17f4

Merge remote-tracking branch 'zhengda/fix_infer' into fix_infer

e53a321

fix.

6a34ba9

Revert "fix inference nodes."

e248705

This reverts commit 0375ace.

classicsong approved these changes Dec 30, 2023

zheng-da merged commit e82cb8e into awslabs:main Dec 30, 2023
6 checks passed
