[Doc] Update doc of training and inference output. #964

Merged
merged 10 commits into from
Aug 15, 2024
126 changes: 126 additions & 0 deletions docs/source/advanced/multi-task-learning.rst
@@ -318,3 +318,129 @@ GraphStorm supports to run multi-task inference on :ref:`SageMaker<distributed-s
--instance-count <INSTANCE_COUNT> \
--instance-type <INSTANCE_TYPE>

Multi-task Learning Output
--------------------------

Saved Node Embeddings
~~~~~~~~~~~~~~~~~~~~~~
When ``save_embed_path`` is provided in the training config or inference config,
GraphStorm will save the node embeddings in the corresponding path.
In multi-task learning, by default, GraphStorm will save the node embeddings
produced by the GNN layer for every node type under the path specified by
``save_embed_path``. The output format follows the :ref:`GraphStorm saved node embeddings
format<gs-out-embs>`. Meanwhile, in multi-task learning, certain tasks might apply
task-specific normalization to node embeddings. For instance, a link prediction
task might apply L2 normalization to each node embedding. In certain cases, GraphStorm
will also save the normalized node embeddings under ``save_embed_path``.
The task-specific node embeddings are saved separately under different sub-directories
named with the corresponding task id. A task id is formatted as ``<task_type>-<ntype/etype(s)>-<label>``.
For instance, the task id of a node classification task on the node type ``paper`` with the
label field ``venue`` will be ``node_classification-paper-venue``. As another example,
the task id of a link prediction task on the edge type ``(paper, cite, paper)`` will be
``link_prediction-paper_cite_paper``,
and the task id of an edge regression task on the edge type ``(paper, cite, paper)`` with
the label field ``year`` will be ``edge_regression-paper_cite_paper-year``.
The output format of task-specific node embeddings follows
the :ref:`GraphStorm saved node embeddings format<gs-out-embs>`.
The contents of ``save_embed_path`` in multi-task learning will look like the following:

.. code-block:: bash

    emb_dir/
        ntype0/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        ntype1/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        emb_info.json
        link_prediction-paper_cite_paper/
            ntype0/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            ntype1/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            emb_info.json
        edge_regression-paper_cite_paper-year/
            ntype0/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            ntype1/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            emb_info.json

In the above example, both the link prediction task and the edge regression
task apply task-specific normalization to node embeddings.
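
The task id format described above can be sketched in a few lines of Python. This is only an illustration: ``make_task_id`` is a hypothetical helper, not a GraphStorm API, and it assumes a task targets either one node type (a string) or one edge type (a ``(src, relation, dst)`` triple), with the link prediction case simply omitting the label.

```python
from typing import Optional, Sequence, Union


def make_task_id(task_type: str,
                 target: Union[str, Sequence[str]],
                 label: Optional[str] = None) -> str:
    """Build a task id of the form <task_type>-<ntype/etype(s)>-<label>.

    Hypothetical helper for illustration only; not part of GraphStorm.
    """
    # A node type is a plain string. An edge type is a (src, relation, dst)
    # triple whose parts are joined with underscores.
    target_str = target if isinstance(target, str) else "_".join(target)
    parts = [task_type, target_str]
    # Tasks without a label field (e.g. link prediction) have no label suffix.
    if label is not None:
        parts.append(label)
    return "-".join(parts)


print(make_task_id("node_classification", "paper", "venue"))
print(make_task_id("link_prediction", ("paper", "cite", "paper")))
print(make_task_id("edge_regression", ("paper", "cite", "paper"), "year"))
```

Joining the result with ``save_embed_path`` (for example with ``os.path.join``) gives the sub-directory in which the task-specific embeddings of that task would be expected.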

**Note: The built-in GraphStorm training or inference pipeline
(launched by GraphStorm CLI) will process each set of saved node embeddings
to convert the integer node ids into the raw node ids, which are usually string node ids.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<output-remapping>`.

Saved Prediction Results
~~~~~~~~~~~~~~~~~~~~~~~~~
When ``save_prediction_path`` is provided in the inference config,
GraphStorm will save the prediction results in the corresponding path.
In multi-task learning inference, each prediction task will have its prediction
results saved separately under a sub-directory named with the
corresponding task id. The output format of task-specific prediction results
follows the :ref:`GraphStorm saved prediction result format<gs-out-predictions>`.
The contents of ``save_prediction_path`` in multi-task learning will look like the following:

.. code-block:: bash

    prediction_dir/
        edge_regression-paper_cite_paper-year/
            paper_cite_paper/
                predict-00000.pt
                predict-00001.pt
                ...
                src_nids-00000.pt
                src_nids-00001.pt
                ...
                dst_nids-00000.pt
                dst_nids-00001.pt
                ...
            result_info.json
        node_classification-paper-venue/
            paper/
                predict-00000.pt
                predict-00001.pt
                ...
                predict_nids-00000.pt
                predict_nids-00001.pt
                ...
            result_info.json
        ...

**Note: The built-in GraphStorm inference pipeline
(launched by GraphStorm CLI) will process each saved prediction result
to convert the integer node ids into the raw node ids, which are usually string node ids.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<output-remapping>`.
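
As a rough sketch of how the per-task layout above could be walked, the following Python snippet lists each task sub-directory together with the node or edge type directories it contains. ``list_task_predictions`` is a hypothetical helper, not a GraphStorm API, and it assumes every task sub-directory holds its own ``result_info.json``, as in the listing above.

```python
from pathlib import Path
from typing import Dict, List


def list_task_predictions(prediction_dir: str) -> Dict[str, List[str]]:
    """Map each task id sub-directory to the type directories inside it.

    Hypothetical helper for illustration only; not part of GraphStorm.
    """
    tasks: Dict[str, List[str]] = {}
    for task_dir in sorted(Path(prediction_dir).iterdir()):
        # Assumption: a task output directory always carries a result_info.json.
        if task_dir.is_dir() and (task_dir / "result_info.json").is_file():
            tasks[task_dir.name] = sorted(
                p.name for p in task_dir.iterdir() if p.is_dir())
    return tasks
```

For the example layout, the returned dictionary would be expected to contain keys such as ``edge_regression-paper_cite_paper-year`` and ``node_classification-paper-venue``.
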
1 change: 1 addition & 0 deletions docs/source/cli/model-training-inference/index.rst
@@ -17,4 +17,5 @@ In addition, there are two node ID mapping operations during the graph construct
single-machine-training-inference
distributed/cluster
distributed/sagemaker
output
output-remapping
169 changes: 169 additions & 0 deletions docs/source/cli/model-training-inference/output.rst
@@ -0,0 +1,169 @@
.. _gs-output:

GraphStorm Output
=================

.. _gs-output-embs:

Saved Node Embeddings
---------------------
When ``save_embed_path`` is provided in the training config or inference config,
GraphStorm will save the node embeddings in the corresponding path. The node embeddings
of each node type are saved separately under different sub-directories named with
the corresponding node types. GraphStorm will also save an ``emb_info.json`` file,
which contains all the metadata for the saved node embeddings. The contents of ``save_embed_path``
will look like the following:

.. code-block:: bash

    emb_dir/
        ntype0/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        ntype1/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        ...
        emb_info.json

The ``embed_nids-*`` files store the integer node ID of each node embedding, and
the ``embed-*`` files store the corresponding node embeddings.
The contents of the ``embed_nids-*`` and ``embed-*`` files look like:

.. code-block::

    embed_nids-00000.pt  |   embed-00000.pt
                         |
    Graph Node ID        |   embeddings
    10                   |   0.112,0.123,-0.011,...
    1                    |   0.872,0.321,-0.901,...
    23                   |   0.472,0.432,-0.732,...
    ...
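
The row-by-row correspondence shown above could be read back roughly as follows. This is a sketch under assumptions, not GraphStorm's own loader: it assumes each ``.pt`` file holds a single tensor written with ``torch.save``, and that file names use the five-digit part index seen in the listing. Both function names are hypothetical.

```python
from pathlib import Path


def pair_ids_with_rows(nids, rows):
    """Return {node_id: row}, where the i-th ID owns the i-th row."""
    if len(nids) != len(rows):
        raise ValueError("node ID file and embedding file have different lengths")
    return {int(nid): row for nid, row in zip(nids, rows)}


def load_embedding_part(emb_dir, ntype, part=0):
    """Load one (embed_nids, embed) file pair for a node type.

    Hypothetical helper; assumes each file stores one tensor.
    """
    import torch  # imported lazily so the pairing helper works without PyTorch

    base = Path(emb_dir) / ntype
    nids = torch.load(base / f"embed_nids-{part:05d}.pt", map_location="cpu")
    embs = torch.load(base / f"embed-{part:05d}.pt", map_location="cpu")
    return pair_ids_with_rows(nids, embs)
```

With the example content above, ``load_embedding_part("emb_dir", "ntype0")`` would be expected to map node ``10`` to the row starting ``0.112, 0.123, -0.011``.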

The ``emb_info.json`` file stores three pieces of information:

* ``format``: The format of the saved embeddings. By default, it is ``pytorch``.
* ``emb_name``: A list of node types that have node embeddings saved. For example: ``["ntype0", "ntype1"]``.
* ``world_size``: The number of chunks (files) into which the node embeddings of a particular node type are divided. For instance, if ``world_size`` is set to 8, there will be 8 files for each set of node embeddings.
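
To illustrate how the three fields fit together, the sketch below reads ``emb_info.json`` and lists the files it implies for each node type. ``expected_embedding_files`` is a hypothetical helper, and the assumption that part indices run from ``0`` to ``world_size - 1`` with five-digit zero padding is inferred from the directory listing above rather than taken from a documented guarantee.

```python
import json
from pathlib import Path


def expected_embedding_files(emb_dir):
    """List the files implied by emb_info.json, grouped by node type.

    Hypothetical helper for illustration only; not part of GraphStorm.
    Returned paths are relative to emb_dir.
    """
    info = json.loads((Path(emb_dir) / "emb_info.json").read_text())
    files = {}
    for ntype in info["emb_name"]:
        files[ntype] = [
            f"{ntype}/{prefix}-{part:05d}.pt"
            for part in range(info["world_size"])
            for prefix in ("embed_nids", "embed")
        ]
    return files
```

Comparing this list against what is actually on disk is one simple way to notice an incomplete save before loading anything.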

**Note: The built-in GraphStorm training or inference pipeline
(launched by GraphStorm CLI) will process the saved node embeddings
to convert the integer node ids into the raw node ids, which are usually string node ids.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<output-remapping>`.

.. _gs-output-predictions:

Saved Prediction Results
------------------------
When ``save_prediction_path`` is provided in the inference config,
GraphStorm will save the prediction results in the corresponding path.
For node prediction tasks, the prediction results are saved per node type.
GraphStorm will also save a ``result_info.json`` file, which contains all
the metadata for the saved prediction results. The contents of ``save_prediction_path``
will look like the following:

.. code-block:: bash

    prediction_dir/
        ntype0/
            predict-00000.pt
            predict-00001.pt
            ...
            predict_nids-00000.pt
            predict_nids-00001.pt
            ...
        ntype1/
            predict-00000.pt
            predict-00001.pt
            ...
            predict_nids-00000.pt
            predict_nids-00001.pt
            ...
        ...
        result_info.json

The ``predict_nids-*`` files store the integer node ID of each prediction result, and
the ``predict-*`` files store the corresponding prediction results.
The contents of the ``predict_nids-*`` and ``predict-*`` files look like:

.. code-block::

    predict_nids-00000.pt  |   predict-00000.pt
                           |
    Graph Node ID          |   Prediction results
    10                     |   0.112
    1                      |   0.872
    23                     |   0.472
    ...

The ``result_info.json`` file stores three pieces of information:

* ``format``: The format of the saved prediction results. By default, it is ``pytorch``.
* ``emb_name``: A list of node types that have node prediction results saved. For example: ``["ntype0", "ntype1"]``.
* ``world_size``: The number of chunks (files) into which the prediction results of a particular node type are divided. For instance, if ``world_size`` is set to 8, there will be 8 files for each set of prediction results.


For edge prediction tasks, the prediction results are saved per edge type.
The sub-directory for an edge type is named ``<src_ntype>_<relation_type>_<dst_ntype>``.
For instance, given an edge type ``("movie","rated-by","user")``, the corresponding
sub-directory is named ``movie_rated-by_user``.
GraphStorm will also save a ``result_info.json`` file, which contains all
the metadata for the saved prediction results. The contents of ``save_prediction_path``
will look like the following:

.. code-block:: bash

    prediction_dir/
        etype0/
            predict-00000.pt
            predict-00001.pt
            ...
            src_nids-00000.pt
            src_nids-00001.pt
            ...
            dst_nids-00000.pt
            dst_nids-00001.pt
            ...
        etype1/
            predict-00000.pt
            predict-00001.pt
            ...
            src_nids-00000.pt
            src_nids-00001.pt
            ...
            dst_nids-00000.pt
            dst_nids-00001.pt
            ...
        ...
        result_info.json
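
The sub-directory naming rule described above can be sketched as follows. Both function names are hypothetical. Because node type and relation names may themselves contain underscores, splitting a directory name back into a triple is ambiguous in general, so the sketch keeps a mapping from directory name to edge type instead of parsing names.

```python
def etype_dir_name(etype):
    """Join a (src_ntype, relation_type, dst_ntype) triple with underscores."""
    src_ntype, relation_type, dst_ntype = etype
    return f"{src_ntype}_{relation_type}_{dst_ntype}"


def etype_dirs(etypes):
    """Map each expected sub-directory name back to its edge type triple."""
    return {etype_dir_name(etype): tuple(etype) for etype in etypes}


print(etype_dir_name(("movie", "rated-by", "user")))
```

The mapping can be built from any list of edge type triples, for example one read from the metadata file, and then used to look up which edge type a given sub-directory belongs to.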

The ``src_nids-*`` and ``dst_nids-*`` files contain the integer node IDs of
the source and destination nodes of each prediction, respectively.
The ``predict-*`` files store the corresponding prediction results.
The contents of the ``src_nids-*``, ``dst_nids-*`` and ``predict-*`` files look like:

.. code-block::

    src_nids-00000.pt  |   dst_nids-00000.pt     |   predict-00000.pt
                       |                         |
    Source Node ID     |   Destination Node ID   |   Prediction results
    10                 |   12                    |   0.112
    1                  |   20                    |   0.872
    23                 |   3                     |   0.472
    ...

The ``result_info.json`` file stores three pieces of information:

* ``format``: The format of the saved prediction results. By default, it is ``pytorch``.
* ``etypes``: A list of edge types that have edge prediction results saved. For example: ``[("movie","rated-by","user"), ("user","watched","movie")]``.
* ``world_size``: The number of chunks (files) into which the prediction results of a particular edge type are divided. For instance, if ``world_size`` is set to 8, there will be 8 files for each set of prediction results.

**Note: The built-in GraphStorm inference pipeline
(launched by GraphStorm CLI) will process the saved prediction results
to convert the integer node ids into the raw node ids, which are usually string node ids.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<output-remapping>`.
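
For the ``.pt`` layout described in this section (that is, files that still carry integer node IDs), the edge prediction chunks of one edge type could be gathered roughly as below. This is a sketch under assumptions: the helper names are hypothetical, each file is assumed to hold one tensor written with ``torch.save``, and part indices are assumed to run from ``0`` to ``world_size - 1`` with five-digit padding as in the listing.

```python
from pathlib import Path


def edge_part_paths(prediction_dir, etype_dir, world_size):
    """Yield (src_nids, dst_nids, predict) paths for every part of an edge type."""
    base = Path(prediction_dir) / etype_dir
    for part in range(world_size):
        yield tuple(base / f"{name}-{part:05d}.pt"
                    for name in ("src_nids", "dst_nids", "predict"))


def load_edge_predictions(prediction_dir, etype_dir, world_size):
    """Concatenate all parts into three aligned tensors (src, dst, prediction)."""
    import torch  # imported lazily so the path helper works without PyTorch

    src, dst, pred = [], [], []
    for src_path, dst_path, pred_path in edge_part_paths(
            prediction_dir, etype_dir, world_size):
        src.append(torch.load(src_path, map_location="cpu"))
        dst.append(torch.load(dst_path, map_location="cpu"))
        pred.append(torch.load(pred_path, map_location="cpu"))
    return torch.cat(src), torch.cat(dst), torch.cat(pred)
```

Row ``i`` of the three returned tensors then describes one edge: its source node ID, its destination node ID, and the prediction for that edge.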