[Doc Fix] Sync doc for Remap (#738)
*Issue #, if available:*

*Description of changes:*

Align the doc about raw_id_mappings change in this PR:
#641

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
jalencato authored Feb 15, 2024
1 parent 113a388 commit 858e4e0
Showing 3 changed files with 12 additions and 7 deletions.
2 changes: 1 addition & 1 deletion docs/source/configuration/configuration-gconstruction.rst
@@ -12,7 +12,7 @@ Graph Construction
* **-\-num-processes-for-edges**: the number of processes to process edge data simultaneously. Increasing this number can speed up edge data processing.
* **-\-output-dir**: (**Required**) the path of the output data files.
* **-\-graph-name**: (**Required**) the name assigned for the graph.
* **-\-remap-node_id**: boolean value to decide whether to rename node IDs or not. Default is true.
* **-\-remap-node-id**: boolean value to decide whether to rename node IDs or not. Default is true.
* **-\-add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Default is true.
* **-\-output-format**: the format of the constructed graph. Options are ``DGL`` and ``DistDGL``. Default is ``DistDGL``. It also accepts multiple graph formats at the same time, separated by a space, for example ``--output-format "DGL DistDGL"``. The output format is explained in the :ref:`Output <output-format>` section below.
* **-\-num-parts**: the number of partitions of the constructed graph. This is only valid if the output format is ``DistDGL``.
4 changes: 3 additions & 1 deletion docs/source/scale/sagemaker.rst
@@ -184,6 +184,7 @@ Users can use the following command to launch a GraphStorm Link Prediction infer
--graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
--yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
--model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
--raw-node-mappings-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p/raw_id_mappings \
--output-emb-s3 s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/ \
--output-prediction-s3 s3://<PATH_TO_SAVE_PREDICTION_RESULTS> \
--graph-name ogbn-mag \
@@ -196,7 +197,8 @@ Users can use the following command to launch a GraphStorm Link Prediction infer
.. note::

Diffferent from the training command's argument, in the inference command, the value of the ``--model-artifact-s3`` argument needs to be path to a saved model. By default, it is stored under an S3 path with specific training epoch or epoch plus iteration number, e.g., ``s3://models/epoch-0-iter-999``, where the trained model artifacts were saved.
* Unlike in the training command, the value of the ``--model-artifact-s3`` argument in the inference command needs to be the path to a saved model. By default, it is stored under an S3 path with a specific training epoch or epoch plus iteration number, e.g., ``s3://models/epoch-0-iter-999``, where the trained model artifacts were saved.
* If ``--raw-node-mappings-s3`` is not provided, it defaults to ``{graph-data-s3}/raw_id_mappings``. The expected mapping files are ``node_mapping.pt``, ``edge_mapping.pt``, and the parquet files under ``raw_id_mappings``. They record the mapping between the original node and edge ids in the raw data files and the ids of nodes and edges in the Graph Node ID space. These files are created during graph construction by either GConstruct or GSProcessing.
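
The default described for ``--raw-node-mappings-s3`` can be sketched as a small path-resolution helper. This is only an illustration of the documented fallback rule: ``resolve_raw_id_mappings`` and the bucket name are hypothetical and are not part of GraphStorm's API.

```python
def resolve_raw_id_mappings(graph_data_s3, raw_node_mappings_s3=None):
    """Return the S3 prefix that should hold the raw ID mapping files.

    Hypothetical helper: mirrors the documented rule that the mappings
    location falls back to ``{graph-data-s3}/raw_id_mappings``.
    """
    if raw_node_mappings_s3:
        # An explicit value always wins over the default.
        return raw_node_mappings_s3.rstrip("/")
    # Fall back to the raw_id_mappings folder next to the graph data.
    return graph_data_s3.rstrip("/") + "/raw_id_mappings"


# "my-bucket" is a placeholder bucket name.
print(resolve_raw_id_mappings("s3://my-bucket/ogbn_mag_lp_3p/"))
# prints: s3://my-bucket/ogbn_mag_lp_3p/raw_id_mappings
```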

As the outcome of the inference command, the generated node embeddings will be uploaded to ``s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/``. For node classification/regression or edge classification/regression tasks, users can use ``--output-prediction-s3`` to specify where the prediction results are saved.

13 changes: 8 additions & 5 deletions docs/source/tutorials/own-data.rst
@@ -211,25 +211,28 @@ The above command reads in the JSON file, and matchs its contents with the node
/tmp/acm_gs
acm.json
author_id_remap.parquet
edge_label_stats.json
edge_label_stats.json
edge_mapping.pt
node_label_stats.json
node_mapping.pt
paper_id_remap.parquet
|- part0
edge_feat.dgl
graph.dgl
node_feat.dgl
subject_id_remap.parquet
|- raw_id_mappings
|- author
part-00000.parquet
|- paper
part-00000.parquet
|- subject
part-00000.parquet
Because the above command specifies the ``--num-parts`` to be ``1``, there is only one partition created, which is saved in the ``part0`` folder. These files become the inputs of GraphStorm's launch scripts.
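
Before passing the output folder to a launch script, it can be useful to confirm that the mapping files from the listing above are present. The following is a minimal sketch assuming the layout shown in that listing. ``missing_mapping_files`` is a hypothetical helper, not a GraphStorm function.

```python
from pathlib import Path


def missing_mapping_files(output_dir, node_types):
    """List the mapping files that are absent from a construction output folder.

    Hypothetical check based on the listing above: two ``.pt`` files at the
    top level and at least one parquet file per node type under
    ``raw_id_mappings``.
    """
    root = Path(output_dir)
    missing = [
        name
        for name in ("node_mapping.pt", "edge_mapping.pt")
        if not (root / name).is_file()
    ]
    for ntype in node_types:
        ntype_dir = root / "raw_id_mappings" / ntype
        # A node type counts as present if its folder has any parquet file.
        if not (ntype_dir.is_dir() and any(ntype_dir.glob("*.parquet"))):
            missing.append(f"raw_id_mappings/{ntype}/*.parquet")
    return missing


# Node types taken from the ACM example; an empty list means nothing is missing.
print(missing_mapping_files("/tmp/acm_gs", ["author", "paper", "subject"]))
```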

.. note::

- Because the parquet format has some limitations, such as supporting only 2 billion elements in a column, we suggest that users use the HDF5 format for very large datasets.
- The two mapping files, ``node_mapping.pt`` and ``edge_mapping.pt``, are used to record the mapping between the ogriginal node and edge ids in the raw data files and the ids of nodes and edges in the Graph Node ID space. They are important for mapping the training and inference outputs back to the Raw Node ID space in the original input data. Therefore, **DO NOT** move or delete them.
- The mapping files, ``node_mapping.pt``, ``edge_mapping.pt`` and the files under ``raw_id_mappings``, are used to record the mapping between the original node and edge ids in the raw data files and the ids of nodes and edges in the Graph Node ID space. They are important for mapping the training and inference outputs back to the Raw Node ID space in the original input data. Therefore, **DO NOT** move or delete them.
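
To illustrate why these mapping files matter, the sketch below maps ids from the Graph Node ID space back to the Raw Node ID space. It is written in plain Python under stated assumptions: the ``(raw_id, graph_id)`` pairs stand in for rows that would be read from a mapping file, and the sample ids and both helper functions are hypothetical.

```python
def build_reverse_map(pairs):
    """Build a graph-id -> raw-id lookup from (raw_id, graph_id) pairs."""
    reverse = {}
    for raw_id, graph_id in pairs:
        if graph_id in reverse:
            # Each graph id must correspond to exactly one raw id.
            raise ValueError(f"duplicate graph id: {graph_id}")
        reverse[graph_id] = raw_id
    return reverse


def to_raw_ids(graph_ids, reverse):
    """Translate graph ids (e.g. from prediction outputs) into raw ids."""
    return [reverse[g] for g in graph_ids]


# Made-up sample rows for a "paper" node type.
paper_pairs = [("p-100", 0), ("p-205", 1), ("p-317", 2)]
reverse = build_reverse_map(paper_pairs)
print(to_raw_ids([2, 0], reverse))  # prints: ['p-317', 'p-100']
```

If the mapping files are moved or deleted, no such lookup can be rebuilt, so outputs expressed in graph ids can no longer be tied to the original records.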

.. _option-2:
