[Doc Fix] Sync doc for Remap (#738)

*Issue #, if available:*

*Description of changes:* Align the docs with the raw_id_mappings change made in PR #641.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
awslabs · Feb 15, 2024 · 858e4e0
1 parent 113a388
commit 858e4e0

Showing 3 changed files with 12 additions and 7 deletions.
diff --git a/docs/source/configuration/configuration-gconstruction.rst b/docs/source/configuration/configuration-gconstruction.rst
@@ -12,7 +12,7 @@ Graph Construction
 * **-\-num-processes-for-edges**: the number of processes to process edge data simulteneously. Increase this number can speed up edge data processing.
 * **-\-output-dir**: (**Required**) the path of the output data files.
 * **-\-graph-name**: (**Required**) the name assigned for the graph.
-* **-\-remap-node_id**: boolean value to decide whether to rename node IDs or not. Default is true.
+* **-\-remap-node-id**: boolean value to decide whether to rename node IDs or not. Default is true.
 * **-\-add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Default is true.
 * **-\-output-format**: the format of constructed graph, options are ``DGL``,  ``DistDGL``.  Default is ``DistDGL``. It also accepts multiple graph formats at the same time separated by an space, for example ``--output-format "DGL DistDGL"``. The output format is explained in the :ref:`Output <output-format>` section below.
 * **-\-num-parts**: the number of partitions of the constructed graph. This is only valid if the output format is ``DistDGL``.

diff --git a/docs/source/scale/sagemaker.rst b/docs/source/scale/sagemaker.rst
@@ -184,6 +184,7 @@ Users can use the following command to launch a GraphStorm Link Prediction infer
             --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
             --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
             --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
+            --raw-node-mappings-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p/raw_id_mappings \
             --output-emb-s3 s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/ \
             --output-prediction-s3 s3://<PATH_TO_SAVE_PREDICTION_RESULTS> \
             --graph-name ogbn-mag \
@@ -196,7 +197,8 @@ Users can use the following command to launch a GraphStorm Link Prediction infer
 
 .. note::
 
-    Diffferent from the training command's argument, in the inference command, the value of the ``--model-artifact-s3`` argument needs to be path to a saved model. By default, it is stored under an S3 path with specific training epoch or epoch plus iteration number, e.g., ``s3://models/epoch-0-iter-999``, where the trained model artifacts were saved.
+    * Different from the training command's argument, in the inference command, the value of the ``--model-artifact-s3`` argument needs to be path to a saved model. By default, it is stored under an S3 path with specific training epoch or epoch plus iteration number, e.g., ``s3://models/epoch-0-iter-999``, where the trained model artifacts were saved.
+    * If ``--raw-node-mappings-s3`` is not provided, it will be default to the ``{graph-data-s3}/raw_id_mappings``. The expected graph mappings files should be ``node_mapping.pt``, ``edge_mapping.pt`` and parquet files under ``raw_id_mappings``. They record the mapping between original node and edge ids in the raw data files and the ids of nodes and edges in the Graph Node ID space. These files are created during graph construction by either GConstruct or GSProcessing.
 
 As the outcomes of the inference command, the generated node embeddings will be uploaded to ``s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/``. For node classification/regression or edge classification/regression tasks, users can use ``--output-prediction-s3`` to specify the saving locations of prediction results.
 

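Reviewer note: the new bullet in the ``sagemaker.rst`` hunk says that an omitted ``--raw-node-mappings-s3`` falls back to ``{graph-data-s3}/raw_id_mappings``. A minimal sketch of that fallback rule is below. The function name ``resolve_raw_mappings_uri`` is hypothetical and is not part of GraphStorm, and stripping a trailing slash before joining is an assumption made only for this illustration.

```python
from typing import Optional


def resolve_raw_mappings_uri(graph_data_s3: str,
                             raw_node_mappings_s3: Optional[str] = None) -> str:
    """Return the S3 prefix that holds the raw ID mapping files.

    Hypothetical helper: mirrors the documented default, not GraphStorm's code.
    """
    if raw_node_mappings_s3:
        # An explicit --raw-node-mappings-s3 value always wins.
        return raw_node_mappings_s3
    # Documented default: {graph-data-s3}/raw_id_mappings.
    # Assumption: a trailing slash on the graph data path is ignored.
    return graph_data_s3.rstrip("/") + "/raw_id_mappings"


if __name__ == "__main__":
    print(resolve_raw_mappings_uri("s3://<PATH_TO_DATA>/ogbn_mag_lp_3p"))
```

With the placeholder path from the inference command, the fallback yields the same value the command passes explicitly, so the added flag in the example is redundant but harmless.
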
diff --git a/docs/source/tutorials/own-data.rst b/docs/source/tutorials/own-data.rst
@@ -211,25 +211,28 @@ The above command reads in the JSON file, and matchs its contents with the node
 
     /tmp/acm_gs
     acm.json
-    author_id_remap.parquet
-    edge_label_stats.json
     edge_label_stats.json
     edge_mapping.pt
     node_label_stats.json
     node_mapping.pt
-    paper_id_remap.parquet
     |- part0
         edge_feat.dgl
         graph.dgl
         node_feat.dgl
-    subject_id_remap.parquet
+    |- raw_id_mappings
+        |- author
+            part-00000.parquet
+        |- paper
+            part-00000.parquet
+        |- subject
+            part-00000.parquet
 
 Because the above command specifies the ``--num-parts`` to be ``1``, there is only one partition created, which is saved in the ``part0`` folder. These files become the inputs of GraphStorm's launch scripts.
 
 .. note::
 
     - Because the parquet format has some limitations, such as only supporting 2 billion elements in a column, etc, we suggest users to use HDF5 format for very large datasets.
-    - The two mapping files, ``node_mapping.pt`` and ``edge_mapping.pt``, are used to record the mapping between the ogriginal node and edge ids in the raw data files and the ids of nodes and edges in the Graph Node ID space. They are important for mapping the training and inference outputs back to the Raw Node ID space in the original input data. Therefore, **DO NOT** move or delete them.
+    - The mapping files, ``node_mapping.pt``, ``edge_mapping.pt`` and the files under ``raw_id_mappings``, are used to record the mapping between the original node and edge ids in the raw data files and the ids of nodes and edges in the Graph Node ID space. They are important for mapping the training and inference outputs back to the Raw Node ID space in the original input data. Therefore, **DO NOT** move or delete them.
 
 .. _option-2:
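
Reviewer note: the updated note in ``own-data.rst`` warns users not to move or delete ``node_mapping.pt``, ``edge_mapping.pt``, or the files under ``raw_id_mappings``. A rough sketch of a pre-flight check for that layout follows. The helper ``missing_mapping_files`` is hypothetical, not a GraphStorm API, and it only assumes the directory tree shown in the diff above (one sub-folder per node type containing at least one ``.parquet`` file).

```python
from pathlib import Path
from typing import Iterable, List


def missing_mapping_files(output_dir: str, node_types: Iterable[str]) -> List[str]:
    """List the mapping files that are absent from a graph construction output.

    Hypothetical helper based only on the directory tree shown in the docs.
    """
    root = Path(output_dir)
    missing = []
    # Top-level mapping files written next to the partition folders.
    for name in ("node_mapping.pt", "edge_mapping.pt"):
        if not (root / name).is_file():
            missing.append(name)
    # One folder per node type, each with at least one parquet file.
    for ntype in node_types:
        ntype_dir = root / "raw_id_mappings" / ntype
        if not (ntype_dir.is_dir() and any(ntype_dir.glob("*.parquet"))):
            missing.append(f"raw_id_mappings/{ntype}/*.parquet")
    return missing


if __name__ == "__main__":
    # Node types taken from the ACM example in the tutorial.
    print(missing_mapping_files("/tmp/acm_gs", ["author", "paper", "subject"]))
```

An empty list means every mapping file the note refers to is still in place; any entry names a file or pattern that could not be found.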