diff --git a/graphstorm-processing/docs/source/developer/developer-guide.rst b/graphstorm-processing/docs/source/developer/developer-guide.rst index 45d2e3ecf0..1a7faf85db 100644 --- a/graphstorm-processing/docs/source/developer/developer-guide.rst +++ b/graphstorm-processing/docs/source/developer/developer-guide.rst @@ -6,8 +6,8 @@ jump into the project. The steps we recommend are: -Install JDK 8, 11 or 17 -~~~~~~~~~~~~~~~~~~~~~~~ +Install JDK 8, 11 +~~~~~~~~~~~~~~~~~ PySpark requires a compatible Java installation to run, so you will need to ensure your active JDK is using either @@ -33,7 +33,7 @@ On Amazon Linux 2 you can use: sudo yum install java-11-amazon-corretto-headless sudo yum install java-11-amazon-corretto-devel -Install pyenv +Install ``pyenv`` ~~~~~~~~~~~~~ ``pyenv`` is a tool to manage multiple Python version installations. It @@ -50,13 +50,14 @@ or use ``brew`` on a Mac: brew update brew install pyenv -For more info on ``pyenv`` see https://github.com/pyenv/pyenv +For more info on ``pyenv`` see `its documentation. ` Create a Python 3.9 env and activate it. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We use Python 3.9 in our images so this most closely resembles the -execution environment on SageMaker. +execution environment on our Docker images that will be used for distributed +training. .. code-block:: bash @@ -65,12 +66,12 @@ execution environment on SageMaker. .. - Note: We recommend not mixing up conda and pyenv. When developing for + Note: We recommend not mixing up ``conda`` and ``pyenv``. When developing for this project, simply ``conda deactivate`` until there's no ``conda`` - env active (even ``base``) and just rely on pyenv+poetry to handle + env active (even ``base``) and just rely on ``pyenv`` and ``poetry`` to handle dependencies. -Install poetry +Install ``poetry`` ~~~~~~~~~~~~~~ ``poetry`` is a dependency and build management system for Python. To install it @@ -80,7 +81,7 @@ use: curl -sSL https://install.python-poetry.org | python3 - -Install dependencies through poetry +Install dependencies through ``poetry`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now we are ready to install our dependencies through ``poetry``. @@ -111,13 +112,12 @@ You can also activate and use the virtual environment using: # We're now using the graphstorm-processing-py3.9 env so we can just run pytest ./graphstorm-processing/tests -To learn more about poetry see: -https://python-poetry.org/docs/basic-usage/ +To learn more about ``poetry`` see its `documentation `_ -Use ``black`` to format code -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Use ``black`` to format code [optional] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We use `black `__ to +We use `black `_ to format code in this project. ``black`` is an opinionated formatter that helps speed up development and code reviews. It is included in our ``dev`` dependencies so it will be installed along with the other dev @@ -148,10 +148,8 @@ We include the ``mypy`` and ``pylint`` linters as a dependency under the ``dev`` of dependencies. These linters perform static checks on your code and can be used in a complimentary manner. -We recommend using VSCode and enabling the mypy linter to get in-editor -annotations: - -https://code.visualstudio.com/docs/python/linting#_general-settings +We recommend `using VSCode and enabling the mypy linter `_ +to get in-editor annotations. 
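One way to get those in-editor annotations, assuming you use VS Code's ``code`` command line and Microsoft's Python extensions (adjust to your own editor setup), is to install the relevant extensions from a terminal:

.. code-block:: bash

    # Assumed setup: VS Code with the `code` CLI on your PATH.
    # Install the official mypy and pylint extensions for in-editor checks.
    code --install-extension ms-python.mypy-type-checker
    code --install-extension ms-python.pylint

Launching the editor from within the project's environment (``poetry shell`` followed by ``code .``) typically lets the extensions resolve the project's dependencies for their checks.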
 You can also lint the project code through:
@@ -159,8 +157,9 @@ You can also lint the project code through:
 
     poetry run mypy ./graphstorm_processing
 
-To learn more about ``mypy`` and how it can help development see:
-https://mypy.readthedocs.io/en/stable/
+To learn more about ``mypy`` and how it can help development,
+`see its documentation <https://mypy.readthedocs.io/en/stable/>`_.
+
 
 Our goal is to minimize ``mypy`` errors as much as possible for the
 project. New code should be linted and not introduce additional mypy
@@ -169,17 +168,21 @@ errors. When necessary it's OK to use ``type: ignore`` to silence
 
 As a project, GraphStorm requires a 10/10 pylint score, so ensure your
 code conforms to the expectation by running
-`pylint --rcfile=/path/to/graphstorm/tests/lint/pylintrc` .
+
+.. code-block:: bash
+
+    pylint --rcfile=/path/to/graphstorm/tests/lint/pylintrc
+
 on your code before commits. To make this easier we include
 a pre-commit hook below.
 
-Use a pre-commit hook to ensure black and pylint runs before commits
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Use a pre-commit hook to ensure ``black`` and ``pylint`` run before commits
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-To make code formatting and pylint checks easier for graphstorm-processing
-developers we recommend using a pre-commit hook.
+To make code formatting and ``pylint`` checks easier for graphstorm-processing
+developers, we recommend using a pre-commit hook.
 
-We include ``pre-commit`` in the project’s ``dev`` dependencies, so once
+We include ``pre-commit`` in the project's ``dev`` dependencies, so once
 you have activated the project's venv (``poetry shell``) you can just
 create a file named ``.pre-commit-config.yaml`` with the following contents:
@@ -213,15 +216,15 @@ And then run:
 
     pre-commit install
 
-which will install the ``black`` hook into your local repository and
-ensure it runs before every commit.
+which will install the ``black`` and ``pylint`` hooks into your local repository and
+ensure they run before every commit.
 
-.. note:: text
+.. note::
 
     The pre-commit hook will also apply to all commits you make to the root
-    GraphStorm repository. Since that one doesn't use ``black``, you might
+    GraphStorm repository. Since GraphStorm itself doesn't use ``black``, you might
     want to remove the hooks. You can do so from the root repo using ``rm -rf .git/hooks``.
 
-    Both project use ``pylint`` to check Python files so we'd still recommend using
+    Both projects use ``pylint`` to check Python files so we'd still recommend using
     that hook even if you're doing development for both GSProcessing and GraphStorm.
 
diff --git a/graphstorm-processing/docs/source/developer/input-configuration.rst b/graphstorm-processing/docs/source/developer/input-configuration.rst
index 3f42c1a694..e6e2d7ae98 100644
--- a/graphstorm-processing/docs/source/developer/input-configuration.rst
+++ b/graphstorm-processing/docs/source/developer/input-configuration.rst
@@ -7,14 +7,18 @@ GraphStorm Processing uses a JSON configuration file to parse
 and process the data into the format needed by GraphStorm
 partitioning and training downstream.
 
-We provide scripts that can convert a ``GConstruct``
-input configuration file into one compatible with
-GraphStorm Processing so users with existing
-``GConstruct`` files can make use of the distributed
-processing capabilities of GraphStorm Processing
-to scale up their graph processing.
+We use this configuration format as an intermediate
+representation between other config formats, such as the one used
+by the single-machine GConstruct module.
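As a rough sketch of what this means in practice (the ``gs-processing`` entry point and its flags are the ones shown in the usage examples; the paths and file names here are illustrative), the same job can consume either configuration format:

.. code-block:: bash

    # Illustrative paths/file names: run with an existing GConstruct config,
    # which GSProcessing converts automatically
    gs-processing --input-data /path/to/data \
        --config-filename gconstruct-config.json

    # Or run with a config already in the GSProcessing format
    gs-processing --input-data /path/to/data \
        --config-filename gsprocessing-config.json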
-The input data configuration has two top-level nodes: +GSProcessing can take a GConstruct-formatted file +directly, and we also provide `a script ` +that can convert a `GConstruct ` +input configuration file into the ``GSProcessing`` format, +although this is mostly aimed at developers, users are +can rely on the automatic conversion. + +The GSProcessing input data configuration has two top-level objects: .. code-block:: json @@ -26,20 +30,20 @@ The input data configuration has two top-level nodes: - ``version`` (String, required): The version of configuration file being used. We include the package name to allow self-contained identification of the file format. - ``graph`` (JSON object, required): one configuration object that defines each - of the nodes and edges that describe the graph. + of the node types and edge types that describe the graph. We describe the ``graph`` object next. ``graph`` configuration object ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The ``graph`` configuration object can have two top-level nodes: +The ``graph`` configuration object can have two top-level objects: .. code-block:: json { - "edges": [...], - "nodes": [...] + "edges": [{}], + "nodes": [{}] } - ``edges``: (array of JSON objects, required). Each JSON object @@ -55,43 +59,41 @@ Contents of an ``edges`` configuration object ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ An ``edges`` configuration object can contain the following top-level -nodes: +objects: .. code-block:: json { "data": { - "format": "parquet" or "csv", - "files": [>], - "separator": String + "format": "String", + "files": ["String"], + "separator": "String" }, - "source" : {"column": String, "type": String}, - "relation" : {"column": String, "type": String}, - "destination" : ["column": String, "type": String], + "source": {"column": "String", "type": "String"}, + "relation": {"type": "String"}, + "destination": {"column": "String", "type": "String"}, "labels" : [ - { - "column": String, "type": String, - "split_rate": { - "train": Float, - "val": Float, - "test": Float - } - }, - ...] - "features": [{feature_object}, ...] + { + "column": "String", + "type": "String", + "split_rate": { + "train": "Float", + "val": "Float", + "test": "Float" + } + }, + ] + "features": [{}] } - ``data`` (JSON Object, required): Describes the physical files that store the data described in this object. The JSON object has two - top level nodes: + top level objects: - ``format`` (String, required): indicates the format the data is stored in. We accept either ``"csv"`` or ``"parquet"`` as valid file formats. - - We will add support for JSON input as P1 feature at a later - point. - - ``files`` (array of String, required): the physical location of files. The format accepts two options: @@ -101,7 +103,7 @@ nodes: - e.g. ``"files": ['path/to/edge/type/']`` - This option allows for concise listing of entire types and - would be preferred. All the files under the path will be loaded. + would be preferred. All the files under the path will be loaded. - a multi-element list of **relative** file paths. @@ -134,27 +136,16 @@ nodes: ``{“column: String, and ”type“: String}``. - ``relation``: (JSON object, required): Describes the relation modeled by the edges. A relation can be common among all edges, or it - can have sub-types. The top-level keys for the object are: + can have sub-types. The top-level objects for the object are: - ``type`` (String, required): The type of the relation described by the edges. 
For example, for a source type ``user``, destination ``movie`` we can have a relation type ``interacted_with`` for an edge type ``user:interacted_with:movie``. - - ``column`` (String, optional): If present this column determines - the type of sub-relation described by the edge, breaking up the - edge type into further sub-types. - - - For - ``"type": "interacted_with", "column": "interaction_kind"``, we - might have the values ``watched``, ``rated``, ``shared`` in the - ``interaction_kind`` column, leading to fully qualified edge - types: ``user:interacted_with-watched:movie``, - ``user:interacted_with-rated:movie, user:interacted_with-shared:movie`` - . - ``labels`` (List of JSON objects, optional): Describes the label - for the current edge type. The label object is has the following - top-level keys: + for the current edge type. The label object has the following + top-level objects: - ``column`` (String, required): The column that contains the values for the label. Should be the empty string, ``""`` if the ``type`` @@ -174,7 +165,8 @@ nodes: tasks, this separator is used within the column to list multiple classification labels in one entry. - ``split_rate`` (JSON object, optional): Defines a split rate - for the label items + for the label items. The sum of the values for ``train``, ``val`` and + ``test`` needs to be 1.0. - ``train``: The percentage of the data with available labels to assign to the train set (0.0, 1.0]. @@ -184,8 +176,7 @@ nodes: assign to the train set [0.0, 1.0). - ``features`` (List of JSON objects, optional)\ **:** Describes - the set of features for the current edge type. See the **Contents of - a ``features`` configuration object** section for details. + the set of features for the current edge type. See the :ref:`features-object` section for details. -------------- @@ -197,26 +188,28 @@ following top-level keys: .. code-block:: json - { - "data": { - "format": "parquet" or "csv", - "files": [String], - "separator": String - }, - "column" : String, - "type" : String, - "labels" : [{ - "column": String, - "type": String, - "separator": String, - "split_rate": { - "train": Float, - "val": Float, - "test": Float - },...] - }, - "features": [{feature_object}, ...] - } + { + "data": { + "format": "String", + "files": ["String"], + "separator": "String" + }, + "column" : "String", + "type" : "String", + "labels" : [ + { + "column": "String", + "type": "String", + "separator": "String", + "split_rate": { + "train": "Float", + "val": "Float", + "test": "Float" + } + } + ], + "features": [{}] + } - ``data``: (JSON object, required): Has the same definition as for the edges object, with one top-level key for the ``format`` that @@ -236,18 +229,21 @@ following top-level keys: - ``type`` (String, required): Specifies that target task type which can be: - - ``"classification"``: A node classification task. The values in the specified ``column`` as treated as categorical variables. - - ``"regression"``: A node regression task. + - ``"classification"``: A node classification task. The values in the specified + ``column`` are treated as categorical variables. + - ``"regression"``: A node regression task. The values in the specified + ``column`` are treated as float values. - ``separator`` (String, optional): For multi-label classification tasks, this separator is used within the column to list multiple classification labels in one entry. - - e.g. with separator ``|`` we can have ``action|comedy`` as a + - e.g. with separator ``|`` we can have ``action|comedy`` as a label value. 
- ``split_rate`` (JSON object, optional): Defines a split rate - for the label items + for the label items. The sum of the values for ``train``, ``val`` and + ``test`` needs to be 1.0. - ``train``: The percentage of the data with available labels to assign to the train set (0.0, 1.0]. @@ -257,11 +253,13 @@ following top-level keys: assign to the train set [0.0, 1.0). - ``features`` (List of JSON objects, optional): Describes - the set of features for the current edge type. See the **Contents of - a ``features`` configuration object** section for details. + the set of features for the current edge type. See the next section, :ref:`features-object` + for details. -------------- +.. _features-object: + Contents of a ``features`` configuration object ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -270,46 +268,88 @@ can contain the following top-level keys: .. code-block:: json - { - "column": String, - "name": String, - "transformation": { - "name": String, - "kwargs": { - "arg_name": "", - [...] - } - }, - "data": { - "format": "parquet" or "csv", - "files": [>], - "separator": String - } - } + { + "column": "String", + "name": "String", + "transformation": { + "name": "String", + "kwargs": { + "arg_name": "" + } + }, + "data": { + "format": "String", + "files": ["String"], + "separator": "String" + } + } - ``column`` (String, required): The column that contains the raw feature values in the dataset - ``transformation`` (JSON object, optional): The type of transformation that will be applied to the feature. For details on - the individual transformations supported see the Section **Supported - transformations.** If this key is missing, the feature is treated as + the individual transformations supported see :ref:`supported-transformations`. + If this key is missing, the feature is treated as a **no-op** feature without ``kwargs``. - ``name`` (String, required): The name of the transformation to be applied. - ``kwargs`` (JSON object, optional): A dictionary of parameter names and values. Each individual transformation will have its own - supported parameters, described in **Supported transformations.** + supported parameters, described in :ref:`supported-transformations`. - ``name`` (String, optional): The name that will be given to the encoded feature. If not given, **column** is used as the output name. - ``data`` (JSON object, optional): If the data for the feature exist in a file source that's different from the rest of the data of - the node/edge type, they are provided here. **The file source needs + the node/edge type, they are provided here. For example, you could + have each feature in one file source each: + + .. 
code-block:: python + + # Example node config with multiple features + { + # This is where the node structure data exist just need an id col + "data": { + "format": "parquet", + "files": ["path/to/node_ids"] + }, + "column" : "node_id", + "type" : "my_node_type", + "features": [ + # Feature 1 + { + "column": "feature_one", + # The files contain one "node_id" col and one "feature_one" col + "data": { + "format": "parquet", + "files": ["path/to/feature_one/"] + } + }, + # Feature 2 + { + "column": "feature_two", + # The files contain one "node_id" col and one "feature_two" col + "data": { + "format": "parquet", + "files": ["path/to/feature_two/"] + } + } + ] + } + + + **The file source needs to contain the column names of the parent node/edge type to allow a - 1-1 mapping between the structure and feature files.** For nodes the - node_id column suffices, for edges we need both the source and - destination columns to use as a composite key. + 1-1 mapping between the structure and feature files.** + + For nodes the + the feature files need to have one column named with the node id column + name, (the value of ``"column"`` for the parent node type), + for edges we need both the ``source`` and + ``destination`` columns to use as a composite key. + +.. _supported-transformations: Supported transformations ~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -344,47 +384,47 @@ OAG-Paper dataset .. code-block:: json - { - "version" : "gsprocessing-v1.0", - "graph" : { - "edges" : [ - { - "data": { - "format": "csv", - "files": [ - "edges.csv" - ] - }, - "separator": ",", - "source": {"column": "~from", "type": "paper"}, - "dest": {"column": "~to", "type": "paper"}, - "relation": {"type": "cites"} - } - ], - "nodes" : [ - { - "data": { - "format": "csv", - "files": [ - "node_feat.csv" - ] - }, - "separator": ",", - "type": "paper", - "column": "ID", - "labels": [ - { - "column": "field", - "type": "classification", - "separator": ";", - "split_rate": { - "train": 0.7, - "val": 0.1, - "test": 0.2 - } - } - ] - } - ] - } - } + { + "version" : "gsprocessing-v1.0", + "graph" : { + "edges" : [ + { + "data": { + "format": "csv", + "files": [ + "edges.csv" + ], + "separator": "," + }, + "source": {"column": "~from", "type": "paper"}, + "dest": {"column": "~to", "type": "paper"}, + "relation": {"type": "cites"} + } + ], + "nodes" : [ + { + "data": { + "format": "csv", + "separator": ",", + "files": [ + "node_feat.csv" + ] + }, + "type": "paper", + "column": "ID", + "labels": [ + { + "column": "field", + "type": "classification", + "separator": ";", + "split_rate": { + "train": 0.7, + "val": 0.1, + "test": 0.2 + } + } + ] + } + ] + } + } diff --git a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst index 56253fd842..53fe61c922 100644 --- a/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst +++ b/graphstorm-processing/docs/source/usage/amazon-sagemaker.rst @@ -6,7 +6,7 @@ use the Amazon SageMaker launch scripts to launch distributed processing jobs that use AWS resources. To demonstrate the usage of GSProcessing on Amazon SageMaker, we will execute the same job we used in our local -execution example, but this time use Amazon SageMaker to provide the compute instead of our +execution example, but this time use Amazon SageMaker to provide the compute resources instead of our local machine. Upload data to S3 @@ -54,7 +54,7 @@ using larger instances like `ml.r5.24xlarge`. 
 Since we're now executing on AWS, we'll need access to an execution role
 for SageMaker and the ECR image URI we created in :doc:`/usage/distributed-processing-setup`.
 For instructions on how to create an execution role for SageMaker
-see the `AWS SageMaker documentation `.
+see the `AWS SageMaker documentation `_.
 
 Let's set up a small bash script that will run the parametrized processing
 job, followed by the re-partitioning job, both on SageMaker
@@ -113,6 +113,19 @@ job, followed by the re-partitioning job, both on SageMaker
    want to scale up to an instance with more memory to avoid memory errors.
    `ml.r5` instances should allow you to re-partition graph data with billions of nodes
    and edges.
 
+The ``--num-output-files`` parameter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can see that we provided a parameter named
+``--num-output-files`` to ``run_distributed_processing.py``. This is an
+important parameter, as it provides a hint to set the parallelism for Spark.
+
+It can safely be skipped, in which case Spark will decide the proper value based on the cluster's
+instance type and count. If you do set it yourself, a good value to use is
+``num_instances * num_cores_per_instance * 2``, which will ensure good
+utilization of the cluster resources.
+
+
 Examine the output
 ------------------
 
diff --git a/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst b/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst
index caf4606965..785dd5a514 100644
--- a/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst
+++ b/graphstorm-processing/docs/source/usage/distributed-processing-setup.rst
@@ -40,8 +40,8 @@ To get started with building the GraphStorm Processing image you'll need
 to have the Docker engine installed.
 
-To install Docker follow the instructions at:
-https://docs.docker.com/engine/install/
+To install Docker, follow the instructions at the
+`official site <https://docs.docker.com/engine/install/>`_.
 
 Install Poetry
 --------------
@@ -56,8 +56,8 @@ You can install Poetry using:
 
     curl -sSL https://install.python-poetry.org | python3 -
 
-For detailed installation instructions see:
-https://python-poetry.org/docs/
+For detailed installation instructions see the
+`Poetry docs <https://python-poetry.org/docs/>`_.
 
 
 Set up AWS access
 -----------------
@@ -74,8 +74,8 @@ To install the AWS CLI you can use:
     unzip awscliv2.zip
     sudo ./aws/install
 
-To set up credentials for use with ``aws-cli`` see:
-https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-examples
+To set up credentials for use with ``aws-cli`` see the
+`AWS docs <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-examples>`_.
 
 Your role should have full ECR access to be able to pull from ECR to build the image,
 create an ECR repository if it doesn't exist, and push the GSProcessing image to the repository.
 
diff --git a/graphstorm-processing/docs/source/usage/example.rst b/graphstorm-processing/docs/source/usage/example.rst
index 89135d7b1a..bc53d279ae 100644
--- a/graphstorm-processing/docs/source/usage/example.rst
+++ b/graphstorm-processing/docs/source/usage/example.rst
@@ -37,16 +37,18 @@ us to perform the processing and prepare the data for partitioning and training.
 
 The data files are expected to be:
 
-* Tabular data files. We support CSV with header format, or in Parquet format.
-  The files can be partitioned (multiple parts), or a single file.
-* Available on a local filesystem or on S3.
+* Tabular data files. We support CSV-with-header format, or Parquet format.
+  The files can be split into multiple parts, or be a single file.
+* Available on a local file system or on S3.
* One tabular file source per edge and node type. For example, for a particular edge type, all node identifiers (source, destination), features, and labels should exist as columns in a single file source. Apart from the data, GSProcessing also requires a configuration file that describes the -data and the transformations we will need to apply to the features and labels. We support -both the GConstruct configuration format, and the library's own GSProcessing format. +data and the transformations we will need to apply to the features and any encoding needed for +labels. +We support both the `GConstruct configuration format `_ +, and the library's own GSProcessing format, described in :doc:`/developer/input-configuration`. .. note:: We expect end users to only provide a GConstruct configuration file, @@ -85,18 +87,20 @@ For example: The contents of the ``gconstruct-config.json`` can be: -.. code-block:: json +.. code-block:: python { - "edges" : [ - { - "files": ["edges/movie-included_in-genre.csv"], - "format": { - "name": "csv", - "separator" : "," + "edges" : [ + { + # Note that the file is a relative path + "files": ["edges/movie-included_in-genre.csv"], + "format": { + "name": "csv", + "separator" : "," + } + # [...] Other edge config values } - } - ] + ] } Given the above we can run a job with local input data as: @@ -104,7 +108,7 @@ Given the above we can run a job with local input data as: .. code-block:: bash > gs-processing --input-data /home/path/to/data \ - --config-filename gsprocessing-config.json + --config-filename gconstruct-config.json The benefit with using relative paths is that we can move the same files to any location, including S3, and run the same job without making changes to the config @@ -116,13 +120,13 @@ file: > mv /home/path/to/data /home/new-path/to/data # After moving all the files we can still use the same config > gs-processing --input-data /home/new-path/to/data \ - --config-filename gsprocessing-config.json + --config-filename gconstruct-config.json # Upload data to S3 > aws s3 sync /home/new-path/to/data s3://my-bucket/data/ # We can still use the same config, just change the prefix to an S3 path > python run_distributed_processing.py --input-data s3://my-bucket/data \ - --config-filename gsprocessing-config.json + --config-filename gconstruct-config.json Node files are optional ^^^^^^^^^^^^^^^^^^^^^^^ @@ -131,7 +135,7 @@ GSProcessing does not require node files to be provided for every node type. If a node type appears in one of the edges, its unique node identifiers will be determined by the edge files. -In the example GConstruct above, the node ids for the node types +In the example GConstruct file above (`gconstruct-config.json`), the node ids for the node types ``movie`` and ``genre`` will be extracted from the edge list provided. Example data and configuration @@ -151,7 +155,7 @@ and one label, ``gender``, that we transform to prepare the data for a node clas Run a GSProcessing job locally ------------------------------ -While GSProcessing is designed to run on distributed clusters on Amazon SageMaker, +While GSProcessing is designed to run on distributed clusters, we can also run small jobs in a local environment, using a local Spark instance. To do so, we will be using the ``gs-processing`` entry point, @@ -187,7 +191,9 @@ Examining the job output ------------------------ Once the processing and re-partitioning jobs are done, -we can examine the outputs they created. +we can examine the outputs they created. 
The output will be
+compatible with the `Chunked Graph Format of DistDGL `_
+and can be used downstream to create a partitioned graph.
 
 .. code-block:: bash
 
@@ -216,23 +222,36 @@ the graph structure, features, and labels. In more detail:
 
 The directories created contain:
 
 * ``edges``: Contains the edge structures, one sub-directory per edge
-  type.
+  type. Each edge file will contain two columns, the source and destination
+  `numerical` node ids, named ``src_int_id`` and ``dist_int_id`` respectively.
 * ``node_data``: Contains the features for the nodes, one sub-directory
-  per node type.
+  per node type. Each file will contain one column, named after the original
+  feature name, that holds the feature values (these can be scalars or vectors).
 * ``node_id_mappings``: Contains mappings from the original node ids to the
   ones created by the processing job. This mapping would allow you to trace
-  back predictions to the original nodes/edges.
+  back predictions to the original nodes/edges. The files will have two columns:
+  ``node_str_id``, which contains the original string ID of the node, and ``node_int_id``,
+  which contains the numerical id that the string id was mapped to.
 
 If the graph had included edge features they would appear
 in an ``edge_data`` directory.
 
+.. note::
+
+    Files for edges and edge data will have the same order and row counts
+    per file, as expected by DistDGL. Similarly, all node feature files will
+    have the same order and row counts, where the first row corresponds to
+    the feature value for node id 0, the second to node id 1, and so on.
+
+
 At this point you can use the DGL distributed partitioning pipeline
 to partition your data,
 as described in the `DGL documentation `_
 
 To simplify the process of partitioning and training, without the need
 to manage your own infrastructure, we recommend using GraphStorm's
-`SageMaker wrappers `_
+`SageMaker wrappers `_
 that do all the hard work for you and allow you to focus on model development.
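If you want to inspect the processed data yourself before partitioning, a quick sanity check of the output layout could look like the following sketch (the bucket, prefix, and local paths are illustrative):

.. code-block:: bash

    # Illustrative bucket/prefix: copy the processed output locally
    aws s3 sync s3://my-bucket/data/processed/ ./gsprocessing-output/

    # The top-level layout should match the directories described above,
    # e.g. edges/, node_data/ and node_id_mappings/
    ls ./gsprocessing-output/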