Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Add a strict naming rule for graph names #1044

Merged
merged 10 commits into from
Sep 26, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ Full argument list of the ``gconstruct.construct_graph`` command
* **-\-num-processes-for-nodes**: the number of processes to process node data simultaneously. Increase this number can speed up node data processing.
* **-\-num-processes-for-edges**: the number of processes to process edge data simultaneously. Increase this number can speed up edge data processing.
* **-\-output-dir**: (**Required**) the path of the output data files.
* **-\-graph-name**: (**Required**) the name assigned for the graph.
* **-\-graph-name**: (**Required**) the name assigned for the graph. The graph name must adhere to the `Python identifier naming rules<https://docs.python.org/3/reference/lexical_analysis.html#identifiers>`_ with the exception that hyphens (``-``) are permitted and the name can start with numbers.
classicsong marked this conversation as resolved.
Show resolved Hide resolved
* **-\-remap-node-id**: boolean value to decide whether to rename node IDs or not. Adding this argument will set it to be true, otherwise false.
* **-\-add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Adding this argument sets it to true; otherwise, it defaults to false. It is **strongly** suggested to include this argument for graph construction, as some nodes in the original data may not have in-degrees, and thus cannot update their presentations by aggregating messages from their neighbors. Adding this arugment helps prevent this issue.
* **-\-output-format**: the format of constructed graph, options are ``DGL``, ``DistDGL``. Default is ``DistDGL``. It also accepts multiple graph formats at the same time separated by an space, for example ``--output-format "DGL DistDGL"``. The output format is explained in the :ref:`Output <gcon-output-format>` section above.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -388,7 +388,7 @@ The rest of the arguments are passed on to ``sagemaker_train.py`` or ``sagemaker

* **--task-type**: Task type.
* **--graph-data-s3**: S3 location of the input graph.
* **--graph-name**: Name of the input graph.
* **--graph-name**: Name of the input graph. The graph name must adhere to the `Python identifier naming rules<https://docs.python.org/3/reference/lexical_analysis.html#identifiers>`_ with the exception that hyphens (``-``) are permitted and the name can start with numbers.
classicsong marked this conversation as resolved.
Show resolved Hide resolved
* **--yaml-s3**: S3 location of yaml file for training and inference.
* **--custom-script**: Custom training script provided by customers to run customer training logic. This should be a path to the Python script within the Docker image.
* **--output-emb-s3**: S3 location to store GraphStorm generated node embeddings. This is an inference only argument.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@
import json
import logging
import os
import re
from pathlib import Path
import tempfile
import time
Expand Down Expand Up @@ -540,7 +541,11 @@ def parse_args() -> argparse.Namespace:
parser.add_argument(
"--graph-name",
type=str,
help="Name for the graph being processed.",
help="Name for the graph being processed."
"The graph name must adhere to the Python "
"identifier naming rules with the exception "
"that hyphens (-) are permitted and the name "
"can start with numbers",
required=False,
default=None,
)
Expand All @@ -564,6 +569,33 @@ def parse_args() -> argparse.Namespace:
return parser.parse_args()


def check_graph_name(graph_name):
"""Check whether the graph name is a valid graph name.

We enforce that the graph name adheres to the Python
identifier naming rules as in
https://docs.python.org/3/reference/lexical_analysis.html#identifiers,
with the exception that hyphens (-) are permitted
and the name can start with numbers.
This helps avoid the cases when an invalid graph name,
such as `/graph`, causes unexpected errors.

Note: Same as graphstorm.utils.check_graph_name.

Parameter
---------
graph_name: str
Graph Name.
"""
gname = re.sub(r"^\d+", "", graph_name)
assert gname.replace("-", "_").isidentifier(), (
"GraphStorm expects the graph name adheres to the Python"
"identifier naming rules with the exception that hyphens "
"(-) are permitted and the name can start with numbers. "
f"Got: {graph_name}"
)


def main():
"""Main entry point for GSProcessing"""
# Allows us to get typed arguments from the command line
Expand All @@ -572,6 +604,7 @@ def main():
level=gsprocessing_args.log_level,
format="[GSPROCESSING] %(asctime)s %(levelname)-8s %(message)s",
)
check_graph_name(gsprocessing_args.graph_name)

# Determine execution environment
if os.path.exists("/opt/ml/config/processingjobconfig.json"):
Expand Down
9 changes: 7 additions & 2 deletions python/graphstorm/gconstruct/construct_graph.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@
import dgl
from dgl.distributed.constants import DEFAULT_NTYPE, DEFAULT_ETYPE

from ..utils import sys_tracker, get_log_level
from ..utils import sys_tracker, get_log_level, check_graph_name
from .file_io import parse_node_file_format, parse_edge_file_format
from .file_io import get_in_files
from .transform import parse_feat_ops, process_features, preprocess_features
Expand Down Expand Up @@ -742,6 +742,7 @@ def print_graph_info(g, node_data, edge_data, node_label_stats, edge_label_stats
def process_graph(args):
""" Process the graph.
"""
check_graph_name(args.graph_name)
logging.basicConfig(level=get_log_level(args.logging_level))
with open(args.conf_file, 'r', encoding="utf8") as json_file:
process_confs = json.load(json_file)
Expand Down Expand Up @@ -909,7 +910,11 @@ def process_graph(args):
argparser.add_argument("--output-dir", type=str, required=True,
help="The path of the output data folder.")
argparser.add_argument("--graph-name", type=str, required=True,
help="The graph name")
help="Name for the graph being processed."
"The graph name must adhere to the Python "
"identifier naming rules with the exception "
"that hyphens (-) are permitted "
"and the name can start with numbers",)
argparser.add_argument("--remap-node-id", action='store_true',
help="Whether or not to remap node IDs.")
argparser.add_argument("--add-reverse-edges", action='store_true',
Expand Down
28 changes: 27 additions & 1 deletion python/graphstorm/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
import time
import resource
import logging
import re
import psutil

import pandas as pd
Expand All @@ -31,6 +32,29 @@
USE_WHOLEGRAPH = False
GS_DEVICE = th.device('cpu')

def check_graph_name(graph_name):
""" Check whether the graph name is a valid graph name.

We enforce that the graph name adheres to the Python
identifier naming rules as in
https://docs.python.org/3/reference/lexical_analysis.html#identifiers,
with the exception that hyphens (-) are permitted
and the name can start with numbers.
This helps avoid the cases when an invalid graph name,
such as `/graph`, causes unexpected errors.

Parameter
---------
graph_name: str
Graph Name.
"""
gname = re.sub(r'^\d+', '', graph_name)
assert gname.replace('-', '_').isidentifier(), \
"GraphStorm expects the graph name adheres to the Python" \
"identifier naming rules with the exception that hyphens " \
"(-) are permitted and the name can start with numbers. " \
f"Got: {graph_name}."

def get_graph_name(part_config):
""" Get graph name from graph partition config file

Expand All @@ -45,7 +69,9 @@ def get_graph_name(part_config):
"""
with open(part_config, "r", encoding='utf-8') as f:
config = json.load(f)
return config["graph_name"]
graph_name = config["graph_name"]
check_graph_name(graph_name)
return graph_name

def setup_device(local_rank):
r"""Setup computation device.
Expand Down
28 changes: 28 additions & 0 deletions tests/unit-tests/test_gsf.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,12 @@

Unit tests for gsf.py
"""
import pytest

from graphstorm.gsf import (create_builtin_node_decoder,
create_builtin_edge_decoder,
create_builtin_lp_decoder)
from graphstorm.utils import check_graph_name
from graphstorm.config import (BUILTIN_TASK_NODE_CLASSIFICATION,
BUILTIN_TASK_NODE_REGRESSION,
BUILTIN_TASK_EDGE_CLASSIFICATION,
Expand Down Expand Up @@ -449,7 +451,33 @@ def test_create_builtin_lp_decoder():
assert decoder.gamma == 6.


def test_check_graph_name():
graph_name = "a"
check_graph_name(graph_name)
graph_name = "graph_name"
check_graph_name(graph_name)
graph_name = "graph-name"
check_graph_name(graph_name)
graph_name = "123-graph-name"
check_graph_name(graph_name)
graph_name = "_Graph-name"
check_graph_name(graph_name)

# test with invalid graph name
graph_name = "/graph_name"
with pytest.raises(AssertionError):
check_graph_name(graph_name)

graph_name = "|graph_name"
with pytest.raises(AssertionError):
check_graph_name(graph_name)

graph_name = "\graph_name"
with pytest.raises(AssertionError):
check_graph_name(graph_name)

if __name__ == '__main__':
test_check_graph_name()
test_create_builtin_node_decoder()
test_create_builtin_edge_decoder()
test_create_builtin_lp_decoder()
Loading