[GSProcessing] Small doc fixes #750

Merged: 1 commit, Feb 26, 2024
6 changes: 3 additions & 3 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -286,15 +286,15 @@ can contain the following top-level keys:
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
the individual transformations supported see :ref:`gsp-supported-transformations-ref`.
If this key is missing, the feature is treated as
a **no-op** feature without ``kwargs``.

- ``name`` (String, required): The name of the transformation to be
applied.
- ``kwargs`` (JSON object, optional): A dictionary of parameter
names and values. Each individual transformation will have its own
supported parameters, described in :ref:`supported-transformations`.
supported parameters, described in :ref:`gsp-supported-transformations-ref`.

- ``name`` (String, optional): The name that will be given to the
encoded feature. If not given, **column** is used as the output name.
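
For illustration, a single feature entry that uses the keys above might look like the sketch below, written as a Python dict mirroring the JSON structure. The transformation name and kwargs shown are assumptions for the sake of the example; the exact supported values are listed under :ref:`gsp-supported-transformations-ref`.

    # Hypothetical feature entry: "column" is the raw input column, "name" renames
    # the encoded output, and "transformation" selects a transformation plus kwargs.
    feature_entry = {
        "column": "age",
        "name": "age_encoded",
        "transformation": {
            "name": "numerical",                  # assumed transformation name
            "kwargs": {"normalizer": "min-max"},  # assumed kwarg, for illustration
        },
    }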
@@ -470,7 +470,7 @@ arguments.
You can find all models in the `Huggingface model repository <https://huggingface.co/models>`_.
- ``max_seq_length`` (Integer, required): Specifies the maximum number of tokens of the input.
You can use a length greater than the dataset's longest sentence; or for a safe value choose 128. Make sure to check
the model's max supported length when setting this value.
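
As a sketch of where ``max_seq_length`` sits in a feature definition (the surrounding kwarg names such as ``action`` and ``hf_model`` are assumptions here; check the transformation's documented parameters for the exact names and values):

    # Hypothetical text feature definition using a Huggingface model with a capped
    # sequence length; 128 is the safe value suggested above.
    text_feature = {
        "column": "paper_abstract",
        "transformation": {
            "name": "huggingface",                # assumed transformation name
            "kwargs": {
                "action": "tokenize_hf",          # assumed action value
                "hf_model": "bert-base-uncased",  # any model from the Huggingface hub
                "max_seq_length": 128,
            },
        },
    }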

--------------

6 changes: 3 additions & 3 deletions docs/source/gs-processing/usage/example.rst
@@ -32,17 +32,17 @@ that contains the relevant data:
Expected file inputs and configuration
--------------------------------------

GSProcessing expects the input files to be in specific format that will allow
GSProcessing expects the input files to be in a specific format that will allow
us to perform the processing and prepare the data for partitioning and training.

The data files are expected to be:

* Tabular data files. We support CSV-with-header format or Parquet format.
The files can be split (multiple parts), or a single file.
* Available on a local file system or on S3.
* One tabular file source per edge and node type. For example, for a particular edge
* One prefix per edge and node type. For example, for a particular edge
type, all node identifiers (source, destination), features, and labels should
exist as columns in a single file source.
exist as columns in one or more files under a common prefix (local or on S3).
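
For illustration only (bucket, graph, and type names are made up), the layout implied by the points above could look like this, with every file for a given node or edge type sharing a common prefix:

    # One prefix per node type and per edge type; each prefix can hold a single file
    # or multiple part-files, in CSV-with-header or Parquet format.
    example_inputs = [
        "s3://my-bucket/my-graph/nodes/author/part-00000.parquet",
        "s3://my-bucket/my-graph/nodes/author/part-00001.parquet",
        "s3://my-bucket/my-graph/edges/author_writes_paper/part-00000.parquet",
    ]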

Apart from the data, GSProcessing also requires a configuration file that describes the
data and the transformations we will need to apply to the features and any encoding needed for
@@ -180,7 +180,7 @@ def __init__(
"graph"
]
else:
logging.warning("Unrecognized version name: %s", config_version)
logging.warning("Unrecognized configuration file version name: %s", config_version)
try:
converter = GConstructConfigConverter()
self.graph_config_dict = converter.convert_to_gsprocessing(dataset_config_dict)[
@@ -192,8 +192,10 @@ def __init__(
"graph" in dataset_config_dict
), "Top-level element 'graph' needs to exist in a GSProcessing config"
self.graph_config_dict = dataset_config_dict["graph"]
logging.info("Parsed config file as GSProcessing config")
else:
# Older versions of GConstruct configs might be missing a version entry
logging.warning("No configuration file version name, trying to parse as GConstruct...")
converter = GConstructConfigConverter()
self.graph_config_dict = converter.convert_to_gsprocessing(dataset_config_dict)["graph"]
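
To make the branching above concrete, a minimal GSProcessing-style input to this constructor might look like the sketch below; the exact version string and inner structure are assumptions for illustration. A GConstruct config would typically lack the "version" entry and be routed through GConstructConfigConverter instead.

    # Hypothetical config dict: a recognized GSProcessing version plus the required
    # top-level "graph" element.
    dataset_config_dict = {
        "version": "gsprocessing-v1.0",          # assumed version string
        "graph": {"nodes": [], "edges": []},     # assumed inner structure
    }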

@@ -263,7 +265,7 @@ def parse_args() -> argparse.Namespace:
parser.add_argument(
"--config-filename",
type=str,
help="GSProcessing data configuration filename.",
help="GConstruct or GSProcessing data configuration filename.",
required=True,
)
parser.add_argument(
@@ -309,9 +311,12 @@ def main():
is_sagemaker_execution = os.path.exists("/opt/ml/config/processingjobconfig.json")

if gsprocessing_args.input_prefix.startswith("s3://"):
assert gsprocessing_args.output_prefix.startswith(
"s3://"
), "When providing S3 input and output prefixes, they must both be S3."
assert gsprocessing_args.output_prefix.startswith("s3://"), (
"When providing S3 input and output prefixes, they must both be S3 URIs, got: "
f"input: '{gsprocessing_args.input_prefix}' "
f"and output: '{gsprocessing_args.output_prefix}'."
)

filesystem_type = "s3"
else:
# Ensure input and output prefixes exist and convert to absolute paths
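
A quick sketch of what the improved assertion message above produces when it fires; the prefix values here are made up:

    # Simulating the assertion above with a mismatched pair of prefixes.
    input_prefix = "s3://my-bucket/input/"
    output_prefix = "/tmp/gsprocessing-output"   # not an S3 URI
    assert output_prefix.startswith("s3://"), (
        "When providing S3 input and output prefixes, they must both be S3 URIs, got: "
        f"input: '{input_prefix}' "
        f"and output: '{output_prefix}'."
    )
    # Raises: AssertionError: When providing S3 input and output prefixes, they must
    # both be S3 URIs, got: input: 's3://my-bucket/input/' and output: '/tmp/gsprocessing-output'.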