[GSProcessing] Add structure for saving transformation JSON files. (#857)

## Description of changes:

* This commit only adds saving the transformations in a JSON representation.
* Loading and applying the pre-computed transformations will come in an upcoming PR.
* First implemented for the categorical transformation.

Detailed changes:

To support the saving and loading of pre-computed transformations, we use the following design.

* A new JSON file is created in the output that includes JSON representations of the transformations for which we have implemented a representation.
* The base class `DistributedTransformation` now takes an _optional_ SparkSession and an _optional_ `json_representation` `dict` during initialization.
* We add a `get_json_representation()` function to `DistributedTransformation` which returns the JSON representation of the transformation if it's not None, or an empty dict otherwise. The assumed contract is that each individual transformation must populate its `json_representation` dict during the call to `apply()`, which applies the transformation for the first time.
* The `DistFeatureTransformer` now also takes a SparkSession and a `json_representation` `dict` in its constructor. The Spark session is currently only passed to the constructor of `DistCategoryTransformation`; we need it to create Spark DataFrames during `apply()`.
* The `apply_transformation` function of `DistFeatureTransformer` now returns a tuple, `(processed_df: DataFrame, json_representation: dict)`. The JSON representation is retrieved using the `get_json_representation()` function of each distributed transformation implementation. Currently only the categorical transformation returns a non-empty dict.
* We add a `transformation_representations` dict as a member of `DistHeterogeneousGraphLoader`. This dict is used to gather the representations of each feature as we iterate through them.
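The constructor and `get_json_representation()` contract described above can be sketched as follows. This is a minimal plain-Python illustration of the contract, not the actual GSProcessing code; the parameter names are assumptions based on the description:

```python
from typing import Optional


class DistributedTransformation:
    """Minimal sketch of the base-class contract described above."""

    def __init__(self, cols, spark=None, json_representation: Optional[dict] = None):
        self.cols = cols
        self.spark = spark  # optional SparkSession
        # Optional pre-computed representation; subclasses are expected to
        # populate this dict during the first call to apply().
        self.json_representation = json_representation

    def get_json_representation(self) -> dict:
        # Return the representation if it has been populated, else an empty dict.
        return self.json_representation if self.json_representation is not None else {}
```

Under this contract, callers can always safely request a representation: transformations without one simply report `{}`.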
The structure of `transformation_representations` is:

```python
{
    "node_features": {
        "node_type1": {
            "feature_name1": {
                # feature representation goes here
            }
        },
        "node_type2": {...}
    },
    "edge_features": {...}
}
```

* At the end of graph loading this dict is saved to storage under the output prefix as `precomputed_transformations.json`. This file will be used to re-construct the feature transformations in an upcoming PR.

Particularly for the `DistCategoryTransformation`:

* We choose to save its representation as a dict with the following structure:

```
string_indexer_labels_array: tuple[tuple[str]], the outer tuple has num_cols
    elements, each inner tuple has num_cats elements, and each str is a category
    string. Spark uses this to represent the one-hot index for each category:
    a string's position in the inner tuple is the one-hot-index position for
    that string. Categories are sorted by their frequency in the data.
cols: list[str], with num_cols elements, each item is one column name that was
    used in the original transformation.
per_col_label_to_one_hot_idx: dict[str, dict[str, int]], with num_cols elements,
    each with num_categories elements; a more verbose mapping from column name
    to a dict of string to one-hot-index position.
transformation_name: str, will be 'DistCategoryTransformation'
```

The `string_indexer_labels_array` comes from Spark's own representation of its StringIndexer class, encapsulated in the `labelsArray` var of `StringIndexerModel`.
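To make the relationship between `string_indexer_labels_array`, `cols`, and `per_col_label_to_one_hot_idx` concrete, here is a plain-Python sketch of deriving the verbose per-column mapping from the labels array. The helper name is hypothetical, not part of the PR:

```python
def build_per_col_mapping(cols, labels_array):
    # For each column, map each category string to its one-hot index,
    # i.e. its position within that column's labels tuple.
    return {
        col: {label: idx for idx, label in enumerate(labels)}
        for col, labels in zip(cols, labels_array)
    }

mapping = build_per_col_mapping(["state"], [("wa", "ca", "ny")])
# mapping == {"state": {"wa": 0, "ca": 1, "ny": 2}}
```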
See docs here:
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.StringIndexerModel.html#pyspark.ml.feature.StringIndexerModel.labelsArray
and
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.StringIndexer.html#pyspark.ml.feature.StringIndexer

Example representation for input data:

```
state
wa
ca
wa
# no string here, represents missing value
ny
```

```json
{
    "node_features": {
        "user": {
            "state": {
                "transformation_name": "DistCategoryTransformation",
                "string_indexer_labels_arrays": [["wa", "ca", "ny"]],
                "cols": ["state"],
                "per_col_label_to_one_hot_idx": {
                    "state": {
                        "wa": 0,
                        "ca": 1,
                        "ny": 2
                    }
                }
            }
        }
    },
    "edge_features": {}
}
```

To reconstruct the transformation on a new DF we will iterate over the `cols`, and for each, create one-hot vectors according to the position of each string in the corresponding labels array. E.g. given a labels array `["string2", "string1", "string3"]`, when in the input data we encounter `string2`, it will be transformed to `[1, 0, 0]`, since its position in the labels array is position 0. `string1` will have the representation `[0, 1, 0]`, etc. All the code changes in the `apply` function of `DistCategoryTransformation` are meant to build up this representation.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

---------

Co-authored-by: jalencato <[email protected]>
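The reconstruction rule described in the PR can be sketched in plain Python. This is an illustrative helper only, not the Spark-based implementation the upcoming PR will add:

```python
def one_hot_from_labels(value, labels_array):
    # Build a one-hot vector whose hot position is the value's index
    # in the saved labels array.
    vec = [0] * len(labels_array)
    vec[labels_array.index(value)] = 1
    return vec

labels = ["string2", "string1", "string3"]
one_hot_from_labels("string2", labels)  # -> [1, 0, 0]
one_hot_from_labels("string1", labels)  # -> [0, 1, 0]
```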