[FSTORE-1672] Allow multiple on-demand features to be returned from an on-demand transformation function and allow passing of local variables to a transformation function #452

manu-sj · 2025-01-17T09:38:04Z

This PR adds support for:

- Returning multiple on-demand features from an on-demand transformation function.
- Passing context variables to transformation functions.
- Inserting DataFrames that already contain on-demand features into a feature group with on-demand transformation functions.
- Some changes in get_feature_vector to improve user experience:
  - Implicitly use arguments passed in entries as request_parameters if they are not explicitly specified there.
  - Implicitly use arguments passed in their "un-prefixed form" in request_parameters if they are not provided in their "prefixed form".
- Specifying return_type in both transform and compute_on_demand_features.
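The two implicit fallbacks listed for get_feature_vector can be pictured with a small stand-alone sketch. It does not use hsfs: `resolve_request_parameters`, its arguments, and the way a prefix is stripped from a feature name are all hypothetical and chosen only for illustration; the lookup order is an assumption based on the description above.

```python
from typing import Any, Dict, Iterable, Optional


def resolve_request_parameters(
    required: Iterable[str],
    entry: Optional[Dict[str, Any]],
    request_parameters: Optional[Dict[str, Any]],
    prefix: str = "",
) -> Dict[str, Any]:
    """Hypothetical helper: resolve each required (prefixed) parameter name.

    Assumed lookup order:
      1. prefixed name in request_parameters
      2. un-prefixed name in request_parameters
      3. prefixed, then un-prefixed name in entry (implicit fallback)
    """
    entry = entry or {}
    request_parameters = request_parameters or {}
    resolved: Dict[str, Any] = {}
    for name in required:
        # Strip the prefix only when one is configured and actually present.
        unprefixed = name[len(prefix):] if prefix and name.startswith(prefix) else name
        for source in (request_parameters, entry):
            if name in source:
                resolved[name] = source[name]
                break
            if unprefixed in source:
                resolved[name] = source[unprefixed]
                break
    return resolved


params = resolve_request_parameters(
    required=["fg1_amount", "fg1_user_id"],
    entry={"user_id": 7},
    request_parameters={"amount": 10.0},
    prefix="fg1_",
)
```

Here `fg1_amount` is satisfied by the un-prefixed request parameter and `fg1_user_id` falls back to the entry, so the caller does not have to repeat the key in both places.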

This PR also fixes a few bugs:

- Checking for missing request parameters was being done even when transform was set to False in get_feature_vector.
- Adding transformation functions to label features caused the get_feature_vector call to fail.

JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1672

Priority for Review: -

Related PRs: -

How Has This Been Tested?

Unit Tests
Integration Tests
Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

bubriks

Looks good, but I would like to avoid duplication where possible.

bubriks · 2025-01-31T08:47:45Z

python/hsfs/hopsworks_udf.py

+        if UDFKeyWords.STATISTICS.value in arg_list:
+            arg_list.remove(UDFKeyWords.STATISTICS.value)
+        if UDFKeyWords.CONTEXT.value in arg_list:
+            arg_list.remove(UDFKeyWords.CONTEXT.value)


might be better like this

keywords_to_remove = {UDFKeyWords.STATISTICS.value, UDFKeyWords.CONTEXT.value}
arg_list = [arg for arg in arg_list if arg not in keywords_to_remove]

Yes, I agree. I adapted the code to the format you mentioned.
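A runnable sketch of the suggested filtering, for reference. The `UDFKeyWords` enum below is a stand-in for the one in hsfs/hopsworks_udf.py and its string values are assumed; `strip_keywords` is a hypothetical name.

```python
from enum import Enum
from typing import List


class UDFKeyWords(Enum):
    # Stand-in for the enum in hsfs/hopsworks_udf.py; the string values are assumed.
    STATISTICS = "statistics"
    CONTEXT = "context"


def strip_keywords(arg_list: List[str]) -> List[str]:
    """Return a new list without the reserved keyword arguments."""
    keywords_to_remove = {UDFKeyWords.STATISTICS.value, UDFKeyWords.CONTEXT.value}
    return [arg for arg in arg_list if arg not in keywords_to_remove]


args = ["feature_a", "statistics", "feature_b", "context"]
print(strip_keywords(args))  # ['feature_a', 'feature_b']
```

Unlike repeated `list.remove` calls, the comprehension leaves the input list untouched and drops every occurrence of a keyword, not just the first one.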

bubriks · 2025-01-31T08:50:34Z

python/hsfs/hopsworks_udf.py

+            scope.update({UDFKeyWords.STATISTICS.value: self.transformation_statistics})
+        if self.transformation_context:
+            scope.update({UDFKeyWords.CONTEXT.value: self.transformation_context})


some duplication with scope.update

also might be cleaner to do something like this

scope.update({
    k: v
    for k, v in {
        UDFKeyWords.STATISTICS.value: self.transformation_statistics,
        UDFKeyWords.CONTEXT.value: self.transformation_context,
        "_output_col_names": self.output_column_names,
        "_date_time_output_index": date_time_output_index,
    }.items()
    if v is not None
})

Easy to add more keys and values later as well

Yes agreed, I added a common function for preparing the scope for UDF and also added a common dictionary in the function that can be used to inject variables that are required for both pandas and python UDFs.
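A minimal sketch of what such a common scope-preparation function could look like. `prepare_udf_scope` and its parameters are hypothetical, and the literal keys "statistics" and "context" are assumed stand-ins for the `UDFKeyWords` values; the function actually added in the PR is not shown in this thread.

```python
from typing import Any, Dict, List, Optional


def prepare_udf_scope(
    statistics: Optional[Any],
    context: Optional[Dict[str, Any]],
    output_col_names: List[str],
    date_time_output_index: Optional[List[int]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect the variables injected into a UDF's scope."""
    candidates = {
        "statistics": statistics,  # assumed value of UDFKeyWords.STATISTICS
        "context": context,        # assumed value of UDFKeyWords.CONTEXT
        "_output_col_names": output_col_names,
        "_date_time_output_index": date_time_output_index,
    }
    # Keep only the entries that were actually provided.
    return {key: value for key, value in candidates.items() if value is not None}


scope = prepare_udf_scope(None, {"threshold": 3}, ["out_0", "out_1"])
```

One behavioural difference is worth keeping in mind: the original `if self.transformation_context:` check skips an empty dictionary, while an `is not None` filter keeps it.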

bubriks · 2025-01-31T08:55:13Z

python/hsfs/transformation_function.py

+                if len(self.__hopsworks_udf.return_types) > 1:
+                    output_col_names = [
+                        f"{self.__hopsworks_udf.function_name}_{i}"
+                        for i in range(0, len(self.__hopsworks_udf.return_types))
+                    ]
+                else:
+                    output_col_names = [self.__hopsworks_udf.function_name]


Same as for TransformationType.MODEL_DEPENDENT apart from the naming; it would be nice to avoid duplication.

Refactored the code to avoid the duplication and make it cleaner.
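The naming rule in the hunk above can be isolated in one small helper. This is only a sketch of the on-demand branch shown in the diff; `build_output_column_names` is a hypothetical name and the refactored code in the PR may be structured differently.

```python
from typing import List, Sequence


def build_output_column_names(function_name: str, return_types: Sequence[type]) -> List[str]:
    """Hypothetical helper mirroring the on-demand naming rule in the diff.

    Several return types give "<function_name>_<i>" per output;
    a single return type gives just "<function_name>".
    """
    if len(return_types) > 1:
        return [f"{function_name}_{i}" for i in range(len(return_types))]
    return [function_name]


print(build_output_column_names("add_and_sub", [int, int]))  # ['add_and_sub_0', 'add_and_sub_1']
print(build_output_column_names("plus_one", [int]))          # ['plus_one']
```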

bubriks · 2025-01-31T08:56:47Z

python/tests/engine/test_spark.py

+                "col_2": [True, False],
+                "plus_one_col_0_": [21, 22],
+            }
+        )  # todo why it doesnt return int?


Is this still to be resolved?

Yes, I think it was something I forgot to remove from my initial PRs for model-dependent transformations, and it was then copy-pasted. Sorry, my bad. I removed the comments.

bubriks · 2025-01-31T09:04:45Z

python/hsfs/core/vector_server.py

+            missing_features = required_features - set(feature_vectors.columns)
+            if missing_features:
+                raise exceptions.FeatureStoreException(
+                    f"The input feature vector is missing the following required features: `{'`, `'.join(missing_features)}`. Please include these features in the feature vector."


Duplicate, maybe better to have a method that verifies this.

Yes, it might be better to have all the verifications in one place so that it is easier to add more if required. Added a new function _validate_input_features that performs the validations.
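A stand-alone sketch of such a validation helper, built from the check in the hunk above. The exception class is a local stand-in for the one used in the diff, the signature is assumed, and sorting the missing names (to make the message deterministic) is an addition for the example rather than something shown in the PR.

```python
from typing import Iterable, Set


class FeatureStoreException(Exception):
    """Local stand-in for the exception class raised in the diff."""


def validate_input_features(required_features: Set[str], provided_columns: Iterable[str]) -> None:
    """Hypothetical helper: raise if any required feature is absent from the input."""
    missing_features = required_features - set(provided_columns)
    if missing_features:
        formatted = "`, `".join(sorted(missing_features))
        raise FeatureStoreException(
            "The input feature vector is missing the following required features: "
            f"`{formatted}`. Please include these features in the feature vector."
        )


# Passes silently: every required feature is present.
validate_input_features({"amount"}, ["amount", "user_id"])
```

Keeping the check in one function means each caller only has to pass its own set of required features.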

bubriks · 2025-01-31T09:06:36Z

python/hsfs/core/vector_server.py

+                if prefixed_feature in request_parameter.keys():
+                    feature_value = request_parameter[prefixed_feature]
+                elif unprefixed_feature in request_parameter.keys():
+                    feature_value = request_parameter[unprefixed_feature]
+                else:
+                    feature_value = rows[prefixed_feature]


could replace it with:

Suggested change

-                if prefixed_feature in request_parameter.keys():
-                    feature_value = request_parameter[prefixed_feature]
-                elif unprefixed_feature in request_parameter.keys():
-                    feature_value = request_parameter[unprefixed_feature]
-                else:
-                    feature_value = rows[prefixed_feature]
+                feature_value = request_parameter.get(prefixed_feature,
+                                    request_parameter.get(unprefixed_feature,
+                                        rows.get(prefixed_feature)))

Replaced as per suggestion.
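For reference, the two forms side by side as plain functions. This sketch assumes `request_parameter` and `rows` are ordinary dictionaries; the function names are hypothetical.

```python
from typing import Any, Dict


def lookup_with_if_chain(
    request_parameter: Dict[str, Any], rows: Dict[str, Any], prefixed: str, unprefixed: str
) -> Any:
    # Original form: the final branch raises KeyError when the feature is absent from rows.
    if prefixed in request_parameter:
        return request_parameter[prefixed]
    elif unprefixed in request_parameter:
        return request_parameter[unprefixed]
    else:
        return rows[prefixed]


def lookup_with_get(
    request_parameter: Dict[str, Any], rows: Dict[str, Any], prefixed: str, unprefixed: str
) -> Any:
    # Suggested form: all fallbacks are evaluated eagerly, and a missing
    # feature yields None instead of raising.
    return request_parameter.get(
        prefixed, request_parameter.get(unprefixed, rows.get(prefixed))
    )


rows = {"fg1_amount": 1.0}
print(lookup_with_get({"amount": 5.0}, rows, "fg1_amount", "amount"))  # 5.0
print(lookup_with_get({}, rows, "fg1_amount", "amount"))               # 1.0
```

The two agree whenever the value can be found somewhere. They differ only when it is missing everywhere: the if-chain raises `KeyError`, while the nested `get` returns `None`.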

manu-sj force-pushed the FSTORE-1672 branch 3 times, most recently from 73da6f0 to 6901443 on January 28, 2025 13:58

manu-sj marked this pull request as ready for review January 29, 2025 20:09

manu-sj force-pushed the FSTORE-1672 branch from 95c9921 to 11ec927 on January 30, 2025 08:17

manu-sj requested a review from bubriks January 30, 2025 12:36

bubriks requested changes Jan 31, 2025

manu-sj added 9 commits January 31, 2025 16:37

working code for many to many transformation and also transformation context

ce25599

adding comments and tests

45abf24

update feature group schema based on transformation function only if transform is set to true

523dbd7

throw exception if transformation context is passed to training dataset materialization jobs from the python engine

f5d0299

adapting check_missing_request_parameters to handle multiple on-demand feature returned from a single transformation function

bdaf23c

checking explicitly that the passed feature vector is None so that pandas dataframe can also be passed to get feature vector

21b2b2d

adding else to handle usecase in which features have to be overwritten after transformations

2fee073

adding null check for entries before trying to add them to request parameters

3aa510d

addressing review comments

a17901b

manu-sj force-pushed the FSTORE-1672 branch from 75f338a to a17901b on January 31, 2025 15:37
