Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FSTORE-1672] Allow multiple on-demand features to be returned from an on-demand transformation function and allow passing of local variables to a transformation function #452

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

manu-sj
Copy link
Contributor

@manu-sj manu-sj commented Jan 17, 2025

This PR adds support for

  1. Returning multiple on-features from an on-demand transformation function.
  2. Passing context variables to transformation functions.
  3. Inserting DataFrames that already contains on-demand features into a feature group with on-demand transformation functions.
  4. Some changes in get_feature_vector to improve user experience:
  • Implicitly use argument passed in entries as request_parameters if they are not explicitly specified in it.
  • Implicitly use argument passed in their "un-prefixed form" in request parameter if they are not provided in their "prefixed form".
  • Add support for specifying return_type in both transform and compute_on_demand_features.

This PR also fixes a few bugs:

  1. Checking for missing request parameters were being done even when transform was set to False in get_feature_vector`.
  2. Adding transformation functions to label features causes get_feature_vector call to fail.

JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1672

Priority for Review: -

Related PRs: -

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

@manu-sj manu-sj force-pushed the FSTORE-1672 branch 3 times, most recently from 73da6f0 to 6901443 Compare January 28, 2025 13:58
@manu-sj manu-sj marked this pull request as ready for review January 29, 2025 20:09
@manu-sj manu-sj requested a review from bubriks January 30, 2025 12:36
Copy link
Contributor

@bubriks bubriks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but i would like to avoid duplication where possible.

Comment on lines 433 to 436
if UDFKeyWords.STATISTICS.value in arg_list:
arg_list.remove(UDFKeyWords.STATISTICS.value)
if UDFKeyWords.CONTEXT.value in arg_list:
arg_list.remove(UDFKeyWords.CONTEXT.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be better like this

keywords_to_remove = {UDFKeyWords.STATISTICS.value, UDFKeyWords.CONTEXT.value}
arg_list = [arg for arg in arg_list if arg not in keywords_to_remove]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree, I adapted the code to the format you have mentioned.

Comment on lines 592 to 594
scope.update({UDFKeyWords.STATISTICS.value: self.transformation_statistics})
if self.transformation_context:
scope.update({UDFKeyWords.CONTEXT.value: self.transformation_context})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some duplication with scope.update

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also might be cleaner to do something like this

scope.update({
    k: v for k, v in {
        UDFKeyWords.STATISTICS.value: self.transformation_statistics,
        UDFKeyWords.CONTEXT.value: self.transformation_context,
        "_output_col_names": self.output_column_names,
        "_date_time_output_index": date_time_output_index
    }.items() if v is not None
})

Easy to add more keys and values later as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed, I added a common function for preparing the scope for UDF and also added a common dictionary in the function that can be used to inject variables that are required for both pandas and python UDFs.

Comment on lines 285 to 291
if len(self.__hopsworks_udf.return_types) > 1:
output_col_names = [
f"{self.__hopsworks_udf.function_name}_{i}"
for i in range(0, len(self.__hopsworks_udf.return_types))
]
else:
output_col_names = [self.__hopsworks_udf.function_name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as for the TransformationType.MODEL_DEPENDENT only naming, would be nice to avoid duplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the code so as to avoid duplication of code and make it cleaner.

"col_2": [True, False],
"plus_one_col_0_": [21, 22],
}
) # todo why it doesnt return int?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this to resolved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it was something I forgot to remove from my initial PR's for model dependent transformations and then it was copy pasted. Sorry my bad removed the comments

Comment on lines 688 to 691
missing_features = required_features - set(feature_vectors.columns)
if missing_features:
raise exceptions.FeatureStoreException(
f"The input feature vector is missing the following required features: `{'`, `'.join(missing_features)}`. Please include these features in the feature vector."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate, maybe better to have a method that verifies this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it might to better to have all the verifications in one places so that it might be easier to go and add more if require. Added a new function _validate_input_features that performs the validations.

Comment on lines 1139 to 1144
if prefixed_feature in request_parameter.keys():
feature_value = request_parameter[prefixed_feature]
elif unprefixed_feature in request_parameter.keys():
feature_value = request_parameter[unprefixed_feature]
else:
feature_value = rows[prefixed_feature]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could replace it with:

Suggested change
if prefixed_feature in request_parameter.keys():
feature_value = request_parameter[prefixed_feature]
elif unprefixed_feature in request_parameter.keys():
feature_value = request_parameter[unprefixed_feature]
else:
feature_value = rows[prefixed_feature]
feature_value = request_parameter.get(prefixed_feature,
request_parameter.get(unprefixed_feature,
rows.get(prefixed_feature)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced as per suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants