-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FSTORE-1672] Allow multiple on-demand features to be returned from an on-demand transformation function and allow passing of local variables to a transformation function #452
base: main
Are you sure you want to change the base?
Conversation
73da6f0
to
6901443
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but i would like to avoid duplication where possible.
python/hsfs/hopsworks_udf.py
Outdated
if UDFKeyWords.STATISTICS.value in arg_list: | ||
arg_list.remove(UDFKeyWords.STATISTICS.value) | ||
if UDFKeyWords.CONTEXT.value in arg_list: | ||
arg_list.remove(UDFKeyWords.CONTEXT.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be better like this
keywords_to_remove = {UDFKeyWords.STATISTICS.value, UDFKeyWords.CONTEXT.value}
arg_list = [arg for arg in arg_list if arg not in keywords_to_remove]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree, I adapted the code to the format you have mentioned.
python/hsfs/hopsworks_udf.py
Outdated
scope.update({UDFKeyWords.STATISTICS.value: self.transformation_statistics}) | ||
if self.transformation_context: | ||
scope.update({UDFKeyWords.CONTEXT.value: self.transformation_context}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some duplication with scope.update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also might be cleaner to do something like this
scope.update({
k: v for k, v in {
UDFKeyWords.STATISTICS.value: self.transformation_statistics,
UDFKeyWords.CONTEXT.value: self.transformation_context,
"_output_col_names": self.output_column_names,
"_date_time_output_index": date_time_output_index
}.items() if v is not None
})
Easy to add more keys and values later as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes agreed, I added a common function for preparing the scope for UDF and also added a common dictionary in the function that can be used to inject variables that are required for both pandas and python UDFs.
if len(self.__hopsworks_udf.return_types) > 1: | ||
output_col_names = [ | ||
f"{self.__hopsworks_udf.function_name}_{i}" | ||
for i in range(0, len(self.__hopsworks_udf.return_types)) | ||
] | ||
else: | ||
output_col_names = [self.__hopsworks_udf.function_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as for the TransformationType.MODEL_DEPENDENT
only naming, would be nice to avoid duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored the code so as to avoid duplication of code and make it cleaner.
python/tests/engine/test_spark.py
Outdated
"col_2": [True, False], | ||
"plus_one_col_0_": [21, 22], | ||
} | ||
) # todo why it doesnt return int? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this to resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think it was something I forgot to remove from my initial PR's for model dependent transformations and then it was copy pasted. Sorry my bad removed the comments
python/hsfs/core/vector_server.py
Outdated
missing_features = required_features - set(feature_vectors.columns) | ||
if missing_features: | ||
raise exceptions.FeatureStoreException( | ||
f"The input feature vector is missing the following required features: `{'`, `'.join(missing_features)}`. Please include these features in the feature vector." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate, maybe better to have a method that verifies this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it might to better to have all the verifications in one places so that it might be easier to go and add more if require. Added a new function _validate_input_features
that performs the validations.
python/hsfs/core/vector_server.py
Outdated
if prefixed_feature in request_parameter.keys(): | ||
feature_value = request_parameter[prefixed_feature] | ||
elif unprefixed_feature in request_parameter.keys(): | ||
feature_value = request_parameter[unprefixed_feature] | ||
else: | ||
feature_value = rows[prefixed_feature] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could replace it with:
if prefixed_feature in request_parameter.keys(): | |
feature_value = request_parameter[prefixed_feature] | |
elif unprefixed_feature in request_parameter.keys(): | |
feature_value = request_parameter[unprefixed_feature] | |
else: | |
feature_value = rows[prefixed_feature] | |
feature_value = request_parameter.get(prefixed_feature, | |
request_parameter.get(unprefixed_feature, | |
rows.get(prefixed_feature))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced as per suggestion.
…transform is set to true
…set materialization jobs from the python engiine
…d feature returned from a single transformaiton function
…andas dataframe can also be passed to get feature vector
…n after tranformations
This PR adds support for
entries
asrequest_parameters
if they are not explicitly specified in it.transform
andcompute_on_demand_features
.This PR also fixes a few bugs:
transform
was set toFalse
in get_feature_vector`.JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1672
Priority for Review: -
Related PRs: -
How Has This Been Tested?
Checklist For The Assigned Reviewer: