Output label transformer support #8
Conversation
A couple of small suggestions/questions.
@pytest.fixture
def random_seed():
    return 42
This isn't random, lol. I think you should just rename it to test_seed so it's not misleading.
Fair point. It's a seed for random values, and it's usually called this way. It was quite random after I rolled the dice :-D
I can rename it to default_seed, if you like that better?
I would avoid the test_ prefix; it is commonly used as a prefix for tests in Python. Perhaps random_state, or simply pass random_state=42 directly in the code below instead of a fixture.
I renamed it to "default_seed" to avoid "random" in its name.
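For reference, the renamed fixture and one hypothetical way to use it (a minimal sketch; the test function and the random_state usage are assumptions based on the suggestion above):

import pytest
from sklearn.ensemble import RandomForestClassifier

@pytest.fixture
def default_seed():
    # Fixed seed for reproducible tests; named "default_seed" to avoid both
    # the misleading "random" and the pytest-reserved "test_" prefix.
    return 42

def test_model_accepts_seed(default_seed):
    # Hypothetical test: the fixture value is passed as random_state.
    model = RandomForestClassifier(random_state=default_seed)
    assert model.random_state == 42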
            array-like: Predicted class labels in their original form.
        """
        encoded_predictions = self.model.predict(X)
        return self.label_encoder.inverse_transform(encoded_predictions)
Why do we need an inverse_transform here?
The label encoder can encode labels into ids and back. That's necessary since models usually can't operate on labels directly and require a preprocessing step.
That adds inconvenience when using the model: it spits out ids, and we need to make the backward conversion at the end. This PR is all about performing this inverse transform after the prediction is done, so the pipeline can spit out string labels.
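To illustrate the round trip (a minimal sketch with made-up labels, not code from this PR):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["cat", "dog", "cat", "bird"])
print(encoded)   # [1 2 1 0] -- integer ids the model can work with

decoded = le.inverse_transform(encoded)
print(decoded)   # ['cat' 'dog' 'cat' 'bird'] -- back to the original labels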
Is it normal for the transformer to not store the results of fitting or prediction anywhere within its class variables?
The transformer saves the results of fitting inside the LabelEncoder, but since that class is provided for us by scikit-learn, it's not reflected in this PR.
Essentially, the LabelEncoderTransformer is a proxy class that allows us to put a LabelEncoder directly into a scikit-learn pipeline and make it return string labels instead of their ids.
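To make the proxy idea concrete, the wrapper could look roughly like this (a sketch reconstructed from the snippets quoted in this review, not the PR's exact code; __init__ and fit here are assumptions):

from sklearn.base import BaseEstimator, ClassifierMixin

class LabelEncoderTransformer(BaseEstimator, ClassifierMixin):
    """Proxy that decodes the wrapped model's integer predictions back to labels."""

    def __init__(self, model, label_encoder):
        self.model = model                  # the underlying classifier
        self.label_encoder = label_encoder  # a scikit-learn LabelEncoder

    def fit(self, X, y):
        # Encode string labels into ids, then fit the wrapped model on them.
        encoded_y = self.label_encoder.fit_transform(y)
        self.model.fit(X, encoded_y)
        return self

    def predict(self, X):
        # Predict ids and map them back to the original string labels.
        encoded_predictions = self.model.predict(X)
        return self.label_encoder.inverse_transform(encoded_predictions)

    def predict_proba(self, X):
        return self.model.predict_proba(X)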
        encoded_predictions = self.model.predict(X)
        return self.label_encoder.inverse_transform(encoded_predictions)

    def predict_proba(self, X):
Let's add type info in all places; a follow-up PR is fine.
Added types in most places, except "array-like", since it can accept both pandas DataFrames and numpy arrays.
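For example, the signatures could be annotated along these lines (a sketch; the ArrayLike alias is an assumption, one way to cover both input types):

from typing import Union

import numpy as np
import pandas as pd

ArrayLike = Union[np.ndarray, pd.DataFrame]

class LabelEncoderTransformer:
    def predict(self, X: ArrayLike) -> np.ndarray: ...
    def predict_proba(self, X: ArrayLike) -> np.ndarray: ...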
tests/test_sklearn_pipeline.py
Outdated
    predictions = transformer.predict(X)

    # Ensure predictions are in original class labels
    assert all(label in label_encoder.classes_ for label in predictions)
There is only one encoder in this case; can we assert equality against the actual object instead of the loop, so it is clear what is expected?
Replaced it with the actual results; now the output data structure is much more obvious.
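The updated assertion could look roughly like this (a sketch; transformer and X come from the test setup above, and the expected labels are made up for illustration):

import numpy as np

predictions = transformer.predict(X)

# Compare against concrete expected labels instead of looping over classes_.
np.testing.assert_array_equal(
    predictions,
    np.array(["setosa", "versicolor", "virginica"]),  # hypothetical expected output
)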
tests/test_sklearn_pipeline.py
Outdated
    # Ensure probabilities match the expected structure
    for sample_probs in probabilities:
        assert len(sample_probs) == len(label_encoder.classes_)
        for class_prob in sample_probs:
nit: similar to the above comment, can we assert against the raw expected values?
updated as well!
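For the probability check, the same idea with a float tolerance (a sketch with made-up values; transformer and X again come from the test setup):

import numpy as np

probabilities = transformer.predict_proba(X)

# Hypothetical expected array: one row per sample, one column per class,
# compared with a tolerance because the values are floats.
expected = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
])
np.testing.assert_allclose(probabilities, expected, atol=1e-6)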
What changed:
Added a LabelEncoderTransformer proxy class that applies the label encoder's inverse_transform to the wrapped model's predictions, plus tests in tests/test_sklearn_pipeline.py.
Motivation:
Models can't usually operate on string labels directly; they consume and emit encoded ids, so predictions had to be decoded manually after every call. With this change, the pipeline performs the inverse transform itself and returns string labels.
The example trained pipeline:
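A sketch of what such a trained pipeline could look like (assuming the iris dataset and the LabelEncoderTransformer wrapper sketched above; the PR description's actual example is not reproduced here):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

iris = load_iris()
X = iris.data
y = iris.target_names[iris.target]  # string labels, as they appear in raw data

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LabelEncoderTransformer(RandomForestClassifier(random_state=42),
                                      LabelEncoder())),
])
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))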
This pipeline will return categories directly, without having to fetch the encoder from any storage (or extract it from the given pipeline).
Output:
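With the iris-based sketch above, the printed output would be along these lines (illustrative, not a captured run):

['setosa' 'setosa' 'setosa' 'setosa' 'setosa']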