
Output label transformer support #8

Merged
merged 6 commits into from
Dec 12, 2024
Conversation

@Nepherhotep (Collaborator) commented Dec 10, 2024

What changed:

  • Add an output label transformer to allow a scikit-learn pipeline to return the class label instead of the class id.

Motivation:

  • A classification model predicting one of many classes should return the original class labels.

Example of a trained pipeline:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

import numpy as np
import pandas as pd

# Simulate a DataFrame (for illustration)
df = pd.DataFrame({
    'height': np.random.rand(100) * 200,
    'weight': np.random.rand(100) * 100,
    'weapon': np.random.choice(['Sword', 'Bow', 'Magic'], 100),
    'creature': np.random.choice(['Elf', 'Orc', 'Human'], 100)
})

# Separate features and target
X = df[['height', 'weight', 'weapon']]
y = df['creature']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define a preprocessor with StandardScaler for numeric features
# and OneHotEncoder for categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['height', 'weight']),
        ('cat', OneHotEncoder(drop='first'), ['weapon'])
    ]
)

# Define the target encoder
label_encoder = LabelEncoder()
y_train_transformed = label_encoder.fit_transform(y_train)
y_test_transformed = label_encoder.transform(y_test)

# Define the main pipeline with prediction decoding
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LabelEncoderTransformer(
        model=XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
        label_encoder=label_encoder
    ))
])

# Train the model
pipeline.fit(X_train, y_train_transformed)

This pipeline returns categories directly, without having to fetch the encoder from storage (or extract it from the pipeline).

# Make predictions (automatically decoded to original labels)
y_pred = pipeline.predict(X_test)

Output:

array(['Elf', 'Orc', 'Human', 'Elf', 'Elf', 'Orc', 'Elf', 'Human', 'Elf',
       'Orc', 'Human', 'Elf', 'Orc', 'Elf', 'Human', 'Elf', 'Human',
       'Orc', 'Human', 'Elf'], dtype=object)


@Ardhimas Ardhimas left a comment


A couple of small suggestions/questions.


@pytest.fixture
def random_seed():
    return 42


This isn't random lol I think you should just rename it to test_seed so it's not misleading.

Collaborator Author


Fair point. It's a seed for random values; it's usually called that way. It was quite random after I rolled the dice :-D
I can rename it to default_seed, if you like that better?


I would avoid the prefix test_; it is commonly used as a prefix for tests in Python.

Perhaps random_state, or simply pass random_state=42 directly in the code below instead of a fixture.

Collaborator Author


I renamed it to "default_seed" to avoid "random" in its name.

        array-like: Predicted class labels in their original form.
        """
        encoded_predictions = self.model.predict(X)
        return self.label_encoder.inverse_transform(encoded_predictions)


Why do we need an inverse_transform here?

Collaborator Author


The label encoder can encode labels into ids and back. That's necessary since the model usually can't operate with labels directly and requires a preprocessing step.
That adds inconvenience for model use: the model spits out ids, and we need to make a backward conversion at the end. This PR is all about making this inverse transform after the prediction is done, allowing the pipeline to spit out string labels.
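For illustration, the LabelEncoder round-trip described here looks like this (note that scikit-learn assigns ids in sorted class order):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
ids = le.fit_transform(["Elf", "Orc", "Human", "Elf"])
# classes_ are sorted alphabetically: ['Elf', 'Human', 'Orc']
print(ids)                        # [0 2 1 0]
print(le.inverse_transform(ids))  # ['Elf' 'Orc' 'Human' 'Elf']
```

`inverse_transform` is exactly the backward conversion the wrapper performs after `model.predict`.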


Is it normal for the transformer not to store the results of fitting or prediction anywhere within its class variables?

Collaborator Author


The transformer saves the results of fitting inside the LabelEncoder, but since that class is provided by scikit-learn, it's not reflected in this PR.
Essentially, LabelEncoderTransformer is a proxy class that allows us to put a LabelEncoder directly into a scikit-learn pipeline and make it return string labels instead of their ids.

        encoded_predictions = self.model.predict(X)
        return self.label_encoder.inverse_transform(encoded_predictions)

    def predict_proba(self, X):


Let's add type info in all places; a follow-up PR is fine.

Collaborator Author


Added types in most places, except "array-like", since it can accept both pandas DataFrames and numpy arrays.
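One way such annotations could look, using a type alias for "array-like" (the alias and class names here are hypothetical, not the PR's code):

```python
from typing import Union

import numpy as np
import pandas as pd

# "array-like": either a pandas DataFrame or a numpy array
ArrayLike = Union[pd.DataFrame, np.ndarray]


class Annotated:
    """Hypothetical shape of the annotated prediction methods."""

    def predict(self, X: ArrayLike) -> np.ndarray:
        ...

    def predict_proba(self, X: ArrayLike) -> np.ndarray:
        ...
```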

predictions = transformer.predict(X)

# Ensure predictions are in original class labels
assert all(label in label_encoder.classes_ for label in predictions)


There is only one encoder in this case; can we assert equality against the actual object instead of the loop, so it is clear what is expected?

Collaborator Author


Replaced it with the actual results; now the output data structure is much more obvious.
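An equality-style assertion of the shape discussed here, with hypothetical label data, could look like:

```python
import numpy as np

# Hypothetical predictions and expected labels to illustrate the assertion style
predictions = np.array(["Elf", "Orc", "Human"], dtype=object)
expected = np.array(["Elf", "Orc", "Human"], dtype=object)

# One equality check instead of a per-element membership loop
np.testing.assert_array_equal(predictions, expected)
```

A failed `assert_array_equal` reports the mismatching elements, which makes the expected output structure explicit in the test.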

# Ensure probabilities match the expected structure
for sample_probs in probabilities:
    assert len(sample_probs) == len(label_encoder.classes_)
    for class_prob in sample_probs:


nit: similar to the above comment, can we assert against the raw expected value?

Collaborator Author


updated as well!
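A raw-value assertion for the probability structure, with hypothetical numbers, might look like:

```python
import numpy as np

# Hypothetical probability matrix: rows are samples, columns follow classes_
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1]])
expected = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])

# Compare against the raw expected values in one call,
# replacing the nested per-element loop
np.testing.assert_allclose(probabilities, expected, atol=1e-6)
```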

orient_express/sklearn_pipeline.py (review thread resolved)
@Nepherhotep Nepherhotep merged commit 0b0248b into main Dec 12, 2024
2 checks passed

3 participants