
Output label transformer support #8

Merged
merged 6 commits into from
Dec 12, 2024
Conversation

@Nepherhotep (Collaborator) commented Dec 10, 2024

What changed:

  • Add an output label transformer to allow a scikit-learn pipeline to return the class label instead of the class id.

Motivation:

  • A classification model predicting one of many classes should return the original class labels.

Example of a trained pipeline:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

import numpy as np
import pandas as pd

# Simulate a DataFrame (for illustration)
df = pd.DataFrame({
    'height': np.random.rand(100) * 200,
    'weight': np.random.rand(100) * 100,
    'weapon': np.random.choice(['Sword', 'Bow', 'Magic'], 100),
    'creature': np.random.choice(['Elf', 'Orc', 'Human'], 100)
})

# Separate features and target
X = df[['height', 'weight', 'weapon']]
y = df['creature']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define a preprocessor with StandardScaler for numeric features
# and OneHotEncoder for categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['height', 'weight']),
        ('cat', OneHotEncoder(drop='first'), ['weapon'])
    ]
)

# Define the target encoder
label_encoder = LabelEncoder()
y_train_transformed = label_encoder.fit_transform(y_train)
y_test_transformed = label_encoder.transform(y_test)

# Define the main pipeline with prediction decoding
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LabelEncoderTransformer(
        model=XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
        label_encoder=label_encoder
    ))
])

# Train the model
pipeline.fit(X_train, y_train_transformed)

This pipeline returns categories directly, without having to fetch the encoder from storage (or extract it from the pipeline).

# Make predictions (automatically decoded to original labels)
y_pred = pipeline.predict(X_test)

Output:

array(['Elf', 'Orc', 'Human', 'Elf', 'Elf', 'Orc', 'Elf', 'Human', 'Elf',
       'Orc', 'Human', 'Elf', 'Orc', 'Elf', 'Human', 'Elf', 'Human',
       'Orc', 'Human', 'Elf'], dtype=object)


@Ardhimas Ardhimas left a comment


A couple of small suggestions/questions.


@pytest.fixture
def random_seed():
    return 42


This isn't random lol I think you should just rename it to test_seed so it's not misleading.

Collaborator Author


Fair point. It's a seed for random values; it's usually called that way. It was quite random after I rolled the dice :-D
I can rename it to default_seed, if you like that better?


I would avoid the prefix test_; it is commonly used as a prefix for tests in Python.

Perhaps random_state, or simply pass random_state=42 directly in the code below instead of a fixture.

Collaborator Author


I renamed it to "default_seed" to avoid "random" in its name.

        array-like: Predicted class labels in their original form.
        """
        encoded_predictions = self.model.predict(X)
        return self.label_encoder.inverse_transform(encoded_predictions)


Why do we need an inverse_transform here?

Collaborator Author


The label encoder can encode labels into ids and back. That's necessary since the model usually can't operate with labels directly and requires a preprocessing step.
That adds inconvenience for model use: the model spits out ids, and we need to make a backward conversion at the end. This PR is all about making this inverse transform after the prediction is done, allowing the pipeline to spit out string labels.
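For illustration, the LabelEncoder round-trip described here looks like this (note that scikit-learn assigns ids in sorted class order):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
ids = le.fit_transform(["Elf", "Orc", "Human", "Elf"])
# classes_ are sorted alphabetically: ['Elf', 'Human', 'Orc']
print(ids)                        # [0 2 1 0]
print(le.inverse_transform(ids))  # ['Elf' 'Orc' 'Human' 'Elf']
```

`inverse_transform` is exactly the backward conversion the wrapper performs after `model.predict`.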


Is it normal for the transformer not to store the results of fitting or prediction anywhere within its class variables?

Collaborator Author


The transformer saves the results of fitting inside the LabelEncoder, but since that class is provided by scikit-learn, it's not reflected in this PR.
Essentially, LabelEncoderTransformer is a proxy class that allows us to put a LabelEncoder directly into a scikit-learn pipeline and make it return string labels instead of their ids.

        encoded_predictions = self.model.predict(X)
        return self.label_encoder.inverse_transform(encoded_predictions)

    def predict_proba(self, X):


Let's add type info in all places; a follow-up PR is fine.

Collaborator Author


Added types in most places, except "array-like", since it can accept both pandas DataFrames and numpy arrays.
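One way such annotations could look, using a type alias for "array-like" (the alias and class names here are hypothetical, not the PR's code):

```python
from typing import Union

import numpy as np
import pandas as pd

# "array-like": either a pandas DataFrame or a numpy array
ArrayLike = Union[pd.DataFrame, np.ndarray]


class Annotated:
    """Hypothetical shape of the annotated prediction methods."""

    def predict(self, X: ArrayLike) -> np.ndarray:
        ...

    def predict_proba(self, X: ArrayLike) -> np.ndarray:
        ...
```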

predictions = transformer.predict(X)

# Ensure predictions are in original class labels
assert all(label in label_encoder.classes_ for label in predictions)


There is only one encoder in this case; can we assert equality against the actual object instead of the loop, so it is clear what is expected?

Collaborator Author


Replaced it with the actual results; now the output data structure is much more obvious.
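An equality-style assertion of the shape discussed here, with hypothetical label data, could look like:

```python
import numpy as np

# Hypothetical predictions and expected labels to illustrate the assertion style
predictions = np.array(["Elf", "Orc", "Human"], dtype=object)
expected = np.array(["Elf", "Orc", "Human"], dtype=object)

# One equality check instead of a per-element membership loop
np.testing.assert_array_equal(predictions, expected)
```

A failed `assert_array_equal` reports the mismatching elements, which makes the expected output structure explicit in the test.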

# Ensure probabilities match the expected structure
for sample_probs in probabilities:
    assert len(sample_probs) == len(label_encoder.classes_)
    for class_prob in sample_probs:


nit: similar to the above comment, can we assert against the raw expected value?

Collaborator Author


updated as well!
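A raw-value assertion for the probability structure, with hypothetical numbers, might look like:

```python
import numpy as np

# Hypothetical probability matrix: rows are samples, columns follow classes_
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1]])
expected = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])

# Compare against the raw expected values in one call,
# replacing the nested per-element loop
np.testing.assert_allclose(probabilities, expected, atol=1e-6)
```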

orient_express/sklearn_pipeline.py (review thread resolved)
@Nepherhotep Nepherhotep merged commit 0b0248b into main Dec 12, 2024
2 checks passed

3 participants