
Improvements to OrdinalEncoder, OneHotEncoder, NaiveBayes, LogisticRegression #293

Merged
merged 7 commits into elixir-nx:main from one-hot-encoder-fix
Aug 1, 2024

Conversation

krstopro
Member

@krstopro krstopro commented Jul 31, 2024

Changelog:

  • Simplified the OrdinalEncoder struct; it now contains only the :categories field.
  • Fixed a bug in OrdinalEncoder.fit/2; it now works with tensors of size 1.
  • Removed opts as the second argument of OrdinalEncoder.fit_transform, as it is not needed.
  • Simplified the OneHotEncoder struct; it now contains only the :ordinal_encoder field.
  • Renamed the :num_classes option to :num_categories, as these encoders can be used to encode features, not only target variables.
  • Added argument validation.
  • Removed ordinal encoding from NaiveBayes.Complement and LogisticRegression.

Fixes #290.

Contributor

@josevalim josevalim left a comment

LGTM! Perhaps we just need to remove the commented out code. :)

@krstopro
Member Author

krstopro commented Aug 1, 2024

> LGTM! Perhaps we just need to remove the commented out code. :)

Yeah, agreed. :)

I would honestly replace the following line https://github.com/elixir-nx/scholar/blob/main/lib/scholar/preprocessing.ex#L179
with

tensor
|> Nx.new_axis(1)
|> Nx.broadcast({num_samples, num_categories})
|> Nx.equal(Nx.iota({num_samples, num_categories}, axis: 1))

i.e. remove the ordinal encoding that is performed as part of OneHotEncoder and assume tensor contains values between 0 and num_categories - 1. These few lines are used over and over again in Scholar (e.g. NaiveBayes, LogisticRegression, etc.).
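For illustration, here is a minimal self-contained sketch of that broadcast-and-compare trick. The input values and category count below are made-up example data, not taken from the PR; the only assumption, as stated above, is that the tensor already contains values between 0 and num_categories - 1.

```elixir
# One-hot encode integer labels by comparing each value against a row of
# column indices produced by Nx.iota/2.
tensor = Nx.tensor([0, 2, 1])        # example input, values in 0..2
{num_samples} = Nx.shape(tensor)     # {3}
num_categories = 3

one_hot =
  tensor
  |> Nx.new_axis(1)                                             # shape {3, 1}
  |> Nx.equal(Nx.iota({num_samples, num_categories}, axis: 1))  # broadcasts to {3, 3}

# one_hot =>
# [[1, 0, 0],
#  [0, 0, 1],
#  [0, 1, 0]]
```

Since Nx.equal/2 broadcasts its arguments, the explicit Nx.broadcast/2 step in the suggested snippet can also be omitted, as done here.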

@josevalim
Contributor

> I would honestly replace the following line https://github.com/elixir-nx/scholar/blob/main/lib/scholar/preprocessing.ex#L179

I think the preprocessing function should be the same as the module that we invoke (we may remove the functions altogether). But I agree we should probably remove the ordinal encoding from one hot encoding preprocessor module.

@krstopro
Member Author

krstopro commented Aug 1, 2024

> > I would honestly replace the following line https://github.com/elixir-nx/scholar/blob/main/lib/scholar/preprocessing.ex#L179
>
> I think the preprocessing function should be the same as the module that we invoke (we may remove the functions altogether). But I agree we should probably remove the ordinal encoding from one hot encoding preprocessor module.

Very well, we can have a separate pull request for that.

@josevalim
Contributor

Definitely, so it all looks good to me!

@krstopro krstopro merged commit 7050d32 into elixir-nx:main Aug 1, 2024
0 of 2 checks passed
@krstopro krstopro deleted the one-hot-encoder-fix branch August 1, 2024 22:16
Successfully merging this pull request may close these issues.

one_hot_encode errors with tensor of size 1