
Incorrect uBoostClassifier.predict_proba() outputs #87

Open
ehhov opened this issue Oct 14, 2024 · 1 comment
ehhov commented Oct 14, 2024

Dear developers,

I'm trying to use uBoost as an alternative to a standard BDT: a classifier to which one can add more input variables while maintaining uniformity of the response over a specific variable (the invariant mass).

I'm calling uBoostClassifier.fit() and uBoostClassifier.predict_proba(), which are intuitively named and well described in the documentation, but the results are far from what I expect. The background probability is distributed around 0.5, and the signal probability is flat (flat over the response itself, not a response that is flat over the invariant mass).

Digging into the source code of hep_ml/uboost.py, I find that predict_proba() computes these probabilities oddly: it sums the outputs of uBoostBDT._uboost_predict_score() and then transforms the sum into a probability with a sigmoid, via score_to_proba() in hep_ml/commonutils.py.

hep_ml/hep_ml/uboost.py

Lines 532 to 540 in 442a321

def predict_proba(self, X):
"""Predict probabilities
:param X: data, pandas.DataFrame of shape [n_samples, n_features]
:return: array of shape [n_samples, n_classes] with probabilities.
"""
X = self._get_train_features(X)
score = sum(clf._uboost_predict_score(X) for clf in self.classifiers)
return commonutils.score_to_proba(score / self.efficiency_steps)
The issue is that uBoostBDT._uboost_predict_score() doesn't return the raw score: it first passes it through sigmoid_function() from hep_ml/commonutils.py.

hep_ml/hep_ml/uboost.py

Lines 363 to 366 in 442a321

def _uboost_predict_score(self, X):
"""Method added specially for uBoostClassifier"""
return sigmoid_function(self.decision_function(X) - self.score_cut,
self.smoothing)
As a result, the values being summed are all positive (they are outputs of a sigmoid function). When their average is fed to the second sigmoid, the resulting probability can only be larger than 0.5.
For background, the individual first-sigmoid outputs are near zero, so the second sigmoid gives about expit(0) = 0.5.
For signal, the individual outputs are close to one, so their sum equals the number of sigmoids, the average is about one, and the result is about expit(1) ≈ 0.73.
Both values are far from the 0 and 1 that signal probabilities should approach.

This behavior looks quite incorrect, but I don't even know how to fix it properly, at which step.
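A minimal numeric sketch of the squeeze described above, assuming score_to_proba() acts as a plain logistic sigmoid (consistent with the expit values quoted above); the scores are made-up stand-ins for per-BDT _uboost_predict_score() outputs:

```python
import math

def expit(x):
    """Logistic sigmoid, matching scipy.special.expit."""
    return 1.0 / (1.0 + math.exp(-x))

efficiency_steps = 20
# Hypothetical per-classifier scores, already passed through the first
# sigmoid in _uboost_predict_score(): background near 0, signal near 1.
background_scores = [0.0] * efficiency_steps
signal_scores = [1.0] * efficiency_steps

# What predict_proba() effectively computes: a second sigmoid of the average.
p_bck = expit(sum(background_scores) / efficiency_steps)  # expit(0) = 0.5
p_sig = expit(sum(signal_scores) / efficiency_steps)      # expit(1) ≈ 0.731

# The outputs are confined to roughly [0.5, 0.73] instead of [0, 1].
print(p_bck, p_sig)
```

However extreme the classifiers' opinions are, the reported probabilities never leave this narrow band.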

arogozhnikov (Owner) commented

Hi Kerim,

thanks for looking. The sigmoid should be applied only once, before averaging; i.e. this call

return commonutils.score_to_proba(score / self.efficiency_steps)

should be replaced with just the averaging, then result[:, 1] = x, result[:, 0] = 1 - x. (A PR that fixes this is welcome.)
Other than the probability calibration, this won't change the results (e.g. flatness).

Comment: the sigmoid in uboost.py#365 is there to utilize the predictions of the individual uBoostBDTs more efficiently.
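A sketch of the averaging fix suggested above. The function name predict_proba_fixed and its input layout are illustrative, not the library's API: it takes precomputed per-BDT sigmoid scores (each row one uBoostBDT's _uboost_predict_score() output, already in (0, 1)) and averages them without a second sigmoid:

```python
import numpy as np

def predict_proba_fixed(per_bdt_scores):
    """Average already-sigmoided per-BDT scores into class probabilities.

    per_bdt_scores: array of shape [efficiency_steps, n_samples].
    Returns an array of shape [n_samples, 2] in scikit-learn's convention.
    """
    x = np.mean(per_bdt_scores, axis=0)  # average; no second sigmoid applied
    result = np.empty((x.shape[0], 2))
    result[:, 1] = x                     # P(signal): mean of sigmoids, in [0, 1]
    result[:, 0] = 1.0 - x               # P(background)
    return result

# Toy check with 3 BDTs and 2 events: a clear signal event (all sigmoids
# near 1) and a clear background event (all sigmoids near 0) now map close
# to 1 and 0 respectively, instead of being squeezed into [0.5, 0.73].
scores = np.array([[0.98, 0.02],
                   [0.95, 0.05],
                   [0.99, 0.01]])
proba = predict_proba_fixed(scores)
print(proba)
```

Each row of the returned array sums to 1, so downstream code expecting the usual predict_proba() contract keeps working.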
