
Incorrect uBoostClassifier.predict_proba() outputs #87

Open
ehhov opened this issue Oct 14, 2024 · 1 comment
ehhov commented Oct 14, 2024

Dear developers,

I'm trying to use uBoost as an alternative to a standard BDT: a classifier to which one can add more input variables while maintaining uniformity of the response over a specific variable (the invariant mass).

I'm calling uBoostClassifier.fit() and uBoostClassifier.predict_proba(), which are intuitively named and well described in the documentation, but the results are far from what I expect. The background probability is distributed around 0.5, and the signal probability is flat (flat over the response itself, not a response that is flat over the invariant mass).

Digging into the source code of hep_ml/uboost.py, I find that predict_proba() computes these probabilities oddly: it sums the outputs of uBoostBDT._uboost_predict_score() and then transforms the sum into a probability with a sigmoid, via score_to_proba() in hep_ml/commonutils.py.

hep_ml/hep_ml/uboost.py

Lines 532 to 540 in 442a321

def predict_proba(self, X):
"""Predict probabilities
:param X: data, pandas.DataFrame of shape [n_samples, n_features]
:return: array of shape [n_samples, n_classes] with probabilities.
"""
X = self._get_train_features(X)
score = sum(clf._uboost_predict_score(X) for clf in self.classifiers)
return commonutils.score_to_proba(score / self.efficiency_steps)
The issue is that uBoostBDT._uboost_predict_score() doesn't return the raw score: it first passes it through sigmoid_function() from hep_ml/commonutils.py.

hep_ml/hep_ml/uboost.py

Lines 363 to 366 in 442a321

def _uboost_predict_score(self, X):
"""Method added specially for uBoostClassifier"""
return sigmoid_function(self.decision_function(X) - self.score_cut,
self.smoothing)
As a result, the values being summed are all positive (they are outputs of a sigmoid function). When their average is fed to the second sigmoid, the resulting probability can only be larger than 0.5.
For background, the individual first-sigmoid outputs are near zero, so the second sigmoid gives about expit(0) = 0.5.
For signal, the individual outputs are close to one, so their sum equals the number of sigmoids, the average is about one, and the result is about expit(1) ≈ 0.73.
Both values are far from the 0 and 1 that signal probabilities should approach.

This behavior looks quite incorrect, but I don't even know how to fix it properly, at which step.
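A minimal numeric sketch of the squeeze described above, assuming score_to_proba() acts as a plain logistic sigmoid (consistent with the expit values quoted above); the scores are made-up stand-ins for per-BDT _uboost_predict_score() outputs:

```python
import math

def expit(x):
    """Logistic sigmoid, matching scipy.special.expit."""
    return 1.0 / (1.0 + math.exp(-x))

efficiency_steps = 20
# Hypothetical per-classifier scores, already passed through the first
# sigmoid in _uboost_predict_score(): background near 0, signal near 1.
background_scores = [0.0] * efficiency_steps
signal_scores = [1.0] * efficiency_steps

# What predict_proba() effectively computes: a second sigmoid of the average.
p_bck = expit(sum(background_scores) / efficiency_steps)  # expit(0) = 0.5
p_sig = expit(sum(signal_scores) / efficiency_steps)      # expit(1) ≈ 0.731

# The outputs are confined to roughly [0.5, 0.73] instead of [0, 1].
print(p_bck, p_sig)
```

However extreme the classifiers' opinions are, the reported probabilities never leave this narrow band.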

arogozhnikov (Owner) commented

Hi Kerim,

thanks for looking. The sigmoid should be applied only once, before averaging; i.e. this call

return commonutils.score_to_proba(score / self.efficiency_steps)

should be replaced with just the averaging, then result[:, 1] = x, result[:, 0] = 1 - x. (A PR that fixes this is welcome.)
Other than the probability calibration, this won't change the results (e.g. flatness).

Comment: the sigmoid in uboost.py#365 is there to utilize the predictions of the individual uBoostBDTs more efficiently.
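A sketch of the averaging fix suggested above. The function name predict_proba_fixed and its input layout are illustrative, not the library's API: it takes precomputed per-BDT sigmoid scores (each row one uBoostBDT's _uboost_predict_score() output, already in (0, 1)) and averages them without a second sigmoid:

```python
import numpy as np

def predict_proba_fixed(per_bdt_scores):
    """Average already-sigmoided per-BDT scores into class probabilities.

    per_bdt_scores: array of shape [efficiency_steps, n_samples].
    Returns an array of shape [n_samples, 2] in scikit-learn's convention.
    """
    x = np.mean(per_bdt_scores, axis=0)  # average; no second sigmoid applied
    result = np.empty((x.shape[0], 2))
    result[:, 1] = x                     # P(signal): mean of sigmoids, in [0, 1]
    result[:, 0] = 1.0 - x               # P(background)
    return result

# Toy check with 3 BDTs and 2 events: a clear signal event (all sigmoids
# near 1) and a clear background event (all sigmoids near 0) now map close
# to 1 and 0 respectively, instead of being squeezed into [0.5, 0.73].
scores = np.array([[0.98, 0.02],
                   [0.95, 0.05],
                   [0.99, 0.01]])
proba = predict_proba_fixed(scores)
print(proba)
```

Each row of the returned array sums to 1, so downstream code expecting the usual predict_proba() contract keeps working.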
