Exporting/saving/reusing the reweighting formula #33

bifani · 2016-05-11T17:31:31Z

Sometimes one would like to use a control sample (e.g. because it is more abundant) to determine MC weights that are then applied to other, e.g. rarer, samples.

For this reason it would be very useful if hep_ml.reweight could export the "reweighting formula" in some format, e.g. ROOT, so that it can also be reused from other programming languages.

Thanks

arogozhnikov · 2016-05-12T13:31:35Z

This is a frequent question (or family of questions) from physicists who want to apply an already trained reweighter to one more data sample. Below I give solutions for different situations.

Working from the same script

A frequently applicable solution, which physicists for some reason tend to ignore (ROOT influence?), is to read the additional file inside the same script/notebook and apply the reweighter there.

You can then store the weights column using the recipe from this issue.

When you need to store formula

Possible reasons:

data is not available
need to transfer formula to different machine
keep formula for future comparison / reproducing results.

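You can use pickle (cPickle on Python 2). It works as follows:

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# saving the formula (open pickle files in binary mode)
with open('reweighter.pkl', 'wb') as f:
    pickle.dump(reweighter, f)

# loading the formula
with open('reweighter.pkl', 'rb') as f:
    reweighter = pickle.load(f)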
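After reloading, it is worth checking that the restored object produces the same weights as the original. The following is a minimal sketch of that round-trip check. It uses a plain dict of per-bin weights as a hypothetical stand-in for a fitted reweighter, so `formula` and the helper `predict_weights` here are illustrative names and not part of hep_ml.

```python
import bisect
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted reweighter: per-bin weights of one variable.
# Three edges define four bins (including underflow and overflow).
formula = {'edges': [0.0, 1.0, 2.0], 'weights': [0.5, 1.0, 1.5, 2.0]}


def predict_weights(formula, values):
    """Look up the weight of the bin each value falls into (toy helper)."""
    return [formula['weights'][bisect.bisect_right(formula['edges'], v)]
            for v in values]


sample = [-0.3, 0.2, 1.7, 5.0]
weights_before = predict_weights(formula, sample)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'reweighter.pkl')
    # Save in binary mode, then load it back.
    with open(path, 'wb') as f:
        pickle.dump(formula, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored object must give exactly the same weights.
weights_after = predict_weights(restored, sample)
assert weights_after == weights_before
```

With a real reweighter object, keep in mind that pickle stores a reference to the class rather than its code, so hep_ml has to be importable on the machine that loads the file.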

Exporting to TMVA

(needed when the formula has to be built into some production script / experiment software)

When applying the formula, the reweighter is not much different from a simple gradient boosting / random forest (see how predict_weights works).

hep_ml uses its own BDT, but it is easily converted from/to sklearn.

There are solutions that convert sklearn's trees to TMVA format: koza4ok and sklearn-pmml.

Warning: I haven't tried any of those, since I am not using TMVA, so I expect many caveats along the way. If someone has tried and succeeded with exporting to TMVA, let me know.

bifani · 2016-05-12T17:02:35Z

Hi Alex,

thanks a lot for the quick feedback!
cPickle looks like what I need, I'll give this a go

Regards,
s.

kpedro88 · 2018-10-03T19:40:53Z

I have a question about converting from hep_ml BDTs to sklearn BDTs. I am trying to use the "exporting to TMVA" method via koza4ok, and it works with a few tweaks:

clf = classifiers['uGBFL']
clf.loss_ = clf.loss
clf.loss_.K = 1
# dtype=object replaces np.object, which was removed in NumPy 1.24
clf.estimators_ = np.empty((clf.n_estimators, clf.loss_.K), dtype=object)
for i, est in enumerate(clf.estimators):
    clf.estimators_[i] = est[0]

However, I am not sure the last line gives the correct output. In UGradientBoostingClassifier, the estimators member is a list of [tree, leaf_values]. The leaf_values first come from the tree, but then get updated:

hep_ml/hep_ml/gradientboosting.py

Lines 136 to 144 in 41e97d5

# update tree leaves
leaf_values = tree.get_leaf_values()
if self.update_tree:
    terminal_regions = tree.transform(X)
    leaf_values = self.loss.prepare_new_leaves_values(terminal_regions, leaf_values=leaf_values,
                                                      y_pred=y_pred)
y_pred += self.learning_rate * self._estimate_tree(tree, leaf_values=leaf_values, X=X)
self.estimators.append([tree, leaf_values])

At the end, get_leaf_values() returns a different array than the leaf_values stored in the estimators list:

>>> print classifiers['uGBFL'].estimators[0][0].get_leaf_values()
[ 0.01252273 -1.72148748 -2.77744433 -1.07583091  0.29113487  0.16071584
  0.05392691  1.75249969  2.29887652]
>>> print classifiers['uGBFL'].estimators[0][1]                  
[ 0.          0.         -2.6523975  -1.15883605  0.          0.
  0.08844491  1.44762732  2.12097526]

Should I export the array from get_leaf_values(), or use the leaf_values from the list?

arogozhnikov · 2018-10-05T21:43:02Z

Hi @kpedro88
Your analysis is correct: only the leaf id predicted by the tree matters, not the leaf values stored inside it. The leaf values that are actually used are the ones stored separately, in the (tree, leaf_values) pairs. So the leaf values stored inside the tree are ignored completely.

For conversion, almost surely you'll need to do the following (not tested, maybe needs corrections):

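import copy

# estimators: the list of [tree, leaf_values] pairs of the classifier
for tree, leaf_values in estimators:
    new_tree = copy.deepcopy(tree)
    # tree_.value has one row per node, so leaf_values needs one entry per node
    assert new_tree.tree_.value.shape == (len(leaf_values), 1, 1)
    new_tree.tree_.value[:, 0, 0] = leaf_values
    # <save new tree to the ensemble>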

Don't forget to verify you get the same predictions before / after conversion
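To make the before/after check concrete, here is a minimal pure-Python sketch of the idea. It uses a toy one-split tree stored as parallel per-node lists, which is only a stand-in and not the hep_ml or sklearn tree class. The helper names `leaf_id`, `predict_pair` and `predict_tree` are hypothetical. The sketch assumes, as in the output shown earlier in this thread, that leaf_values is indexed by node id.

```python
import copy

# Toy stand-in for a fitted tree, laid out as parallel per-node lists.
# Node 0 is the root; nodes 1 and 2 are leaves (child index -1).
tree = {
    'threshold': [0.5, None, None],
    'left': [1, -1, -1],
    'right': [2, -1, -1],
    'value': [0.0, -1.0, 1.0],  # values stored inside the tree
}


def leaf_id(tree, x):
    """Return the id of the leaf node that x falls into."""
    node = 0
    while tree['left'][node] != -1:
        if x <= tree['threshold'][node]:
            node = tree['left'][node]
        else:
            node = tree['right'][node]
    return node


def predict_pair(tree, leaf_values, x):
    """Prediction of a (tree, leaf_values) pair: the tree only supplies the leaf id."""
    return leaf_values[leaf_id(tree, x)]


def predict_tree(tree, x):
    """Prediction of a standalone tree, using the values stored inside it."""
    return tree['value'][leaf_id(tree, x)]


# Updated leaf values kept next to the tree, indexed by node id.
leaf_values = [0.0, -2.5, 1.75]

# Conversion: copy the tree and overwrite its internal values.
new_tree = copy.deepcopy(tree)
assert len(new_tree['value']) == len(leaf_values)
new_tree['value'] = list(leaf_values)

# Verification: the converted tree must reproduce the pair's predictions.
sample = [0.1, 0.5, 0.9]
before = [predict_pair(tree, leaf_values, x) for x in sample]
after = [predict_tree(new_tree, x) for x in sample]
assert before == after
```

Exporting the unconverted `tree` instead would silently use the stale internal values, which is why the comparison should be run on real data after any conversion.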

arogozhnikov added the question label May 12, 2016

arogozhnikov changed the title ~~Exporting the reweighting formula~~ Exporting/saving/reusing the reweighting formula May 12, 2016
