Bootstrapping: should it be at the alchemlyb level, or pymbar level? #270

mrshirts · 2019-05-08T20:54:22Z

mrshirts
May 8, 2019
Maintainer

Should bootstrapping be implemented at the alchemlyb level, or pymbar level? For MBAR, it would be better at the pymbar level, since it can be easily encapsulated (user doesn't have to worry about it), and one can request either uncertainty estimate.

For BAR over several states, then the bootstrapping needs to be done at the level ABOVE the BAR call, since we need to bootstrap all of the data simultaneously before feeding it into BAR. Same for EXP applied to a string of states.

Thoughts?

orbeckst · 2019-05-10T03:58:27Z

orbeckst
May 10, 2019
Maintainer

If there's a common function that we can use for "everything" (and just plugin the estimator) then alchemlyb would be a good place, I think – along the lines that we only have to write and test the code once.

I generally like building blocks that I can freely combine. Something like

def bootstrapped(data, estimator):
     ...
     return mean, error

Alternatively, we could hide the machinery in the Estimator.fit() function: enable it with fit(..., ..., bootstrap=True) and just do under the hood whatever needs to be done and set the appropriate output attributes. In this case, one might just use a pymbar bootstrap directly if available or roll our own for other estimators.

One advantage of doing it at the alchemlyb level is that it might not be difficult to run alchemlyb with dask (essentially, use the dask.DataFrame) and then the bootstrapping can be parallelized without effort. A while ago @dotsdl played around with alchemlyb and dask – I can't quite remember how much would need to be changed.

0 replies

mrshirts · 2019-05-10T04:44:42Z

mrshirts
May 10, 2019
Maintainer Author

def bootstrapped(data, estimator):

That might be a good idea. It could return a dictionary of all the bootstrapped results, along with the uncertainty estimate. I'll think about how to organize this.

One issue is that the data will look different with each estimator, thus requiring fairly different conditionals inside the bootstrapped data. Also, if one was analyzing K states, calculating the free energy with BAR executed pairwise, one would want to bootstrap over the entire data set of K states; i.e. you would need to bootstrap the entire procedure, not over a single estimator.

0 replies

dotsdl · 2019-07-24T20:51:48Z

dotsdl
Jul 24, 2019
Maintainer

@mrshirts do you have a paper or writeup you can point to for this approach? I'd be happy to prototype something. We may be able to steal design inspiration from scikit-learn such as their GridSearchCV and Pipeline objects.

0 replies

mrshirts · 2019-07-27T07:43:52Z

mrshirts
Jul 27, 2019
Maintainer Author

So, I don't really have a good simple paper. http://www.alchemistry.org/wiki/Analyzing_Simulation_Results#Bootstrap_Sampling is a good summary.

I agree that something like sklearn that carries out some algorithm given an input function. In the case of bootstrapping, the metamethod takes some function of N arrays, each of which is a a set of samples from an independent distribution (or independent draw of the same distribution). Say, for example, N sets of samples, one at each lambda state; for each bootstrap you would want to bootstrap sample all N of the samples each time. (i.e. always bootstrap over ALL input data before any analysis).

After the bootstrap sampling with replacement, everything else is pretty trivial. You calculate your function on each of the bootstrapped. You then have a set of resullts (could be multivalue return, and you can simply return a list of all the answers. You can optionally return various statistical measures of this list for each of the results - mean, standard deviation, confidence intervals.

One could make decorrelation of the data sets part of the algorithm, but it would perhaps be more modular to do the decorrelation as a separate step.

0 replies

wildromi · 2020-01-17T16:23:53Z

wildromi
Jan 17, 2020

Dear alchemlyb team!
It is quite awesome to have this library. We will use it in a publication I am currently preparing. For this it is crucial to have an error measure, and a built in function that carries out bootstrapping and returns the bootstrap standard deviation would be the best. Maybe one could also access the bootstrap data sets used to calculate st dev, but that is not too important. I cannot code but gladly test anything you might have. Thank you so much!
Romina

0 replies

dotsdl · 2020-01-18T01:19:38Z

dotsdl
Jan 18, 2020
Maintainer

Hey all, after discussions with @wildromi, I've committed to working on this issue over the next two weeks. I expect the first iteration to be usable but probably not the approach we end up with. I'll post a WIP PR as soon as I can.

0 replies

mrshirts · 2020-01-18T03:03:02Z

mrshirts
Jan 18, 2020
Maintainer Author

Hi, David- I'd love to talk some more about this, as I've been dealing with similar setups for a while. Shoot me an email at the CU email and we can strategize some more? A key is bootstrapping simultaneously over multiple time series, for example.

0 replies

dotsdl · 2020-01-18T05:43:53Z

dotsdl
Jan 18, 2020
Maintainer

@mrshirts sent! I'm looking forward to leveraging your experience to jumpstart the approach.

0 replies

mrshirts · 2020-01-19T20:28:50Z

mrshirts
Jan 19, 2020
Maintainer Author

@dotsdl: take a look at https://github.com/choderalab/pymbar/blob/pmf/pymbar/pmf.py and look at lines 590 to 615 to get a sense at how bootstrapping works in a complicated case (in this case, calculating a potential of mean force)

0 replies

dotsdl · 2020-01-21T07:12:18Z

dotsdl
Jan 21, 2020
Maintainer

I met with @mrshirts yesterday, and we aligned on an approach. I have started a WIP PR on #94. There is a list of things to do yet, but we have the start of our implementation. You can check out how things work so far in this gist. Comments welcome!

Please don't use this in production work yet until we have tests ensuring that bootstrap is working as expected. It appears to work from my ad-hoc tests, so at most give it a try on the feature-bootstrapping branch.

0 replies

dotsdl · 2020-01-29T06:01:11Z

dotsdl
Jan 29, 2020
Maintainer

The gist for #94 has been updated; it requires components of #98, which can be played with on this branch.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bootstrapping: should it be at the alchemlyb level, or pymbar level? #270

{{title}}

Replies: 11 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Bootstrapping: should it be at the alchemlyb level, or pymbar level? #270

mrshirts May 8, 2019 Maintainer

Replies: 11 comments

orbeckst May 10, 2019 Maintainer

mrshirts May 10, 2019 Maintainer Author

dotsdl Jul 24, 2019 Maintainer

mrshirts Jul 27, 2019 Maintainer Author

wildromi Jan 17, 2020

dotsdl Jan 18, 2020 Maintainer

mrshirts Jan 18, 2020 Maintainer Author

dotsdl Jan 18, 2020 Maintainer

mrshirts Jan 19, 2020 Maintainer Author

dotsdl Jan 21, 2020 Maintainer

dotsdl Jan 29, 2020 Maintainer

mrshirts
May 8, 2019
Maintainer

orbeckst
May 10, 2019
Maintainer

mrshirts
May 10, 2019
Maintainer Author

dotsdl
Jul 24, 2019
Maintainer

mrshirts
Jul 27, 2019
Maintainer Author

wildromi
Jan 17, 2020

dotsdl
Jan 18, 2020
Maintainer

mrshirts
Jan 18, 2020
Maintainer Author

dotsdl
Jan 18, 2020
Maintainer

mrshirts
Jan 19, 2020
Maintainer Author

dotsdl
Jan 21, 2020
Maintainer

dotsdl
Jan 29, 2020
Maintainer