
do simplify() in steps #24

Closed
petrelharp opened this issue Apr 24, 2017 · 11 comments

@petrelharp
Collaborator

Currently we record all of history and then simplify() it all at once. This ends up taking rather a lot of memory. As discussed with @jeromekelleher, we ought to be able to do this stepwise. To do this, we'll need to make sure there's room enough to sample the entire population, and then:

  1. add_samples(ids) where ids is the current generation, and remember that ids -> range(len(ids))
  2. simplify()
  3. restart the argrecorder and remember that range(len(ids)) -> initial_gen
  4. run for a while longer
  5. append results to tree sequence and return to 1
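The loop above can be sketched in plain Python. This is a hedged stand-in, not the real ftprime/msprime API: `run_epoch` and `simplify` here are stub callables supplied by the caller, and we only assume (as stated above) that simplify() relabels the samples as 0..len(ids)-1.

```python
def record_in_steps(ids, run_epoch, simplify, n_epochs):
    """Alternate between simplifying and simulating, per the steps above.

    ids: node ids of the current generation.
    run_epoch(initial_gen) -> ids of the generation alive afterwards.
    simplify(ids) is assumed to relabel the samples as 0..len(ids)-1.
    """
    for _ in range(n_epochs):
        simplify(ids)                       # 1-2. add samples and simplify
        # 3. restart the recorder; the samples are now 0..len(ids)-1
        initial_gen = list(range(len(ids)))
        # 4. run for a while longer; 5. results are appended on the
        #    next pass through the loop
        ids = run_epoch(initial_gen)
    return ids
```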

To append more things to a tree sequence we need to:

  1. shift time in the NodeTable
  2. move the samples to the end as new IDs
  3. append new nodes and edgesets while remapping the IDs by adding a constant to all IDs and remapping initial_gen to whatever they should be
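A minimal, pure-Python sketch of that append operation, using plain lists in place of the real NodeTable/EdgesetTable. It assumes (hypothetically) that the initial generation occupies ids 0..k-1 of the new table, and that `initial_map` sends those ids back to their existing entries in the old table; all other new ids are offset by a constant.

```python
def append_tables(old_nodes, new_nodes, new_edges, initial_map, time_shift):
    """Append a newly recorded epoch onto an existing table pair.

    old_nodes / new_nodes: lists of node times (older = larger time).
    new_edges: (parent, child, left, right) tuples in new-node ids.
    initial_map: new id j (for j < k) -> old id of the same individual.
    time_shift: duration of the new epoch.
    """
    k = len(initial_map)
    offset = len(old_nodes) - k
    # 1. shift time in the old node table so both share one clock
    nodes = [t + time_shift for t in old_nodes]
    # 3. append the genuinely new nodes (initial-gen rows already exist)
    nodes += new_nodes[k:]

    def remap(i):
        # initial-generation ids map back to old entries; the rest
        # are shifted by a constant
        return initial_map[i] if i < k else i + offset

    edges = [(remap(p), remap(c), left, right)
             for p, c, left, right in new_edges]
    return nodes, edges
```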
@jeromekelleher

How much simpler would it be if IDs didn't have to be 0..n-1? I can probably remove this requirement fairly quickly if you're happy with possible bugs that may pop up. Simplify will probably still map the samples back to 0..n-1, though --- changing this wouldn't be impossible, just a bit more work, as we'd have to change the interface a bit.

@petrelharp
Collaborator Author

petrelharp commented Apr 24, 2017 via email

@jeromekelleher

OK, I'll have a look this evening to see if it can be done reasonably easily. I'll let you know either way.

@ashander
Owner

This change could improve performance a lot, hopefully. Results in #25 suggest that most time is spent merging records, which scales with the number of nodes as ___ (TBD) and with the number of edgesets as ___ (TBD).

Also, as @petrelharp and I were discussing today, this change is equivalent to:

  1. running a simulation on a population p from time t0 to t1 and using our alg to record the ARG
  2. simplifying the recorded ARG to make a tree sequence T1
  3. running the simulation further from t1 to t2
  4. simplifying the second recorded ARG to make a tree sequence T2
  5. adding T1 and T2 together --- if there is a well-defined way to "add" tree sequences.

We're thinking that we can define a clear way to "add" two tree sequences T1 and T2, given a mapping of node numbers in T2 to those in T1 and an offset relating times between the tree sequences. (Note: I'm writing "add" because the operator would be non-commutative and would assume that T1 is earlier in time than T2.)

@petrelharp
Collaborator Author

Hm, I'm not sure:

  • if I read them right, those results suggest that only about 30% of the time is in add_record: an improvement, but not the order of magnitude we're looking for there.

  • I don't think merging in steps would reduce the work of add_record, because once an individual dies, their entry in the argrecorder's dict does not change; so having more history around does not make this task more complex. The dict does grow in size, so if that causes Python to go slower it would be an issue, but I don't think it is a serious one?

This change would reduce memory usage (which is becoming a problem), but that's not what's leading to the slowdown we see over there.

Or am I missing something?

@ashander
Owner

😬 Yes, the results in #25 aren't a clear case for this (though it's still probably good to do). My thought was that as dict size increases, operations on the dict slow down, possibly to a crawl, but I need to do more profiling across a range of population sizes to see this. I'll comment more over in #25.

@petrelharp
Collaborator Author

petrelharp commented Apr 26, 2017 via email

@jeromekelleher

Just butting in here:

  1. simplifying the second recorded ARG to make a tree sequence T2
  2. adding T1 and T2 together --- if there is a well defined way to "add" tree sequences.

Is adding T1 and T2 together the right way to do this? I had imagined that the same tree sequence would be incrementally updated at each step. So, it would be something like:

  1. T = empty tree sequence
  2. Run the simulation for delta_t, adding records into T. t += delta_t
  3. T = T.simplify()
  4. if t < t_max, goto 2.

This seems a bit simpler to me. I don't think there will be much overhead in re-simplifying the older bits of T repeatedly, and it avoids the complexity of figuring out how to add two tree sequences together.
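A runnable toy version of that loop, under loudly stated assumptions: "records" are just a plain Python list, and `step` and `simplify` are hypothetical stand-ins supplied by the caller (the real simplify operates on a tree sequence in msprime).

```python
def simulate_in_steps(step, simplify, t_max, delta_t):
    """Incrementally record and re-simplify, per the steps above.

    step(t, records): appends new records for the interval
        [t, t + delta_t).
    simplify(records) -> reduced record list.
    """
    t, records = 0, []                 # 1. T = empty tree sequence
    while t < t_max:                   # 4. loop until t >= t_max
        step(t, records)               # 2. run the simulation for delta_t
        t += delta_t
        records = simplify(records)    # 3. T = T.simplify()
    return records
```

Re-simplifying the older records on every pass is redundant work, but as noted above the overhead should be small, and it sidesteps defining an "add" on tree sequences.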

@petrelharp
Collaborator Author

I agree with Jerome, which is also what I wrote initially. I hadn't caught that Jaime suggested something different.

@ashander
Owner

Peter's original suggestion and Jerome's elaboration (no worries @jeromekelleher, your input is always more than welcome!) are a good way to go for the change suggested in this issue.

My comment left some context out and could have been clearer. I should've mentioned that, for other reasons, it might be useful to define a way to "add" two tree sequences. For example if we wanted to run coalescent sims in reverse time, then run many replicate forward sims from that starting point. But implementing "add" is clearly not necessary to do simplify in steps.

@ashander
Owner

But implementing "add" is clearly not necessary to do simplify in steps.

though it turned out to be useful :)
