Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoc #89

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Autoc #89

wants to merge 7 commits into from

Conversation

wdwvt1
Copy link
Contributor

@wdwvt1 wdwvt1 commented Feb 10, 2017

No description provided.

wdwvt1 and others added 7 commits September 14, 2016 01:51
This wrings another ~8% performance boost from a new way to draw random numbers.
This commit creates functionality for the python API. Specifically, `gibbs` can now be allowed to auto-converge where it will continue sampling until it converges or reaches a fixed number of samples (defined by the user.
@wdwvt1
Copy link
Contributor Author

wdwvt1 commented Feb 10, 2017

hmm, not sure how i got all these other commits in there. this is for initial review and then we can discuss parameters.

@coveralls
Copy link

Coverage Status

Coverage decreased (-7.9%) to 91.29% when pulling 05771a8 on wdwvt1:autoc into f01d11b on biota:master.

@lkursell
Copy link
Collaborator

@gregcaporaso Can you review this pull request?

@gregcaporaso
Copy link
Member

I can review this over the weekend.

Copy link
Member

@gregcaporaso gregcaporaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wdwvt1, some comments on this. Overall it looks reasonable.

@@ -379,11 +379,11 @@ def __init__(self, alpha1, alpha2, beta, source_data):
self.joint_probability = np.zeros(self.V, dtype=np.float64)

def set_n(self, n):
"""Set the sum of the sink."""
'''Set the sum of the sink.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A setter like this isn't necessary, since n is a public property and can be set directly. I recommend dropping the setter, but not a big deal.


If `autoc` is not `None`, the meaning of the `draws_per_restart` parameter
changes. If `autoc` is `None`, `draws_per_restart` draws will be taken. If
`True`, this function will check for convergence in the number of counts in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is a bit confusing. It reads like autoc takes either None or True.

I would change this line to say: a function is provided, it will be applied to check...

@@ -478,22 +501,36 @@ def gibbs_sampler(sink, cp, restarts, draws_per_restart, burnin, delay):
ordering (same ordering as `final_env_assignments`). The [i, j] entry
is the environment that the taxon `final_env_assignments[i, j]` is
determined to have come from in draw i (j is the environment).
"""
times : np.array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change represents a backward-incompatible interface change, so should probably be noted in the CHANGELOG.md file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're changing the interface anyway, it might make sense for this to just return a NamedTuple object - that's nicer for handling many return values (it's a bit more self-documenting, but since it's still a tuple code that calls this function shouldn't need to be updated).

@@ -462,6 +479,12 @@ def gibbs_sampler(sink, cp, restarts, draws_per_restart, burnin, delay):
additional samples will be drawn every `delay` number of passes. This
is also known as 'thinning'. Thinning helps reduce the impact of
correlation between adjacent states of the Markov chain.
autoc : function or None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does autoc stand for? Can you not abbreviate it for clarity?

autoc : function or None
Determines what mode the function will be run in. If not `None`,
`autoc` will be used to determine if the chain has converged. `autoc`
must accept two arguments; the `final_envcounts` array and an index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it's passed three arguments below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add: autoc should return a bool indicating if convergence has been achieved (True) or not (False).

# update counts for each environment and the unknown source
# if necessary.
new_e_idx = np.random.choice(source_indices, p=jp / jp.sum())
cs = jp.cumsum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change from jp.sum() to jp.cumsum()? (Just want to make sure that that was intentional.)

return (final_envcounts, final_env_assignments,
final_taxon_assignments, times, conv_chain_fraction)

else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You really shouldn't have different collections of values in the return here, it makes code that uses this much harder to work with as the calling function has to anticipate what it's going to get back. It would be much better to return an empty np.array in place of times, and a float that makes sense (0.0?) in place of conv_chain_fraction. If you don't want to do that, you should split this entire function into two different functions corresponding to the different behavior. For example, gibbs_sampler and gibbs_sampler_autoc.


Returns
-------
int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a bool, not an int.

convs = pd.DataFrame(np.array([i[3] for i in results]),
index=sinks.index, columns=cols)
convs['converging_fraction'] = [i[4] for i in results]
return mpm, mps, fas, convs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above about not changing the number of values returned from the function. I highly recommend not going with this approach.

###############################################################################


def method5(sources, sinks):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more descriptive name that could be used here?

@gregcaporaso
Copy link
Member

@wdwvt1, it looks like this was built on a branch containing the commits in #83, which is why those commits are in here. This therefore seems to contain all of the changes in that PR as well. It's generally good practice to keep different changes in different PRs - if you want to go with that approach, you could git checkout master, git pull upstream master, and then copy you changes over to a new branch and submit that in a new PR.

@wdwvt1
Copy link
Contributor Author

wdwvt1 commented Feb 18, 2017 via email

@lkursell
Copy link
Collaborator

@wdwvt1 - any chance to address @gregcaporaso's comments to implement the autoconvergence?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants