New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Autoc #89

Open

wdwvt1 wants to merge 7 commits into caporaso-lab:master from wdwvt1:autoc

Contributor

wdwvt1 commented Feb 10, 2017

No description provided.

wdwvt1 and others added 7 commits

September 14, 2016 01:51


          ~45% speed improvement


          Remove unnecessary scaling

This wrings another ~8% performance boost from a new way to draw random numbers.


          Adds 'method5' as a mixing method

d36784f


          Adds unit tests for method5

74f2c66


          Utilize validate_gibbs_input to check input dfs

740a501


          Updates to be flake8 compliant

4e512d6


          Working auto convergence

05771a8

This commit creates functionality for the python API. Specifically, `gibbs` can now be allowed to auto-converge where it will continue sampling until it converges or reaches a fixed number of samples (defined by the user.

Contributor Author

wdwvt1 commented Feb 10, 2017

hmm, not sure how i got all these other commits in there. this is for initial review and then we can discuss parameters.

coveralls commented Feb 10, 2017

Coverage decreased (-7.9%) to 91.29% when pulling 05771a8 on wdwvt1:autoc into f01d11b on biota:master.

Collaborator

lkursell commented Feb 13, 2017

@gregcaporaso Can you review this pull request?

Member

gregcaporaso commented Feb 16, 2017

I can review this over the weekend.

gregcaporaso requested changes

View reviewed changes

Member

gregcaporaso left a comment

@wdwvt1, some comments on this. Overall it looks reasonable.

sourcetracker/_sourcetracker.py

    
            @@ -379,11 +379,11 @@ def __init__(self, alpha1, alpha2, beta, source_data):
          
                      self.joint_probability = np.zeros(self.V, dtype=np.float64)

                  def set_n(self, n):

                      """Set the sum of the sink."""

                      '''Set the sum of the sink.'''

Member

gregcaporaso Feb 18, 2017

A setter like this isn't necessary, since n is a public property and can be set directly. I recommend dropping the setter, but not a big deal.

sourcetracker/_sourcetracker.py

+                  If `autoc` is not `None`, the meaning of the `draws_per_restart` parameter
+                  changes. If `autoc` is `None`, `draws_per_restart` draws will be taken. If
+                  `True`, this function will check for convergence in the number of counts in

Member

gregcaporaso Feb 18, 2017

This description is a bit confusing. It reads like autoc takes either None or True.

I would change this line to say: a function is provided, it will be applied to check...

sourcetracker/_sourcetracker.py

                       ordering (same ordering as `final_env_assignments`). The [i, j] entry
                       is the environment that the taxon `final_env_assignments[i, j]` is
                       determined to have come from in draw i (j is the environment).
-                  """
+                  times : np.array

Member

gregcaporaso Feb 18, 2017

This change represents a backward-incompatible interface change, so should probably be noted in the CHANGELOG.md file.

Member

gregcaporaso Feb 18, 2017

Since you're changing the interface anyway, it might make sense for this to just return a NamedTuple object - that's nicer for handling many return values (it's a bit more self-documenting, but since it's still a tuple code that calls this function shouldn't need to be updated).

sourcetracker/_sourcetracker.py

                       additional samples will be drawn every `delay` number of passes. This
                       is also known as 'thinning'. Thinning helps reduce the impact of
                       correlation between adjacent states of the Markov chain.
+                  autoc : function or None

Member

gregcaporaso Feb 18, 2017

What does autoc stand for? Can you not abbreviate it for clarity?

sourcetracker/_sourcetracker.py

+                  autoc : function or None
+                      Determines what mode the function will be run in. If not `None`,
+                      `autoc` will be used to determine if the chain has converged. `autoc`
+                      must accept two arguments; the `final_envcounts` array and an index

Member

gregcaporaso Feb 18, 2017

It looks like it's passed three arguments below.

Member

gregcaporaso Feb 18, 2017

Add: autoc should return a bool indicating if convergence has been achieved (True) or not (False).

sourcetracker/_sourcetracker.py

-                              # update counts for each environment and the unknown source
-                              # if necessary.
-                              new_e_idx = np.random.choice(source_indices, p=jp / jp.sum())
+                              cs = jp.cumsum()

Member

gregcaporaso Feb 18, 2017

Why the change from jp.sum() to jp.cumsum()? (Just want to make sure that that was intentional.)

sourcetracker/_sourcetracker.py

+                      return (final_envcounts, final_env_assignments,
+                              final_taxon_assignments, times, conv_chain_fraction)
+                  else:

Member

gregcaporaso Feb 18, 2017

You really shouldn't have different collections of values in the return here, it makes code that uses this much harder to work with as the calling function has to anticipate what it's going to get back. It would be much better to return an empty np.array in place of times, and a float that makes sense (0.0?) in place of conv_chain_fraction. If you don't want to do that, you should split this entire function into two different functions corresponding to the different behavior. For example, gibbs_sampler and gibbs_sampler_autoc.

sourcetracker/_sourcetracker.py

+                  Returns
+                  -------
+                  int

Member

gregcaporaso Feb 18, 2017

This should be a bool, not an int.

sourcetracker/_sourcetracker.py

+                      convs = pd.DataFrame(np.array([i[3] for i in results]),
+                                           index=sinks.index, columns=cols)
+                      convs['converging_fraction'] = [i[4] for i in results]
+                      return mpm, mps, fas, convs

Member

gregcaporaso Feb 18, 2017

Same comment as above about not changing the number of values returned from the function. I highly recommend not going with this approach.

sourcetracker/_sourcetracker.py

		###############################################################################


		def method5(sources, sinks):

Member

gregcaporaso Feb 18, 2017

Is there a more descriptive name that could be used here?

Member

gregcaporaso commented Feb 18, 2017

@wdwvt1, it looks like this was built on a branch containing the commits in #83, which is why those commits are in here. This therefore seems to contain all of the changes in that PR as well. It's generally good practice to keep different changes in different PRs - if you want to go with that approach, you could git checkout master, git pull upstream master, and then copy you changes over to a new branch and submit that in a new PR.

Contributor Author

wdwvt1 commented Feb 18, 2017 via email

you are tottaly right - i am not sure how i managed to do that :/. thanks for all the comments! i will clean these up and submit a new PR.

On Sat, Feb 18, 2017 at 12:36 PM, Greg Caporaso ***@***.***> wrote: @wdwvt1 <https://github.com/wdwvt1>, it looks like this was built on a branch containing the commits in #83 <#83>, which is why those commits are in here. This therefore seems to contain all of the changes in that PR as well. It's generally good practice to keep different changes in different PRs - if you want to go with that approach, you could git checkout master, git pull upstream master, and then copy you changes over to a new branch and submit that in a new PR. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#89 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA__-bzJhZKcfRHLJdY3P4E0Il8zuexMks5rd1Y6gaJpZM4L9Z5f> .

Collaborator

lkursell commented Jul 25, 2018

@wdwvt1 - any chance to address @gregcaporaso's comments to implement the autoconvergence?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet