-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nuclinfo Major Pair and Minor Pair overhaul #3735
Nuclinfo Major Pair and Minor Pair overhaul #3735
Conversation
Hello @ALescoulie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2023-10-10 06:31:35 UTC |
Codecov ReportAll modified lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #3735 +/- ##
==========================================
Coverage 93.40% 93.41%
==========================================
Files 170 184 +14
Lines 22257 23394 +1137
Branches 4071 4075 +4
==========================================
+ Hits 20790 21854 +1064
- Misses 951 1024 +73
Partials 516 516
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry only had the time/energy for a quick review.
Couple of extra things I couldn't add comments for but worth mentioning since it seems like there's going to be further changes to this code:
- lines 132, 139, and 142 - we already populate a times array under self.times, is this still necessary to populate? Can we just do a call to
np.array(self.times)
? - The Attributes for NucPairDist aren't indented properly, could you change them here?
- Dicussion point (other @MDAnalysis/coredevs please weigh in here), the way the
results
dictionary is being used here is very different from how we do results everywhere else & imho makes it a bit hard to specifically extract the pairs of distances vs say the times entry. I realise it's a bit late since this already shipped in 2.2.0, but before we keep going, would it make sense to consider switching this to instead beresults.pair_distances
(maybe even making it a pairs x times 2D ndarray?
Bases are matched by their index in the lists given as arguments. | ||
|
||
Parameters | ||
__________ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(here and everywhere else) I know these still render, but the line style is very much different from the numpy-like style guide we use. Can you instead replace this with a ------
line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in a recent commit
for s in strand: | ||
if s[0].resname[0] in [c_name, t_name, u_name]: | ||
a1, a2 = o2_name, c2_name | ||
elif s[0].resname[0] in [a_name, g_name]: | ||
a1, a2 = c2_name, o2_name | ||
else: | ||
raise ValueError(f"{s} are not valid nucleic acids") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm misreading, this code block is essentially being repeated in all the child classes. Could you just use a staticmethod here that generalises this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then you can test it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored it into the NucPairDist
parent class like @IAlibay recommended
else: | ||
raise ValueError(f"{s} are not valid nucleic acids") | ||
|
||
sel1.append(s[0].atoms.select_atoms(f'name {a1}')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can double check this on your end but testing things out you'd probably get a ~ 5x speedup if you did
s[0].atoms[np.where(s[0].atoms.names == a1)]
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be curious if this was still true after the string interning changes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardjgowers PLEASE write a benchmark case to be run for anything that you improve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been bench-marked yet? If not I can set one up and adjust the selection code accordingly.
elif s[0].resname[0] in [a_name, g_name]: | ||
a1, a2 = c2_name, o2_name | ||
else: | ||
raise ValueError(f"{s} are not valid nucleic acids") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please cover these errors with tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I test in the refactor of the selection into NucPairDist
Attributes | ||
__________ | ||
results: numpy.ndarray | ||
first index is selection second index is time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing indentation it seems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
raise ValueError(f"{s} are not valid nucleic acids") | ||
|
||
sel1.append(s[0].atoms.select_atoms(f'name {a1}')) | ||
sel2.append(s[1].atoms.select_atoms(f'name {a2}')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here if one of the selections returns an empty atomgroup and the other doesn't? (for some reason or another, maybe the wrong name selection was passed?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check for that in the static method in NucPairDist
@@ -56,6 +64,7 @@ | |||
from .distances import calc_bonds | |||
from .base import AnalysisBase, Results | |||
from MDAnalysis.core.groups import Residue | |||
from .dihedrals import Dihedral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably missing something obvious, where is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that didn't get cut when I moved a commit around. I was working on the rebuilding another the torsion calculator and had imported that, I'll remove that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed in latest push
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of extra things since I was playing with pytest.approx
from MDAnalysis.analysis.nucleicacids import WatsonCrickDist,\ | ||
MajorPairDist, MinorPairDist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't get picked up by pep8speaks, but implicit continuation is best where possible
from MDAnalysis.analysis.nucleicacids import WatsonCrickDist,\ | |
MajorPairDist, MinorPairDist | |
from MDAnalysis.analysis.nucleicacids import (WatsonCrickDist, | |
MajorPairDist, MinorPairDist,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the issue in my latest push
MI = MinorPairDist(strand1, strand2) | ||
MI.run() | ||
|
||
assert_allclose(MI.results[0][0], 15.06506, atol=1e-3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another one for @MDAnalysis/coredevs to consider for our testing style going ahead - single float comparisons are ~ 20x faster (at least on my laptop) if done using pytest.approx
instead of assert_allclose
. It's microseconds (170 microsec vs 9 microsec), but it adds up quickly to seconds, how do we feel about making pytest.approx the assert method of choice in those cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I raised #3743 for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored my test to use pytest approx
package/CHANGELOG
Outdated
@@ -27,6 +27,8 @@ Enhancements | |||
* Additional logger.info output when per-frame analysis starts (PR #3710) | |||
* Timestep has been converted to a Cython Extension type | |||
(CZI Performance track, PR #3683) | |||
* Add higher performance AnalysisBase derived Major and Minor pair distances to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually go newest first for these entries, could you move it to the top of the enhancements list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! Refactoring old code is never pleasant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the rewrite.
I have a bunch of readability comments (see inline).
I agree with @IAlibay that we should remove results['times']
. I also agree that results
needs restructuring, both for consistency but also for ease of accessing the data with e.g. mdacli. Let's raise a separate issue where others can also easily chime in. Normally we adhere to semantic versioning. However, I would consider this module still experimental and I am willing to break strict semver here and just change results
without deprecations. The changes to results can go into a separate PR (preferrable) or this one (if a lot easier).
ValueError | ||
if the residues given are not amino acids | ||
ValueError | ||
if the selections given are not the same length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
""" | ||
|
||
def __init__(self, strand1: List[Residue], strand2: List[Residue], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union(List[Residue], ResidueGroup)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add that, I looked at ResidueGroup
and there is no reason NucPairDist
can't accept it or Residue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used a type alias to do it so that the code is less verbose, what do you think about that solution? It is a bit less clear in the docs, but simplifies the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what a type alias is and where you used it — sorry, apologies for my ignorance. Can you please explain and comment on the corresponding piece of code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least from the docs it is not clear to me that the class would now also accept strand1: ResidueGroup
— ... at least from your comment I surmise that's what should be possible now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orbeckst A type alias is when you give a specific name to a composite type in my case ResidueClass
is an alias for Union[Residue, ResidueGroup]
so my code's type signatures are a bit less verbose
**kwargs) -> None: | ||
sel1: List[mda.AtomGroup] = [] | ||
sel2: List[mda.AtomGroup] = [] | ||
strand = zip(strand1, strand2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plural "strands" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps clearer, double_strand
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set it to strands
Before going much further here we should discuss the format of |
@ALescoulie It would be great to not let this work get forgotten, are you still looking to contribute this? |
Yeah, I'll continue working on this. Sorry for abandoning it, I was busy with school and dealing with mental health issues. I'm getting the development repo set up on my desktop now. |
Hi @ALescoulie good to hear from you! Would be fantastic to get your work in! |
Linter Bot Results:Hi @ALescoulie! Thanks for making this PR. We linted your code and found the following: Some issues were found with the formatting of your code.
Please have a look at the Please note: The |
I pushed a larger change to the PR. It was a bit hard picking up where I left off, but I was able to jump back into developing without too many issues. I refactored my code so that all of the Pair distance analysis classes use a TODO
|
Co-authored-by: Oliver Beckstein <[email protected]>
descriptions and fix misc typos
strand1 and strand2 are different types
simplify type checks and catch TypeError
@orbeckst I believe I addressed your feedback on the documentation, and added some further clarity. I also rewrote how |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments. The only real issue is that we need to deprecate results.pair_distance
in the WatsonCrick class properly.
In the new classes there's nothing to deprecate as they appear here for the first time.
|
||
.. versionchanged:: 2.5.0 | ||
Accessing results by passing strand indices to :attr:`results` is | ||
was deprecated and is now removed as of MDAnalysis version 2.5.0. Please | ||
use :attr:`results.pair_distances` instead. | ||
The :attr:`results.times` was deprecated and is now removed as of | ||
MDAnalysis 2.5.0. Please use the class attribute :attr:`times` instead. | ||
|
||
.. versionchanged:: 2.7.0 | ||
.. _Deprecation Notice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above : does the anchor work, did you check the generated docs?
`pair_distances` in WCDist
verification TypeError
@orbeckst I think this PR about ready, assuming I didn't miss any typos in the docs. I also pulled most of the atom selections in the tests into a fixture so there is less repetition when running the CI. |
- use deprecated sphinx role - fixed minor errors in descriptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ALescoulie , looks good.
I made some changes to the docs, mainly to use sphinx markup for deprecations, but I also included a few fixes. In particular, please have a look for yourself that I correctly defined results.distances
.
Distances are stored in a 2d numpy array with axis 0 (first index) | ||
indexing the trajectory frame and axis 1 (second index) selecting the | ||
Residue pair. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ALescoulie please check that this rewritten description is correct. It is now consistent with the description of results.pair_distances
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle all good, but please fix PEP8 things.
06a55cb
to
ee44dca
Compare
@orbeckst I got the pep8 stuff done |
Hooray. I reopen #3720 for the other things. |
Fixes #3720
Changes made in this Pull Request:
PR Checklist