Nuclinfo Major Pair and Minor Pair overhaul #3735

ALescoulie · 2022-06-26T18:35:23Z

Fixes #3720

Changes made in this Pull Request:

Add a class for Major Pair and Minor Pair distance in the new overhauled nucleic acids module.

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

pep8speaks · 2022-06-26T18:35:26Z

Hello @ALescoulie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-10-10 06:31:35 UTC

codecov · 2022-06-26T18:57:12Z

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (427f1a7) 93.40% compared to head (ee44dca) 93.41%.

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #3735     +/-   ##
==========================================
  Coverage    93.40%   93.41%             
==========================================
  Files          170      184     +14     
  Lines        22257    23394   +1137     
  Branches      4071     4075      +4     
==========================================
+ Hits         20790    21854   +1064     
- Misses         951     1024     +73     
  Partials       516      516

Files	Coverage Δ
package/MDAnalysis/analysis/nucleicacids.py	`100.00% <100.00%> (ø)`

... and 14 files with indirect coverage changes


IAlibay

Sorry only had the time/energy for a quick review.

Couple of extra things I couldn't add comments for but worth mentioning since it seems like there's going to be further changes to this code:

lines 132, 139, and 142 - we already populate a times array under self.times, is this still necessary to populate? Can we just do a call to np.array(self.times)?
The Attributes for NucPairDist aren't indented properly, could you change them here?
Discussion point (other @MDAnalysis/coredevs please weigh in here): the way the results dictionary is being used here is very different from how we do results everywhere else and, imho, makes it a bit hard to specifically extract the pairs of distances vs, say, the times entry. I realise it's a bit late since this already shipped in 2.2.0, but before we keep going, would it make sense to consider switching this to instead be results.pair_distances (maybe even making it a pairs x times 2D ndarray)?

IAlibay · 2022-06-27T20:40:04Z

package/MDAnalysis/analysis/nucleicacids.py

+    Bases are matched by their index in the lists given as arguments.
+
+    Parameters
+    __________


(here and everywhere else) I know these still render, but the line style is very much different from the numpy-like style guide we use. Can you instead replace this with a ------ line?

fixed in a recent commit

IAlibay · 2022-06-27T20:50:20Z

package/MDAnalysis/analysis/nucleicacids.py

+        for s in strand:
+            if s[0].resname[0] in [c_name, t_name, u_name]:
+                a1, a2 = o2_name, c2_name
+            elif s[0].resname[0] in [a_name, g_name]:
+                a1, a2 = c2_name, o2_name
+            else:
+                raise ValueError(f"{s} are not valid nucleic acids")


Unless I'm misreading, this code block is essentially being repeated in all the child classes. Could you just use a staticmethod here that generalises this code?

yes, please 👍

And then you can test it separately.

refactored it into the NucPairDist parent class like @IAlibay recommended

IAlibay · 2022-06-27T21:00:05Z

package/MDAnalysis/analysis/nucleicacids.py

+            else:
+                raise ValueError(f"{s} are not valid nucleic acids")
+
+            sel1.append(s[0].atoms.select_atoms(f'name {a1}'))


You can double check this on your end but testing things out you'd probably get a ~ 5x speedup if you did
s[0].atoms[np.where(s[0].atoms.names == a1)] instead

I'd be curious if this was still true after the string interning changes...

@richardjgowers PLEASE write a benchmark case to be run for anything that you improve

Has this been bench-marked yet? If not I can set one up and adjust the selection code accordingly.
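A benchmark along these lines could be set up with the standard-library timeit module. The sketch below is only a harness: `lookup_by_scan` and `lookup_by_index` are hypothetical stand-ins that operate on a plain list of atom names. They are not the MDAnalysis `select_atoms` call or the NumPy mask discussed above; those would be swapped in against a real Universe.

```python
import timeit

# Stand-in data: atom names for one residue (illustrative only).
NAMES = ["P", "O5'", "C5'", "C4'", "N1", "C2", "O2", "N3", "C4", "N4"]

# Precomputed name -> positions map, the analogue of a cheaper lookup.
INDEX = {}
for i, name in enumerate(NAMES):
    INDEX.setdefault(name, []).append(i)


def lookup_by_scan(target):
    """Find matching positions by scanning every name."""
    return [i for i, name in enumerate(NAMES) if name == target]


def lookup_by_index(target):
    """Find matching positions via the precomputed dictionary."""
    return INDEX.get(target, [])


def benchmark(func, target="O2", number=10_000, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(target))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for func in (lookup_by_scan, lookup_by_index):
        per_call = benchmark(func) * 1e6
        print(f"{func.__name__}: {per_call:.3f} microseconds per call")
```

Taking the minimum over several repeats reduces noise from other processes; checking that both candidates return the same result before timing them avoids comparing functions that do different work.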

IAlibay · 2022-06-27T21:00:20Z

package/MDAnalysis/analysis/nucleicacids.py

+            elif s[0].resname[0] in [a_name, g_name]:
+                a1, a2 = c2_name, o2_name
+            else:
+                raise ValueError(f"{s} are not valid nucleic acids")


Also please cover these errors with tests

I test this in the refactor of the selection into NucPairDist

IAlibay · 2022-06-27T21:06:21Z

package/MDAnalysis/analysis/nucleicacids.py

+    Attributes
+    __________
+    results: numpy.ndarray
+    first index is selection second index is time


This is missing indentation it seems?

IAlibay · 2022-06-27T21:44:41Z

package/MDAnalysis/analysis/nucleicacids.py

+                raise ValueError(f"{s} are not valid nucleic acids")
+
+            sel1.append(s[0].atoms.select_atoms(f'name {a1}'))
+            sel2.append(s[1].atoms.select_atoms(f'name {a2}'))


What happens here if one of the selections returns an empty atomgroup and the other doesn't? (for some reason or another, maybe the wrong name selection was passed?)

I added a check for that in the static method in NucPairDist

IAlibay · 2022-06-27T21:54:22Z

package/MDAnalysis/analysis/nucleicacids.py

@@ -56,6 +64,7 @@
 from .distances import calc_bonds
 from .base import AnalysisBase, Results
 from MDAnalysis.core.groups import Residue
+from .dihedrals import Dihedral


Probably missing something obvious, where is this used?

Oh, that didn't get cut when I moved a commit around. I was working on rebuilding the torsion calculator and had imported that. I'll remove it.

removed in latest push

IAlibay

Couple of extra things since I was playing with pytest.approx

IAlibay · 2022-06-27T22:10:57Z

testsuite/MDAnalysisTests/analysis/test_nucleicacids.py

+from MDAnalysis.analysis.nucleicacids import WatsonCrickDist,\
+    MajorPairDist, MinorPairDist


It doesn't get picked up by pep8speaks, but implicit continuation is best where possible

Suggested change

-from MDAnalysis.analysis.nucleicacids import WatsonCrickDist,\
-    MajorPairDist, MinorPairDist
+from MDAnalysis.analysis.nucleicacids import (WatsonCrickDist,
+    MajorPairDist, MinorPairDist,)

fixed the issue in my latest push

IAlibay · 2022-06-27T22:25:35Z

testsuite/MDAnalysisTests/analysis/test_nucleicacids.py

+    MI = MinorPairDist(strand1, strand2)
+    MI.run()
+
+    assert_allclose(MI.results[0][0], 15.06506, atol=1e-3)


another one for @MDAnalysis/coredevs to consider for our testing style going ahead - single float comparisons are ~ 20x faster (at least on my laptop) if done using pytest.approx instead of assert_allclose. It's microseconds (170 microsec vs 9 microsec), but it adds up quickly to seconds, how do we feel about making pytest.approx the assert method of choice in those cases?

I raised #3743 for you.

I refactored my test to use pytest approx
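For reference, a minimal sketch of the scalar comparison being proposed, assuming pytest is installed. The value of `computed` is illustrative and not taken from a real analysis run.

```python
import pytest

# Value a test might obtain from an analysis run (illustrative number).
computed = 15.0651

# Scalar comparison with an absolute tolerance, analogous to
# numpy.testing.assert_allclose(computed, 15.06506, atol=1e-3).
assert computed == pytest.approx(15.06506, abs=1e-3)

# A value outside the tolerance does not compare equal.
assert 15.07 != pytest.approx(15.06506, abs=1e-3)
```

When only `abs` is given, `pytest.approx` ignores its default relative tolerance, so the check above depends on the absolute tolerance alone.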

IAlibay · 2022-06-27T22:26:44Z

package/CHANGELOG

@@ -27,6 +27,8 @@ Enhancements
  * Additional logger.info output when per-frame analysis starts (PR #3710)
  * Timestep has been converted to a Cython Extension type
    (CZI Performance track, PR #3683)
+  * Add higher performance AnalysisBase derived Major and Minor pair distances to


We usually go newest first for these entries, could you move it to the top of the enhancements list?

richardjgowers

Thanks for doing this! Refactoring old code is never pleasant.

orbeckst

Thank you for the rewrite.

I have a bunch of readability comments (see inline).

I agree with @IAlibay that we should remove results['times']. I also agree that results needs restructuring, both for consistency and for ease of accessing the data with e.g. mdacli. Let's raise a separate issue where others can also easily chime in. Normally we adhere to semantic versioning. However, I would consider this module still experimental and I am willing to break strict semver here and just change results without deprecations. The changes to results can go into a separate PR (preferable) or this one (if a lot easier).


orbeckst · 2022-06-30T22:29:15Z

package/MDAnalysis/analysis/nucleicacids.py

+    ValueError
+    if the residues given are not amino acids
+    ValueError
+    if the selections given are not the same length


indentation

orbeckst · 2022-06-30T22:29:53Z

package/MDAnalysis/analysis/nucleicacids.py

+
+    """
+
+    def __init__(self, strand1: List[Residue], strand2: List[Residue],


Union(List[Residue], ResidueGroup) ?

I'll add that, I looked at ResidueGroup and there is no reason NucPairDist can't accept it or Residue

I used a type alias to do it so that the code is less verbose, what do you think about that solution? It is a bit less clear in the docs, but simplifies the code.

I don't understand what a type alias is and where you used it — sorry, apologies for my ignorance. Can you please explain and comment on the corresponding piece of code?

At least from the docs it is not clear to me that the class would now also accept strand1: ResidueGroup — ... at least from your comment I surmise that's what should be possible now.

@orbeckst A type alias is when you give a specific name to a composite type. In my case, ResidueClass is an alias for Union[Residue, ResidueGroup], so my code's type signatures are a bit less verbose.
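A minimal sketch of the idea, using empty stand-in classes so that it runs without MDAnalysis. `count_inputs` is a hypothetical function added only to show the alias in a signature.

```python
from typing import List, Union, get_args


class Residue:  # stand-in for MDAnalysis.core.groups.Residue
    pass


class ResidueGroup:  # stand-in for MDAnalysis.core.groups.ResidueGroup
    pass


# One name for the composite type, reused in every signature.
ResidueClass = Union[Residue, ResidueGroup]


def count_inputs(strand1: List[ResidueClass],
                 strand2: List[ResidueClass]) -> int:
    """Hypothetical function; annotations are hints, not runtime checks."""
    return len(strand1) + len(strand2)
```

The alias shortens signatures, but rendered documentation may show only the alias name, which is the readability trade-off raised above.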

orbeckst · 2022-06-30T22:31:29Z

package/MDAnalysis/analysis/nucleicacids.py

+                 **kwargs) -> None:
+        sel1: List[mda.AtomGroup] = []
+        sel2: List[mda.AtomGroup] = []
+        strand = zip(strand1, strand2)


plural "strands" ?

Or perhaps clearer, double_strand?

I set it to strands


orbeckst · 2022-06-30T23:37:23Z

Before going much further here we should discuss the format of Results — please see #3744.

IAlibay · 2023-07-31T11:07:15Z

@ALescoulie It would be great to not let this work get forgotten, are you still looking to contribute this?

ALescoulie · 2023-08-18T22:06:23Z

Yeah, I'll continue working on this. Sorry for abandoning it, I was busy with school and dealing with mental health issues. I'm getting the development repo set up on my desktop now.

orbeckst · 2023-08-18T22:34:39Z

Hi @ALescoulie good to hear from you! Would be fantastic to get your work in!

github-actions · 2023-08-19T07:38:59Z

Linter Bot Results:

Hi @ALescoulie! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location	Outcome
main package	⚠️ Possible failure
testsuite	⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/6465590293/job/17552033113

Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

ALescoulie · 2023-08-21T06:31:32Z

I pushed a larger change to the PR. It was a bit hard picking up where I left off, but I was able to jump back into developing without too many issues.

I refactored my code so that all of the pair distance analysis classes use a NucPairDist static method, select_strand_atoms, which takes in the strands, the selected atom names, and the nucleic acid names. It then returns a tuple containing the lists of atoms for each strand. It also has checks ensuring that the strands only contain nucleic acids and that the selections don't return empty AtomGroups. I also added tests for both errors.
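The shape of such a helper can be sketched in plain Python. Everything below is an assumption made for illustration: `FakeResidue` stands in for an MDAnalysis residue, and the signature and defaults are hypothetical rather than those of the actual `NucPairDist.select_strand_atoms`.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FakeResidue:
    """Stand-in for a residue: a residue name plus its atom names."""
    resname: str
    atom_names: List[str]

    def select(self, name: str) -> List[str]:
        """Return the atoms matching ``name`` (empty list if none)."""
        return [a for a in self.atom_names if a == name]


def select_strand_atoms(
    strand1: Sequence[FakeResidue],
    strand2: Sequence[FakeResidue],
    a1_name: str,
    a2_name: str,
    pyrimidines: Sequence[str] = ("C", "T", "U"),
    purines: Sequence[str] = ("A", "G"),
) -> Tuple[List[List[str]], List[List[str]]]:
    """Pick one atom per residue for each pair, validating as it goes."""
    sel1, sel2 = [], []
    for r1, r2 in zip(strand1, strand2):
        base = r1.resname[0]
        if base in pyrimidines:
            n1, n2 = a1_name, a2_name
        elif base in purines:
            # Purines use the two atom names in the opposite order.
            n1, n2 = a2_name, a1_name
        else:
            raise ValueError(f"{r1.resname} is not a valid nucleic acid")
        atoms1, atoms2 = r1.select(n1), r2.select(n2)
        if not atoms1 or not atoms2:
            # Catch a wrong atom name before it reaches the distance code.
            raise ValueError(
                f"selection for {r1.resname}/{r2.resname} is empty"
            )
        sel1.append(atoms1)
        sel2.append(atoms2)
    return sel1, sel2
```

Because the helper is independent of any instance state, it can be tested directly, including both error paths, without running a full analysis.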

TODO

Reformat NucPairDist to use the newer results format
Fix misc docs issues
Update the change log


ALescoulie · 2023-09-12T00:36:17Z

@orbeckst I believe I addressed your feedback on the documentation, and added some further clarity. I also rewrote how WatsonCrickDist handles List[Residue] in order to catch possible errors when both a List[Residue] and a ResidueGroup are passed in, and in cases where the list is not entirely Residues. I think I should still add one additional test to get back to 100% coverage, as there is one if statement I don't hit. I will also go read the documentation one more time.

orbeckst

Thanks for addressing my comments. The only real issue is that we need to deprecate results.pair_distance in the WatsonCrick class properly.

In the new classes there's nothing to deprecate as they appear here for the first time.
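One common way to deprecate a renamed attribute is to keep the old name as a property that warns and forwards to the new one. The sketch below is plain Python with a hypothetical `Results` class; it is not how MDAnalysis implements its results container, and the attribute names simply follow those mentioned in this thread.

```python
import warnings


class Results:
    """Minimal results container with a deprecated attribute alias."""

    def __init__(self, distances):
        self.distances = distances

    @property
    def pair_distances(self):
        # Old name: still works, but tells the caller what to use instead.
        warnings.warn(
            "results.pair_distances is deprecated; use results.distances",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.distances
```

Using `stacklevel=2` attributes the warning to the caller's line rather than to the property itself, which makes the message easier to act on.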


orbeckst · 2023-09-13T00:13:51Z

package/MDAnalysis/analysis/nucleicacids.py


    .. versionchanged:: 2.5.0
       Accessing results by passing strand indices to :attr:`results` is
       was deprecated and is now removed as of MDAnalysis version 2.5.0. Please
       use :attr:`results.pair_distances` instead.
       The :attr:`results.times` was deprecated and is now removed as of
       MDAnalysis 2.5.0. Please use the class attribute :attr:`times` instead.
+
+    .. versionchanged:: 2.7.0
+    .. _Deprecation Notice


as above : does the anchor work, did you check the generated docs?


ALescoulie · 2023-09-27T18:19:41Z

@orbeckst I think this PR is about ready, assuming I didn't miss any typos in the docs. I also pulled most of the atom selections in the tests into a fixture so there is less repetition when running the CI.


orbeckst

Thanks @ALescoulie , looks good.

I made some changes to the docs, mainly to use sphinx markup for deprecations, but I also included a few fixes. In particular, please have a look for yourself that I correctly defined results.distances.


orbeckst · 2023-10-10T01:56:07Z

package/MDAnalysis/analysis/nucleicacids.py

+        Distances are stored in a 2d numpy array with axis 0 (first index)
+        indexing the trajectory frame and axis 1 (second index) selecting the  
+        Residue pair.


@ALescoulie please check that this rewritten description is correct. It is now consistent with the description of results.pair_distances.
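The layout described in that docstring (axis 0 indexes the frame, axis 1 the residue pair) can be illustrated with a small NumPy array. The numbers are made up, and `distances` here is a local variable rather than the attribute of a real analysis object.

```python
import numpy as np

n_frames, n_pairs = 4, 3

# Illustrative data: row i holds the distances for frame i,
# column j holds the time series for residue pair j.
distances = np.arange(n_frames * n_pairs, dtype=float)
distances = distances.reshape(n_frames, n_pairs)

first_frame = distances[0]       # all pairs in frame 0, shape (n_pairs,)
pair_series = distances[:, 1]    # pair 1 across all frames, shape (n_frames,)
mean_per_pair = distances.mean(axis=0)  # average over frames, one per pair
```

With this layout, per-pair statistics are reductions over axis 0 and a single frame is one contiguous row.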


orbeckst

In principle all good, but please fix PEP8 things.

ALescoulie · 2023-10-10T03:39:26Z

@orbeckst, I'll get it done by tomorrow, then I can start on the next part of #3720. The pep8 stuff will just take a few minutes.

ALescoulie · 2023-10-10T06:32:22Z

@orbeckst I got the pep8 stuff done

orbeckst · 2023-10-10T13:56:18Z

Hooray. I reopen #3720 for the other things.

ALescoulie added 3 commits June 17, 2022 15:27

commit major pair and minor pair classes

01d08a9

add minor pair test

cc12345

commit major pair dist

9d3f644

github-actions bot added the Component-Analysis label Jun 26, 2022

ALescoulie added 2 commits June 26, 2022 11:37

update CHANGELOG

2de4021

bring up to PEP8

16abe7b

ALescoulie mentioned this pull request Jun 26, 2022

Re-implement nuclinfo using AnalysisBase style subclasses #3720

Open

5 tasks

Merge branch 'MDAnalysis:develop' into nuclinfo_dist_rebuild

7215d49

IAlibay requested changes Jun 27, 2022


IAlibay previously requested changes Jun 27, 2022


richardjgowers reviewed Jun 29, 2022


orbeckst mentioned this pull request Jun 30, 2022

modernize testing code #3743

Open

orbeckst requested changes Jun 30, 2022


orbeckst mentioned this pull request Jun 30, 2022

change nucleicacids results dict #3744

Closed

IAlibay added the close? Evaluate if issue/PR is stale and can be closed. label Jul 31, 2023

orbeckst removed the close? Evaluate if issue/PR is stale and can be closed. label Aug 18, 2023

Merge branch 'develop' into nuclinfo_dist_rebuild

c4936b9

ALescoulie added 5 commits August 21, 2023 12:40

fix documentation formatting

c408c69

refactor to use select_strand_atoms staticmethod

ea6bec0

fix major and minor pair selections

7f6600c

refactor tests to use pytest approx

1b1f387

remove dihedral import

9165810

ALescoulie and others added 4 commits September 5, 2023 14:17

Add future decrication notice to WCDist

eecbf76

Co-authored-by: Oliver Beckstein <[email protected]>

update documentation with more detailed descriptions and fix misc typos

21aa853

fix DeprecationWarning in WCDist for case where strand1 and strand2 are different types

fd35fcb

add verification helper method to WCDist init to simplify type checks and catch TypeError

8792223

orbeckst requested changes Sep 13, 2023


ALescoulie added 5 commits September 25, 2023 21:44

Add deprecation notice to docs and changelog for `pair_distances` in WCDist

c9cdf40

fix deprecation_warn test and add test for strand verification TypeError

8c901cb

fix docs issues

d43cbb9

fix formatting

815c76f

simplify tests

78a1ef1

Merge branch 'develop' into nuclinfo_dist_rebuild

32af022

ALescoulie requested a review from orbeckst October 4, 2023 20:30

orbeckst added 2 commits October 9, 2023 17:05

Update CHANGELOG

087c8d6

Update docs in nucleicacids.py

2d612c2

- use deprecated sphinx role - fixed minor errors in descriptions

orbeckst approved these changes Oct 10, 2023

orbeckst reviewed Oct 10, 2023

orbeckst reviewed Oct 10, 2023

Apply suggestions from code review

d7a23f0

orbeckst requested changes Oct 10, 2023


ALescoulie added 3 commits October 10, 2023 06:19

fix pep8

d625933

fix pep8 after rebase

4b908ab

fix pep8 after rebase

ee44dca

ALescoulie force-pushed the nuclinfo_dist_rebuild branch from 06a55cb to ee44dca Compare October 10, 2023 06:31

orbeckst approved these changes Oct 10, 2023


orbeckst merged commit 18372f1 into MDAnalysis:develop Oct 10, 2023
21 checks passed
