First version of coordinate generatign processor #328

Tsjerk · 2020-12-11T15:36:27Z

The coordinate generating processor. Unfortunately, the chirality is not always consistent, even for backbone, which needs to be fixed, and the side chains with rings are not well constructed. Otherwise, the coordinates are such that a protein stripped from side chains is rebuilt to be processable with martinize2.

fgrunewald

Overall looks pretty good in terms of a draft. In my opinion the main points for making it a working part of vermouth are:

missing docstrings and cryptic variable names. It is very time consuming to review (especially) the math if it is not clearly defined what even the shape of a variable is.
more clear signature of functions; in several cases functions return and do multiple things depending on the Type of the input variables and/or combination of input variables. I think in several places this can be more clear.
more testing. All 400 lines of code will need both unit tests and integral tests. This will be a huge amount of work, so probably this should be starting soon.
bench marking and profiling; there are several functions that operate on the complete graph. It doesn't look to me like they are very expensive to evaluate, but to be honest martinize2 cannot really afford to become much slower. So I'd highly recommend profiling in terms that that processor scales linearly with graph length/number of edges at the very lest.

fgrunewald · 2021-01-08T08:26:29Z

vermouth/processors/generate_coordinates.py

+def mindist(A, B):
+    return ((A[:, None] - B[None])**2).sum(axis=2).min(axis=1)


I understand that for many linear algebra functions it is tempting to use "math like" names. However, it is very hard to read. I'd suggest to at least in the doc string and even better the variable name itself, to at least mention if the objects A,B are both vectors, matrices and what the requirements on their shapes are. Ideally you'd even use 'vector_a' or 'matrix_a'. This holds throughout the document.

Personally I really disagree that math like names and algorithms are harder to read, for sure in (almost) purely mathematical algorithms. I have more difficulty with convoluted prose. I do agree that the concise variables should be explicated in the docstring, along with the equations used. The same for mathematical texts, right? In general, especially for straightforward mathematical algorithms, I do prefer staying close to the equations, as these will also be present in accompanying documentation (including the article and docs) and overall it makes code more digestable. This does require a consistent notation, which is a bit harder in code, and it may be better to use simple prefixes (a for array, m for matrix, v for vector, etc). I'll update the docstrings. If you guys prefer prefixes, then I can add them. But longer names would not generally help, because lines become longer, may need to be split and in some cases may cause more use of memory and/or overhead.

I'm somewhere in the middle. Changing A, B to vector_A, vector_B indeed does not add much over describing them in the docstring. However, A and B (usually) refer to some physical thing, such as an angle or position, and adding those does benefit understanding/comprehension. So how about e.g. position_A or pos_A, etc?

But longer names would not generally help, because lines become longer, may need to be split and in some cases may cause more use of memory and/or overhead.

I don't mind long lines too much (anything under 100 characters is fine, over 120 is not) (comment and docstrings should be <80). And I don't think your memory/overhead argument holds. In this function the intermediate results of the subtraction, sum, and min are all stored in memory separately at some point. Maybe also the squaring even.

In the function mentioned they are fully abstract; they're just sets of vectors, 2D numpy arrays, matrices. They could be positions, angles, RNA transcript counts, whatever. In the functions _chiral and _out, the input signatures refer to anchor and target, and the arithmetic short-hand variables refer to the schematics. I think that warrants the use as is. The one-letter references to conic point sets and some other things could benefit from less cryptic variable names.

Sounds good to me.

fgrunewald · 2021-01-08T08:29:02Z

vermouth/processors/generate_coordinates.py

+    Determine particles with known coordinates connected to
+    particles with given indices.


Docstring needs improving

fgrunewald · 2021-01-08T10:06:42Z

vermouth/processors/generate_coordinates.py

+    return P, S
+
+
+def triple_cone(n=24, distance=0.125, angle=109.4):


Can't we sum up the cone functions such that you say cones(n_conscutive_SP3s, anchor ...)? And do I get this right that this is for linear bonding i.e. triple cone is SP3-SP3-SP3 not SP3-(SP3)2? So branching doesn't get covered?

It will for sure be possible to generalize that. It's not, however, just for linear bonding. You can select multiple points (branches) at each level, which will give you (access to) the corresponding possible points at the next.

fgrunewald · 2021-01-08T10:15:11Z

vermouth/processors/generate_coordinates.py

+
+     b1
+       \
+    b2--a-->X
+       /
+     b3


Would be nice if the variable names would at least be matching the drawing.

fgrunewald · 2021-01-08T10:28:39Z

vermouth/processors/generate_coordinates.py

+    if control == None:
+        B = [
+            molecule.nodes[ix]['position']
+            for ix, v in molecule[anchor].items()
+            if v != {}
+        ]
+    else:
+        B = [ molecule.nodes[ix]['position'] for ix in control ]


I suppose that the shape of B determines how the atom is rebuild?

Now the default is to get B from the anchor, which otherwise you get from indices. For making the code more readable and testable, I think it would be better to either accept anchors or only accept indices. It seems redundant to have both. Maybe I'm mistaken, but I'm happy to learn.

Agreed. It's a remnant, but I was afraid of removing it in case it proved useful later.

fgrunewald · 2021-01-08T10:32:46Z

vermouth/processors/generate_coordinates.py

+    if target is not None:
+        molecule.nodes[target]['position'] = pos


Also here I'd argue it is better to either update the molecular position or to return the built position. That would be a much clearer signature.

I think this too is a remnant of some earlier snippets. What would be preferred? I think changing by reference is maybe too implicit, and the function would be more useful if it only returns positions, but that adds more overhead in the calling function.

fgrunewald · 2021-01-08T10:34:36Z

vermouth/processors/generate_coordinates.py

+    if control is None:
+        B = [
+            molecule.nodes[ix]['position']
+            for ix, v in molecule[anchor].items()
+            if v != {}
+        ]
+    else:
+        B = [ molecule.nodes[ix]['position'] for ix in control ]


same as above; also the fact that this code fragment occurs in two functions tells me that it could also be recycled.

@Tsjerk how thoroughly did you test this code on pdb files?

I tested it with several protein PDB files (single chain and multichain), including with listed missing atoms and with all side chains simply removed by hand. In all cases the atoms are rebuild to the point that conversion to CG is feasible. Some particles, notably hydrogens, may be misplaced, but that doesn't significantly affect the resulting CG model. It does, at present, make it impossible to use this to fix structures for atomistic simulations. I did not test cases with non-protein missing atoms.

In fact the placement of hydrogen matters quite a lot in Martini3. For example, rings mapped from COG without hydrogen will be too small. In general I agree that is fine because energy minimization takes care of it, but the elastic network is a problem. The EN will refer to the "unrelated" coordinates. How sever is this problem? I don't know. I do know, however, that for Martin3 DNA it has gone to a point that martinize2 is unusable with the desired mapping, because the EN counteracts the relaxed volume geometry. Therefore accurate reconstruction of atoms does matter to some extent also for CG models.

Furthermore, martinize2 was written with the intention of being usable independent of resolution. Yet this coordinate generator is another thing demonstrating that it is actually targeted specifically for the conversion of AA to CG nothing more nothing less. And in this aspect it is even worse than PDB2GMX. Not that we would have time to fix all this, but it should have been clear that an approximate coordinate generator is not really good enough for what martinze2 claims to be. That is my opinion on it.

Agreed, the method needs to be improved. This is a first take on it. It is, however, not an 'approximate coordinate generator'. To get this first version, a few shortcuts were taken, which can be relatively easily expanded on, but that takes time. The elastic network could be a concern. Is that applied on side chains? Or only on backbone beads? The latter would not be a concern. The first would, because the orientation of side chains cannot be considered optimized with respect to the neighborhood. A workaround would be to do a simulation with backbones fixed to allow side chains to equilibrate before applying the EN. But that's workflow-wise for the case where you have (many) missing parts). I wouldn't want to suggest doing too much energy minimizing stuff in martinize.

A somewhat greater concern of mine is the lack of chirality information, which is really problematic for atomistics scale models. But that cannot only be solved in this part, but also needs directives in the blocks/links that can then be taken into account. Here, the processor already has the mechanics (up/down), but nothing to base the placement on.

If it is only hydrogens that need to be built, the result is pretty much the same as what pdb2gmx does. pdb2gmx cannot build sidechains, not even a single atom. It can build you a C- or N-terminus, but that's it. So this is already miles better.

Okay, an inventory of the problems:

The CB is, for some reason, considered SP2, which causes HB1 (HB in VAL) to be placed on top of CA. This only happens if CB is built by the routine, not if it is already present. (expected to be easy to fix)

Rings are not constructed correctly. (expected to be somewhat hard)

The arginine guanidinium group is not planar. (expected to be relatively easy)

Otherwise the sidechains are well-formed, although not always placed optimally. There are optimizations possible without going to energy minimization.

This sounds pretty good to me.

I hereby veto any and all energy minimisation in vermouth/martinize2, which should answer the fist paragraph. It would, however, be good (excellent even) to issue a warning when reconstructing chains of atoms (a H is fine, a CH3 probably also fine, CH2-CH3 probably not) to advise users to energy minimize before doing e.g. elastic networks.

Chirality: Detect when something could be chiral (95%), and warn. No longer our problem. We can revisit this later.

1 and 3 should be fixed.
For 2 we can cheat as far as I'm concerned. Just assume it's a perfect polygon, and reconstruct the ring as-a-whole. Not ideal, especially regarding sugar, but maybe good enough?

Bottom line: best effort, and if we/you don't trust the result make a loud warning explaining to the user what they can do. If there's a warning no files are written (unless maxwarn), so it'll no longer be our problem.

fgrunewald · 2021-01-08T10:37:13Z

vermouth/processors/generate_coordinates.py

+    try:
+        rup, rdown = distance
+    except TypeError:
+        rup = rdown = distance


in terms of signature this isn't very clear either, because once distance acts as a float as in the default but then it can also be a list/array? You could have it as tuple or dict or any other way as well. In addition this try accept seems very broad and prevents sensible errors if the Type of distances is actually wrong.

A distance could be specified now as float or as two floats. There's a laziness there. So for signatures sake it shall be two floats always.

fgrunewald · 2021-01-08T10:37:52Z

vermouth/processors/generate_coordinates.py

+    if up is not None:
+        molecule.nodes[up]['position'] = pup
+    if down is not None:
+        molecule.nodes[down]['position'] = pdown
+
+    return pup, pdown


Same as before; either return distances or update molecule.

fgrunewald · 2021-01-08T10:45:45Z

vermouth/processors/generate_coordinates.py

+    '''Class for subgraphs of (missing) atoms with anchors'''
+    def __init__(self, molecule, limbs):
+        self.molecule = molecule
+        self.anchors = limbs[0][0]
+        self.bases = limbs[0][1]


Especially, needs doc-string of what is what

pckroon · 2021-01-18T11:22:16Z

vermouth/processors/generate_coordinates.py

+def get_missing_atoms(molecule):
+    '''
+    Determine particles without coordinates in molecule.
+    '''
+    return [
+        ix for ix in molecule
+        if molecule.nodes[ix].get('position') is None
+    ]


Replace for use of selector_has_position https://github.com/marrink-lab/vermouth-martinize/blob/master/vermouth/selectors.py#L67

pckroon · 2021-01-18T11:22:54Z

vermouth/processors/generate_coordinates.py

+    '''
+    return {
+        a for ix in indices for a in molecule[ix].keys()
+        if molecule.nodes[a].get('position') is not None


use selector_has_position

Add select_missing to selector.py?

Is an option, but isn't easier to just negate the outcome of has_position?

here yes, but in the other function the purpose is to select all atoms without coordinates. The purpose suggests that selectors.py is a better place. Probably less so for 'anchors' which are more specific to the building routines.

Makes sense, especially in combination with selectors.filter_minimal. In general I'm rather unhappy with the selection functions at the moment, since it's completely unclear whether they select molecules, atoms, nodes in molecules, or something else entirely. So feel free to use your own best judgement, and we'll clean up the selectors at a later moment.

pckroon · 2021-01-18T11:31:06Z

vermouth/processors/generate_coordinates.py

+                if coords is not None:
+                    P[mindist(P, coords) < contact**2] = np.nan
+                P = P.reshape((len(self.limbs), -1, 3))
+                for px in P.transpose((1,0,2)):


Little bit of lint

vermouth/processors/generate_coordinates.py

pckroon · 2021-01-18T11:34:00Z

vermouth/processors/generate_coordinates.py

+
+            prev = len(missing)
+            missing = get_missing_atoms(molecule)
+            #print(len(missing))


pckroon · 2021-01-18T11:34:47Z

vermouth/processors/generate_coordinates.py

+        return molecule
+
+
+def main(args):


Should be (re)moved before merging

pckroon

In general I like it!
I do agree with Fabian's comments though, and on top of that I have a few nitpicks. Also, it may be nice if this code produces some user output (logging!).

All that said, with deadlines and such approaching I'm open to merging this as-is (with a TODO comment, maybe) to speed up development at the cost of accruing maintenance debt
@Tsjerk how thoroughly did you test this code on pdb files?
@fgrunewald @jbarnoud Opinion?

jbarnoud · 2021-01-18T11:44:36Z

All that said, with deadlines and such approaching I'm open to merging this as-is (with a TODO comment, maybe) to speed up development at the cost of accruing maintenance debt
@Tsjerk how thoroughly did you test this code on pdb files?
@fgrunewald @jbarnoud Opinion?

As it is completely disconnected from the rest, I am not against it if it accelerates dev. Though, it means that actual integration will require more work since at least the tests are not in a usable state.

pckroon · 2021-01-18T11:49:05Z

Another thing that needs doing before merging is to add this to bin/martinize2 so it's actually used.

As it is completely disconnected from the rest, I am not against it if it accelerates dev. Though, it means that actual integration will require more work since at least the tests are not in a usable state.

How so? There are no tests for this code, which means Travis is still happy

fgrunewald · 2021-01-18T11:52:35Z

I am open for the option of merging without the code-coverage being 100%. However, I think it should come with at least basic testing on the geometry functions and someone testing of the model in the workflow of martinize. As far as I understand it is not integrated in to bin/martini2 yet and has only been tested via the main module provided.

Last time I tried running the main program it got tangled up in dependency errors. Maybe that is just me calling it; should be like this or am I mistaken:

python path_to_file/generate_coordinates.py name_of_pdb

Running the basic processor to reconstruct H atoms on CH2 groups it also gave a bunch of strange coordinate choices. Probably when you map them it is fine but it looks super strange in of you visualize it. Bottom line can someone else try running some tests. It could be that my PDBs are little special but as of now I'd really recommend an additional layer of testing.

jbarnoud · 2021-01-18T11:57:39Z

Another thing that needs doing before merging is to add this to bin/martinize2 so it's actually used.

As it is completely disconnected from the rest, I am not against it if it accelerates dev. Though, it means that actual integration will require more work since at least the tests are not in a usable state.

How so? There are no tests for this code, which means Travis is still happy

That's the thing: if it get integrated to bin/martinize2, then it needs tests. Once it is in the command line, then it can mess up the user's science. Or... it comes behind a flag with a massive warning about it being under development.

pckroon · 2021-01-28T08:54:30Z

@Tsjerk how thoroughly did you test this code on pdb files?

I tested it with several protein PDB files (single chain and multichain), including with listed missing atoms and with all side chains simply removed by hand. In all cases the atoms are rebuild to the point that conversion to CG is feasible. Some particles, notably hydrogens, may be misplaced, but that doesn't significantly affect the resulting CG model. It does, at present, make it impossible to use this to fix structures for atomistic simulations. I did not test cases with non-protein missing atoms.

Sounds reasonable. I'm a little surprised it struggles with hydrogens, I would expect those to be the easiest case. Could you try reconstructing the hydrogens from a small protein, and seeing whether the resulting structure survives Gromacs minimization? If so, I'd say is perfectly fine enough for now.
Did you try reconstructing proline sidechains and other (bi)cycles?

Tsjerk · 2021-01-28T09:04:09Z

@Tsjerk how thoroughly did you test this code on pdb files?

I tested it with several protein PDB files (single chain and multichain), including with listed missing atoms and with all side chains simply removed by hand. In all cases the atoms are rebuild to the point that conversion to CG is feasible. Some particles, notably hydrogens, may be misplaced, but that doesn't significantly affect the resulting CG model. It does, at present, make it impossible to use this to fix structures for atomistic simulations. I did not test cases with non-protein missing atoms.

Sounds reasonable. I'm a little surprised it struggles with hydrogens, I would expect those to be the easiest case. Could you try reconstructing the hydrogens from a small protein, and seeing whether the resulting structure survives Gromacs minimization? If so, I'd say is perfectly fine enough for now.

If it's only hydrogens, it works fine. That is indeed the simples case. There's a flaw with like the methyl of isoleucine.

Did you try reconstructing proline sidechains and other (bi)cycles?

These are built, but distorted, making them unsuitable for atomistic simulations, but doable for coarse graining.

Tsjerk · 2021-01-28T13:44:16Z

bench marking and profiling; there are several functions that operate on the complete graph. It doesn't look to me like they are very expensive to evaluate, but to be honest martinize2 cannot really afford to become much slower. So I'd highly recommend profiling in terms that that processor scales linearly with graph length/number of edges at the very lest.

No thorough benchmark, but it takes 260 ms on my machine (md41) to rebuild 18273 atoms in a protein that is stripped to the bare backbone.

pckroon · 2021-02-24T14:27:10Z

@Tsjerk Any progress/update?

…outh-martinize into coordinates

vermouth/processors/generate_coordinates.py

pckroon

Much better already :)
I still don't like the single letter variables. For example in _chiral and _out you can rename a -> anchor (yes, override the argument) and B -> bases. But I won't die on this hill.

The docstrings you wrote I like (nit aside). It's quite important that at least all public methods (e.g. mindist) describe the arguments like you do with _chiral. For _private methods we can be a little bit more relaxed. But ideally all have detailed docstrings.

pckroon · 2021-11-23T13:42:59Z

vermouth/processors/generate_coordinates.py

    }


 def align_z(v):
    '''
-    Generate rotation matrix aligning z-axis onto v (and vice-versa).
+    Generate rotation matrix aligning z-axis onto vector v (and vice-versa).


Is it an idea to rename v to vector here? It's a bit of a nitpick from my side, admittedly

pckroon · 2021-11-23T13:43:21Z

vermouth/processors/generate_coordinates.py

+    # normalized vectors, the alignment is a 180 degree
+    # rotation around the resultant vector.


This comment is so good I wonder if it should be part of the docstring.

pckroon · 2021-11-23T13:44:21Z

vermouth/processors/generate_coordinates.py

    Returns
    -------
-        None
+        position


Nit, also for the other docstring below:

Returns ---------- np.array The position of the outward pointing atom.

Sphinx is rather picky

First version of coordinate generatign processor

cb3c11b

fgrunewald requested changes Jan 8, 2021

View reviewed changes

pckroon reviewed Jan 18, 2021

View reviewed changes

vermouth/processors/generate_coordinates.py Show resolved Hide resolved

pckroon reviewed Jan 18, 2021

View reviewed changes

vermouth/processors/generate_coordinates.py

prev = len(missing)

missing = get_missing_atoms(molecule)

#print(len(missing))

Copy link

Member

pckroon Jan 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover

pckroon reviewed Jan 18, 2021

View reviewed changes

vermouth/processors/generate_coordinates.py Outdated

return molecule

def main(args):

Copy link

Member

pckroon Jan 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be (re)moved before merging

pckroon reviewed Jan 18, 2021

View reviewed changes

Tsjerk added 4 commits January 27, 2021 14:56

using selectors for getting atoms with missing coordinates

365be04

Removed main function from generate_coordinates

b639bc3

removed modify-by-reference in _out and _chiral

d925aec

oops. missed rename of call to get_missing_atoms

21a9359

Merge branch 'master' into coordinates

95882dc

pckroon mentioned this pull request Jan 28, 2021

Unflake hypothesis tests #333

Merged

registered coordinate generator processor in __init__.py

b0f28d7

Merge branch 'coordinates' of https://www.github.com/marrink-lab/verm…

914aa81

…outh-martinize into coordinates

pckroon reviewed Nov 19, 2021

View reviewed changes

vermouth/processors/generate_coordinates.py Outdated Show resolved Hide resolved

vermouth/processors/generate_coordinates.py Outdated Show resolved Hide resolved

Bug fixes in generate_coordinates.py

41992c0

pckroon requested changes Nov 23, 2021

View reviewed changes

fgrunewald force-pushed the master branch from 99a936f to 1fba29e Compare April 25, 2023 13:56

		def mindist(A, B):
		return ((A[:, None] - B[None])**2).sum(axis=2).min(axis=1)

		Determine particles with known coordinates connected to
		particles with given indices.

		return P, S


		def triple_cone(n=24, distance=0.125, angle=109.4):

		if target is not None:
		molecule.nodes[target]['position'] = pos

		# normalized vectors, the alignment is a 180 degree
		# rotation around the resultant vector.

First version of coordinate generatign processor #328

Are you sure you want to change the base?

First version of coordinate generatign processor #328

Conversation

Tsjerk commented Dec 11, 2020

fgrunewald left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pckroon Jan 28, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pckroon left a comment

Choose a reason for hiding this comment

jbarnoud commented Jan 18, 2021

pckroon commented Jan 18, 2021

fgrunewald commented Jan 18, 2021

jbarnoud commented Jan 18, 2021

pckroon commented Jan 28, 2021

Tsjerk commented Jan 28, 2021

Tsjerk commented Jan 28, 2021

pckroon commented Feb 24, 2021

pckroon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pckroon Jan 28, 2021 •

edited

Loading