Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First version of coordinate generatign processor #328

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
397 changes: 397 additions & 0 deletions vermouth/processors/generate_coordinates.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,397 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2018 University of Groningen
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Provides a processor for building missing coordinates.
"""

import numpy as np
import networkx as nx

from .processor import Processor


def mindist(A, B):
return ((A[:, None] - B[None])**2).sum(axis=2).min(axis=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that for many linear algebra functions it is tempting to use "math like" names. However, it is very hard to read. I'd suggest to at least in the doc string and even better the variable name itself, to at least mention if the objects A,B are both vectors, matrices and what the requirements on their shapes are. Ideally you'd even use 'vector_a' or 'matrix_a'. This holds throughout the document.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I really disagree that math like names and algorithms are harder to read, for sure in (almost) purely mathematical algorithms. I have more difficulty with convoluted prose. I do agree that the concise variables should be explicated in the docstring, along with the equations used. The same for mathematical texts, right? In general, especially for straightforward mathematical algorithms, I do prefer staying close to the equations, as these will also be present in accompanying documentation (including the article and docs) and overall it makes code more digestable. This does require a consistent notation, which is a bit harder in code, and it may be better to use simple prefixes (a for array, m for matrix, v for vector, etc). I'll update the docstrings. If you guys prefer prefixes, then I can add them. But longer names would not generally help, because lines become longer, may need to be split and in some cases may cause more use of memory and/or overhead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhere in the middle. Changing A, B to vector_A, vector_B indeed does not add much over describing them in the docstring. However, A and B (usually) refer to some physical thing, such as an angle or position, and adding those does benefit understanding/comprehension. So how about e.g. position_A or pos_A, etc?

But longer names would not generally help, because lines become longer, may need to be split and in some cases may cause more use of memory and/or overhead.

I don't mind long lines too much (anything under 100 characters is fine, over 120 is not) (comment and docstrings should be <80). And I don't think your memory/overhead argument holds. In this function the intermediate results of the subtraction, sum, and min are all stored in memory separately at some point. Maybe also the squaring even.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the function mentioned they are fully abstract; they're just sets of vectors, 2D numpy arrays, matrices. They could be positions, angles, RNA transcript counts, whatever. In the functions _chiral and _out, the input signatures refer to anchor and target, and the arithmetic short-hand variables refer to the schematics. I think that warrants the use as is. The one-letter references to conic point sets and some other things could benefit from less cryptic variable names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.



def get_missing_atoms(molecule):
'''
Determine particles without coordinates in molecule.
'''
return [
ix for ix in molecule
if molecule.nodes[ix].get('position') is None
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def get_anchors(molecule, indices):
'''
Determine particles with known coordinates connected to
particles with given indices.
Comment on lines +46 to +47
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring needs improving

'''
return {
a for ix in indices for a in molecule[ix].keys()
if molecule.nodes[a].get('position') is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use selector_has_position

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add select_missing to selector.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is an option, but isn't easier to just negate the outcome of has_position?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here yes, but in the other function the purpose is to select all atoms without coordinates. The purpose suggests that selectors.py is a better place. Probably less so for 'anchors' which are more specific to the building routines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, especially in combination with selectors.filter_minimal. In general I'm rather unhappy with the selection functions at the moment, since it's completely unclear whether they select molecules, atoms, nodes in molecules, or something else entirely. So feel free to use your own best judgement, and we'll clean up the selectors at a later moment.

}


def align_z(v):
'''
Generate rotation matrix aligning z-axis onto v (and vice-versa).
'''

# This routine is based on the notion that for any two
# vectors, the alignment is a 180 degree rotation
# around the resultant vector.

w = v / (v ** 2).sum() ** 0.5

if w[2] <= 1e-8 - 1:
return -np.eye(3) # Mind the inversion ...

w[2] += 1

return (2 / (w**2).sum()) * w * w[:, None] - np.eye(3)


def conicring(n=24, distance=0.125, angle=109.4, axis=None):
'''
Create a ring of positions corresponding to a cone with given angle.
An angle of 109.4 is suitable for ideal SP3 substituents.
'''
p = 2 * np.pi * np.arange(n) / n
r = distance * np.sin(angle * np.pi / 360)
z = distance * np.cos(angle * np.pi / 360)
X = np.stack([r * np.cos(p), r * np.sin(p), z * np.ones(n)], axis=1)
if axis is None:
return X
return X @ align_z(axis)


def double_cone(n=24, distance=0.125, angle=109.4, axis=None):
'''Create positions for two consecutive (SP3) bonded particles'''
P = conicring(n, distance, angle, axis)
S = np.array([P @ align_z(x) + x for x in P])
return P, S


def triple_cone(n=24, distance=0.125, angle=109.4):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we sum up the cone functions such that you say cones(n_conscutive_SP3s, anchor ...)? And do I get this right that this is for linear bonding i.e. triple cone is SP3-SP3-SP3 not SP3-(SP3)2? So branching doesn't get covered?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will for sure be possible to generalize that. It's not, however, just for linear bonding. You can select multiple points (branches) at each level, which will give you (access to) the corresponding possible points at the next.

'''Create possible positions for three consecutive (SP3) bonded particles'''
P, S = double_cone(n, distance, angle)
T = np.array([[[P @ align_z(u - x) + u] for u in U] for x, U in zip(P, S)])
return P, S, T


def _out(molecule, anchor, target=None, control=None, distance=0.125):
'''
Generate coordinates for atom bonded to anchor
using resultant vector of known substituents.

This approach works for placement of single unknown
substituents on sp1, sp2 or sp3 atoms. It should also
work for bipyramidal and other configurations.

b1
\
b2--a-->X
/
b3
Comment on lines +109 to +114
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice if the variable names would at least be matching the drawing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concur.


Parameters
----------

Returns
-------
None
'''
a = molecule.nodes[anchor]['position']
if control == None:
B = [
molecule.nodes[ix]['position']
for ix, v in molecule[anchor].items()
if v != {}
]
else:
B = [ molecule.nodes[ix]['position'] for ix in control ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that the shape of B determines how the atom is rebuild?

Now the default is to get B from the anchor, which otherwise you get from indices. For making the code more readable and testable, I think it would be better to either accept anchors or only accept indices. It seems redundant to have both. Maybe I'm mistaken, but I'm happy to learn.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It's a remnant, but I was afraid of removing it in case it proved useful later.

B = B - a
u = -B.sum(axis=0)
u *= distance / ((u ** 2).sum() ** 0.5)
pos = a + u
if target is not None:
molecule.nodes[target]['position'] = pos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here I'd argue it is better to either update the molecular position or to return the built position. That would be a much clearer signature.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this too is a remnant of some earlier snippets. What would be preferred? I think changing by reference is maybe too implicit, and the function would be more useful if it only returns positions, but that adds more overhead in the calling function.


return pos


def _chiral(molecule, anchor, up=None, down=None, control=None, distance=0.125):
'''
Generate coordinates for atoms bonded to chiral center
based on two known substituents.

This approach uses the positions of two substituents to
generate coordinates for one or two atoms connected to
a chiral center. The two atoms are indicated as `up` and
`down`, referring to the direction with respect to the
cross-product of the two substituents. If these control
atoms are not given explicitly in order, then the control
atoms are inferred from the connections of the anchor with
known coordinates. In that case, the specific chirality
of the center (R or S) cannot be controlled.

NOTE: This function should probably be able to determine
the weights of the substituents to allow placement
based on R or S chirality.

Parameters
----------


Returns
-------
None
'''
a = molecule.nodes[anchor]['position']
if control is None:
B = [
molecule.nodes[ix]['position']
for ix, v in molecule[anchor].items()
if v != {}
]
else:
B = [ molecule.nodes[ix]['position'] for ix in control ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above; also the fact that this code fragment occurs in two functions tells me that it could also be recycled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tsjerk how thoroughly did you test this code on pdb files?

I tested it with several protein PDB files (single chain and multichain), including with listed missing atoms and with all side chains simply removed by hand. In all cases the atoms are rebuild to the point that conversion to CG is feasible. Some particles, notably hydrogens, may be misplaced, but that doesn't significantly affect the resulting CG model. It does, at present, make it impossible to use this to fix structures for atomistic simulations. I did not test cases with non-protein missing atoms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact the placement of hydrogen matters quite a lot in Martini3. For example, rings mapped from COG without hydrogen will be too small. In general I agree that is fine because energy minimization takes care of it, but the elastic network is a problem. The EN will refer to the "unrelated" coordinates. How sever is this problem? I don't know. I do know, however, that for Martin3 DNA it has gone to a point that martinize2 is unusable with the desired mapping, because the EN counteracts the relaxed volume geometry. Therefore accurate reconstruction of atoms does matter to some extent also for CG models.

Furthermore, martinize2 was written with the intention of being usable independent of resolution. Yet this coordinate generator is another thing demonstrating that it is actually targeted specifically for the conversion of AA to CG nothing more nothing less. And in this aspect it is even worse than PDB2GMX. Not that we would have time to fix all this, but it should have been clear that an approximate coordinate generator is not really good enough for what martinze2 claims to be. That is my opinion on it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the method needs to be improved. This is a first take on it. It is, however, not an 'approximate coordinate generator'. To get this first version, a few shortcuts were taken, which can be relatively easily expanded on, but that takes time. The elastic network could be a concern. Is that applied on side chains? Or only on backbone beads? The latter would not be a concern. The first would, because the orientation of side chains cannot be considered optimized with respect to the neighborhood. A workaround would be to do a simulation with backbones fixed to allow side chains to equilibrate before applying the EN. But that's workflow-wise for the case where you have (many) missing parts). I wouldn't want to suggest doing too much energy minimizing stuff in martinize.

A somewhat greater concern of mine is the lack of chirality information, which is really problematic for atomistics scale models. But that cannot only be solved in this part, but also needs directives in the blocks/links that can then be taken into account. Here, the processor already has the mechanics (up/down), but nothing to base the placement on.

If it is only hydrogens that need to be built, the result is pretty much the same as what pdb2gmx does. pdb2gmx cannot build sidechains, not even a single atom. It can build you a C- or N-terminus, but that's it. So this is already miles better.

Okay, an inventory of the problems:

  1. The CB is, for some reason, considered SP2, which causes HB1 (HB in VAL) to be placed on top of CA. This only happens if CB is built by the routine, not if it is already present. (expected to be easy to fix)
  2. Rings are not constructed correctly. (expected to be somewhat hard)
  3. The arginine guanidinium group is not planar. (expected to be relatively easy)

Otherwise the sidechains are well-formed, although not always placed optimally. There are optimizations possible without going to energy minimization.

Copy link
Member

@pckroon pckroon Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds pretty good to me.

I hereby veto any and all energy minimisation in vermouth/martinize2, which should answer the fist paragraph. It would, however, be good (excellent even) to issue a warning when reconstructing chains of atoms (a H is fine, a CH3 probably also fine, CH2-CH3 probably not) to advise users to energy minimize before doing e.g. elastic networks.

Chirality: Detect when something could be chiral (95%), and warn. No longer our problem. We can revisit this later.

1 and 3 should be fixed.
For 2 we can cheat as far as I'm concerned. Just assume it's a perfect polygon, and reconstruct the ring as-a-whole. Not ideal, especially regarding sugar, but maybe good enough?

Bottom line: best effort, and if we/you don't trust the result make a loud warning explaining to the user what they can do. If there's a warning no files are written (unless maxwarn), so it'll no longer be our problem.

B = B - a
u = -0.5 * (B[0] + B[1])
v = np.cross(B[0], B[1])
# Set length of v to length to half distance between subs
v = 0.5 * ((B[0] - B[1])**2).sum()**0.5 * v / (v ** 2).sum()**0.5
# Normalization of vector sum
n = 1 / ((u + v)**2).sum()**0.5
# Duck type way to see if we got one or two distances
try:
rup, rdown = distance
except TypeError:
rup = rdown = distance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in terms of signature this isn't very clear either, because once distance acts as a float as in the default but then it can also be a list/array? You could have it as tuple or dict or any other way as well. In addition this try accept seems very broad and prevents sensible errors if the Type of distances is actually wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A distance could be specified now as float or as two floats. There's a laziness there. So for signatures sake it shall be two floats always.

pup = a + rup * n * (u + v)
pdown = a + rdown * n * (u - v)
if up is not None:
molecule.nodes[up]['position'] = pup
if down is not None:
molecule.nodes[down]['position'] = pdown

return pup, pdown
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before; either return distances or update molecule.



class Segment:
'''Class for subgraphs of (missing) atoms with anchors'''
def __init__(self, molecule, limbs):
self.molecule = molecule
self.anchors = limbs[0][0]
self.bases = limbs[0][1]
Comment on lines +198 to +202
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially, needs doc-string of what is what

self.limbs = [ s[2] for s in limbs ]

def __str__(self):
return ':'.join([str(self.anchors), str(self.bases), str(self.limbs)])

def extrude(self, distance=0.125, coords=None, contact=0.5):
mol = self.molecule

for i, (a, b) in enumerate(zip(self.anchors, self.bases)):
target = [ k for k in mol.neighbors(a) if k not in b ]
valence = len(target) + len(b)

if valence > 2 and len(b) == valence - 1:
# Trivial chiral/flat/bipyramidal/octahedral/...
pos = _out(mol, a, target[0], control=b)
for k in self.limbs:
k.discard(target[0])
# Simply update this segment and return it
self.bases[i] = [a]
self.anchors[i] = target[0]
return [self]

if valence == 4 and len(b) == 2:
# Trivial chiral - except for the actual chirality
# a simple 'up' or 'down' flag would be helpful
# amino acid side chain is 'up'
up, down = _chiral(mol, a, target[0], target[1], control=b)
# Simply update this segment and return it
for k in self.limbs:
k.discard(target[0])
self.bases[i] = [a]
self.anchors[i] = target[0]
return [self]

if valence == 1:
# Generate a sphere of points
# Check for overlaps
# Take a pick
# Should be done last anyway
# Best collected and stored
continue

# So a valence of 2 (missing 1), 3 (missing 2) or 4 (missing 3)
# The cases 2(1), 3(2), 4(3) can be

if len(b) == 1 and all(len(s) < 100 for s in self.limbs):
ax = mol.nodes[a]['position']
bx = mol.nodes[b[0]]['position']
P = conicring(axis=ax-bx) + ax
if coords is not None:
P[mindist(P, coords) < contact**2] = np.nan
P = P.reshape((len(self.limbs), -1, 3))
for px in P.transpose((1,0,2)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little bit of lint

if px.max() == np.nan:
continue
# Split the limbs to new segments and return those
#out = []
for t, pi in zip(target, px):
self.molecule.nodes[t]['position'] = pi
# for k in self.limbs:
# if len(k) > 1 and t in k:
# k.discard(t)
# for g in nx.connected_components(self.molecule.subgraph(k)):
# out.append(
# Segment(
# self.molecule,
# [(t, [[a]], g)],
# self.coords,
# self.contact
# )
# )
# This commented-out stuff may be useful for speeding up
# coordinate generation, because it avoids needing lookup
# of missing coordinates in subsequent cycles.
#return out
return
break
return None


class GenerateCoordinates(Processor):
"""
A processor for generating coordinates for atoms without.
"""
def __init__(self, contact=0.5):
super().__init__()
self.contact = contact
self.coordinates = None

def run_system(self, system):
"""
Process `system`.

Parameters
----------
system: vermouth.system.System
The system to process. Is modified in-place.
"""

self.coordinates = np.stack(
[
a.get('position', (np.nan, np.nan, np.nan))
for m in system.molecules
for i, a in m.atoms
]
)

for molecule in system.molecules:
self.run_molecule(molecule)

def run_molecule(self, molecule):
"""
Process a single molecule. Must be implemented by subclasses.

Parameters
----------
molecule: vermouth.molecule.Molecule
The molecule to process.

Returns
-------
vermouth.molecule.Molecule
The provided molecule with complete coordinates
"""

# Missing atoms - those without 'positions'
missing = get_missing_atoms(molecule)

prev = len(molecule)
while prev - len(missing) > 0:
# Limbs: subgraphs of missing atoms with anchors and bases
limbs = []
for g in nx.connected_components(molecule.subgraph(missing)):
anchors = list(get_anchors(molecule, g))
bases = [ list(get_anchors(molecule, [a])) for a in anchors ]
pckroon marked this conversation as resolved.
Show resolved Hide resolved
limbs.append((anchors, bases, g))
limbs.sort()

# Segment: subgraphs of limbs sharing anchors and bases
segments = []
prev = None
for lim in limbs:
if lim[0] != prev:
segments.append([])
prev = lim[0]
segments[-1].append(lim)
segments = [ Segment(molecule, limbs) for limbs in segments ]

# Extrude segments one layer at a time
while segments:
s = segments.pop(0)
r = s.extrude(coords=self.coordinates, contact=self.contact)
if r is not None:
segments.extend(r)

prev = len(missing)
missing = get_missing_atoms(molecule)
#print(len(missing))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover


return molecule


def main(args):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be (re)moved before merging

'''
Main function for running processor with a PDB file as input (testing).
'''
pdb = args[1]

system = vermouth.System()
vermouth.PDBInput(pdb).run_system(system)
system.force_field = vermouth.forcefield.get_native_force_field('universal')

flow = [
vermouth.MakeBonds(allow_name=True, allow_dist=True, fudge=1),
vermouth.RepairGraph(delete_unknown=True, include_graph=True),
vermouth.SortMoleculeAtoms(),
GenerateCoordinates()
]

for process in flow:
process.run_system(system)

vermouth.pdb.write_pdb(system, args[2], nan_missing_pos=True)

return 0


if __name__ == '__main__':
import sys
import vermouth
import vermouth.forcefield
import vermouth.map_input

exco = main(sys.argv)
sys.exit(exco)