symbol and char in the elements.txt #6

dengjianyuan · 2021-01-29T18:46:02Z

Hi,

Thank you for providing the source code for MolBERT, which is great work!

I have two questions about elements.txt.

1. Why is only 'se' denoted as AromaticSe? What about aromatic C/N/S, etc.?
2. Why is only @@ recorded for chirality? Don't we also need to record @ for counterclockwise chirality, which is a common symbol in SMILES strings?

Many thanks in advance!! =)

LivC193 · 2021-02-19T18:19:38Z

@dengjianyuan I was just about to ask the same question. The function that describes how they standardise SMILES is here:

MolBERT/molbert/utils/featurizer/molfeaturizer.py

Line 1099 in b410cb6

    
           def standardise(self, smiles: str, canonicalise: Optional[bool] = None) -> Optional[str]:

There is a flag called canonicalise which is set to True.

Also there is a special set of chars here which includes @:

MolBERT/molbert/utils/featurizer/molfeaturizer.py

Line 1028 in b410cb6

self.smiles_special_chars = (

@@ is already included here:

MolBERT/molbert/utils/data/elements.txt

Line 120 in b410cb6

Chirality,@@,-1,α

and used here:

MolBERT/molbert/utils/featurizer/molfeaturizer.py

Line 1014 in b410cb6

self.elements, self.chars = self.load_periodic_table()

So to answer 2), I think both @@ and @ are considered, but in different parts of the code. I would, however, like to know whether the input sequences (after they standardise SMILES) contain any isomeric information, which is critical for some ligands such as carbohydrates.
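For reference, the `elements.txt` line quoted above looks like a four-field comma-separated record. Below is a minimal sketch of reading such a line. The field meanings (name, symbol, number, single-character stand-in) are guessed from that one line and the issue title, and `Entry` and `parse_entries` are hypothetical names, not taken from the repository:

```python
import csv
from dataclasses import dataclass
from io import StringIO
from typing import List


@dataclass
class Entry:
    name: str    # assumed: descriptive name, e.g. "Chirality"
    symbol: str  # SMILES symbol, e.g. "@@"
    number: int  # assumed: atomic number, with -1 for non-element entries
    char: str    # single-character stand-in, e.g. "α"


def parse_entries(text: str) -> List[Entry]:
    """Parse comma-separated records, skipping blank lines."""
    rows = csv.reader(StringIO(text))
    return [Entry(r[0], r[1], int(r[2]), r[3]) for r in rows if r]


# The only record quoted in this thread.
entries = parse_entries("Chirality,@@,-1,α\n")
print(entries[0])
```

Read this way, the two-character symbol `@@` is paired with a single character, which would be consistent with `@` being listed separately among the single special characters.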

JoshuaMeyers · 2021-02-22T13:01:52Z

Hey Guys, thanks for your interest in our work and apologies for the slow reply. I am also very glad we can provide source code.

I believe the same answer applies to many of the queries raised here. Since we tokenize SMILES char by char, we must handle multi-character elements differently. This is why they are separated in the code. For example, [Os] should be treated as osmium, not as an aliphatic oxygen followed by an aromatic sulphur.

In most cases, our solution is to kekulize our SMILES. After kekulization, aromatic sulphur is written in upper case and can no longer be confused with [Os]. In the special case of aromatic [Se], we handle this differently since it is a two-character element that has both aliphatic and aromatic forms.

This is also the reason for separating @ and @@.
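To illustrate the point, here is a minimal sketch of char-by-char tokenization with multi-character symbols swapped for single placeholder characters first. It is not the MolBERT featurizer itself: `SYMBOL_TO_CHAR`, `encode_multichar` and `tokenize` are hypothetical names, and only the `@@` to `α` pair comes from the `elements.txt` line quoted earlier in the thread. The other placeholders are invented for the example.

```python
from typing import Dict, List

# Hypothetical symbol -> placeholder table. Only "@@" -> "α" is quoted in this
# thread; the remaining entries are made up for illustration.
SYMBOL_TO_CHAR: Dict[str, str] = {
    "@@": "α",
    "Os": "β",
    "Se": "γ",
    "se": "δ",
}


def encode_multichar(smiles: str, table: Dict[str, str] = SYMBOL_TO_CHAR) -> str:
    """Replace each multi-character symbol with its single placeholder character."""
    # Longest symbols first, so a longer symbol is never split by a shorter one.
    for symbol in sorted(table, key=len, reverse=True):
        smiles = smiles.replace(symbol, table[symbol])
    return smiles


def tokenize(smiles: str) -> List[str]:
    """Char-by-char tokenization, applied after placeholder substitution."""
    return list(encode_multichar(smiles))


naive = list("[Os]")                   # splits osmium into 'O' and 's'
tokens = tokenize("[Os]")              # osmium stays a single token
chiral = tokenize("N[C@@H](C)C(=O)O")  # "@@" becomes one token, distinct from "@"
print(naive, tokens, chiral)
```

The plain string replacement is only safe under the assumption that the input is already kekulized: with lowercase aromatic atoms still present, a sequence such as an aliphatic `O` followed by an aromatic `s` would be indistinguishable from `Os`.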

JoshuaMeyers · 2021-02-22T13:05:03Z

@LivC193 Regarding input sequences containing stereoisomeric information: MolBERT can handle @ and @@. These would be tokenized, encoded by the featurizer and potentially learned by our representation. However, we have not tested this, since our training dataset (taken from GuacaMol) does not contain stereochemistry.
