symbol and char in the elements.txt #6

Open
dengjianyuan opened this issue Jan 29, 2021 · 3 comments

@dengjianyuan

Hi,

Thank you for providing the source code for MolBERT, which is great work!

I have two questions about elements.txt.

  1. Why is only 'se' denoted as AromaticSe? What about aromatic C/N/S, etc.?
  2. Why is only @@ recorded for chirality? Don't we also need to record @ (counterclockwise), which is a common symbol in SMILES strings?

Many thanks in advance!! =)

@LivC193

LivC193 commented Feb 19, 2021

@dengjianyuan I was just about to ask the same question. The function that describes how they standardise SMILES is here:

def standardise(self, smiles: str, canonicalise: Optional[bool] = None) -> Optional[str]:

There is a flag called canonicalise which is set to True.

Also there is a special set of chars here which includes @:

self.smiles_special_chars = (

@@ is already included here:

Chirality,@@,-1,α

and used here:

self.elements, self.chars = self.load_periodic_table()

So to answer 2), I think both @@ and @ are handled, but in different parts of the code. I would, however, like to know whether the input sequences (after SMILES standardisation) contain any isomeric information, which is critical for some ligands such as carbohydrates.

@JoshuaMeyers

Hey Guys, thanks for your interest in our work and apologies for the slow reply. I am also very glad we can provide source code.

I believe the same answer applies to many of the queries raised here. Since we tokenize SMILES char by char, we must handle multi-character elements differently. This is why they are separated in the code. For example, [Os] should be treated as osmium, not as aliphatic oxygen followed by aromatic sulphur.

In most cases, our solution is to kekulize the SMILES. After kekulization, aromatic sulphur is written in upper case and can no longer be confused with [Os]. The special case of aromatic [se] is handled differently, since selenium is a two-character element that has both aliphatic and aromatic forms.

This is also the reason for separating @ and @@.
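
To illustrate the ambiguity described above, here is a minimal sketch of a tokenizer that tries multi-character tokens before falling back to single characters. It is not MolBERT's actual implementation: the function name `tokenize` and the `MULTI_CHAR_TOKENS` table are hypothetical and chosen only for this example, and the greedy matching assumes the SMILES has already been kekulized so that no lower-case aromatic atoms other than `se` remain.

```python
from typing import List, Sequence

# Hypothetical table of multi-character tokens (an assumption for this sketch,
# not the vocabulary MolBERT ships in elements.txt).
MULTI_CHAR_TOKENS = ("@@", "Os", "Se", "se", "Cl", "Br")


def tokenize(smiles: str, multi_char_tokens: Sequence[str] = MULTI_CHAR_TOKENS) -> List[str]:
    """Split a SMILES string into tokens, preferring the longest known token."""
    # Try longer tokens first so that "@@" wins over "@" and "Os" over "O".
    ordered = sorted(multi_char_tokens, key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(smiles):
        for tok in ordered:
            if smiles.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # No multi-character token starts here: emit a single character.
            tokens.append(smiles[i])
            i += 1
    return tokens


# Naive char-by-char splitting reads osmium as "O" followed by "s".
print(list("[Os]"))            # ['[', 'O', 's', ']']
print(tokenize("[Os]"))        # ['[', 'Os', ']']
print(tokenize("C[C@@H](N)O")) # "@@" stays a single token
print(tokenize("C[C@H](N)O"))  # "@" falls through as a single character
```

With a table like this, `@@` is matched as a listed multi-character token while `@` is picked up by the single-character fallback, which is one way the two symbols can end up in different parts of a codebase.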

@JoshuaMeyers

@LivC193 Regarding input sequences containing stereoisomeric information: MolBERT can handle @ and @@. These would be tokenized, encoded by the featurizer and potentially learned by our representation. However, we have not tested this, since our training dataset (taken from GuacaMol) does not contain stereochemistry.
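
For anyone who wants to check whether their own input SMILES carry tetrahedral chirality tags at all, a rough sketch follows. It simply counts `@` and `@@` occurrences with a regular expression; the helper name `count_chiral_tags` is hypothetical, it is not part of MolBERT, and it ignores other stereo notation such as `/` and `\` bonds or extended chirality classes.

```python
import re
from collections import Counter

# Greedy match: "@@" is consumed as one tag before a lone "@" is considered.
_CHIRAL_TAG = re.compile(r"@@?")


def count_chiral_tags(smiles: str) -> Counter:
    """Count '@' and '@@' tags in a SMILES string (hypothetical helper)."""
    return Counter(_CHIRAL_TAG.findall(smiles))


print(count_chiral_tags("N[C@@H](C)C(=O)O"))  # Counter({'@@': 1})
print(count_chiral_tags("N[C@H](C)C(=O)O"))   # Counter({'@': 1})
print(count_chiral_tags("CCO"))               # Counter()
```

An empty counter across a whole dataset would suggest it carries no tetrahedral stereo tags, as described above for the training data.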
