symbol and char in the elements.txt #6
Comments
@dengjianyuan I was just about to ask the same question. The function that describes how they standardise SMILES is here: MolBERT/molbert/utils/featurizer/molfeaturizer.py, line 1099 in b410cb6
There is a flag called canonicalise, which is set to True.
There is also a special set of characters, which includes @, here: MolBERT/molbert/utils/featurizer/molfeaturizer.py, line 1028 in b410cb6
@@ is already included here: MolBERT/molbert/utils/data/elements.txt, line 120 in b410cb6
MolBERT/molbert/utils/featurizer/molfeaturizer.py Line 1014 in b410cb6
So to answer 2), I think both @@ and @ are considered, but in different parts of the code. I would, however, like to know whether the input sequences (after they standardise the SMILES) contain any isomeric information, which is critical for some ligands such as carbohydrates.
Hey guys, thanks for your interest in our work, and apologies for the slow reply. I am also very glad we can provide source code. I believe the same answer applies to many of the queries raised here. Since we tokenize SMILES character by character, we must handle multi-character elements differently. This is why they are separated in the code. For example, [Os] should be treated as osmium and not as aliphatic oxygen followed by aromatic sulphur. For most cases, our solution is to Kekulize our SMILES. After kekulization, aromatic sulphur is written in upper case and can no longer be confused with [Os]. In the special case of aromatic [Se], we handle this differently, since it is a two-character element that has both aliphatic and aromatic forms. This is also the reason for separating @ and @@.
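To make the tokenization point above concrete, here is a minimal sketch of why multi-character tokens need separate handling. It is not MolBERT's actual featurizer: the token list `MULTI_CHAR_TOKENS` and the function `tokenize` are hypothetical names chosen for illustration, and the real lists live in elements.txt and molfeaturizer.py. The sketch also assumes the SMILES has already been kekulized, so a lower-case aromatic `s` never follows `O`.

```python
import re

# Hypothetical list of multi-character tokens, for illustration only.
# MolBERT's real lists are in elements.txt and molfeaturizer.py and may differ.
MULTI_CHAR_TOKENS = ["@@", "Cl", "Br", "Os", "Se"]

# Try the longest tokens first so "@@" wins over "@"; the final "." matches
# any other single character, which gives the char-by-char fallback.
_TOKEN_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(MULTI_CHAR_TOKENS, key=len, reverse=True))
    + "|."
)


def tokenize(smiles):
    """Split a (kekulized) SMILES string into tokens, keeping multi-character tokens whole."""
    return _TOKEN_RE.findall(smiles)


# Naive character splitting breaks osmium into "O" and "s".
naive_tokens = list("[Os]")
# The multi-character-aware version keeps "Os" together.
aware_tokens = tokenize("[Os]")

# "@@" and "@" come out as two distinct tokens.
d_alanine_tokens = tokenize("C[C@@H](N)C(=O)O")
l_alanine_tokens = tokenize("C[C@H](N)C(=O)O")
```

In this sketch, listing `@@` explicitly is what keeps it from being read as two consecutive `@` characters, which would otherwise make the two chirality marks indistinguishable at the token level.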
@LivC182 Regarding input sequences containing stereoisomeric information: MolBERT can handle @ and @@. These would be tokenized, encoded by the featurizer and potentially learned by our representation. However, we have not tested this, since our training dataset (taken from GuacaMol) does not contain stereochemistry.
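For anyone wanting to check whether their own input SMILES carry stereochemistry before feeding them to the model, a rough textual check could look like the sketch below. The helper `has_stereo_markers` is a hypothetical name, not part of MolBERT, and the check is purely string-based: it looks for the SMILES chirality mark (@, which also covers @@) and the double-bond direction marks (/ and \) without parsing the molecule.

```python
# SMILES characters that encode stereochemistry: "@" (and "@@") for
# tetrahedral chirality, "/" and "\" for double-bond geometry.
STEREO_MARKERS = ("@", "/", "\\")


def has_stereo_markers(smiles):
    """Return True if the SMILES string contains any stereo marker character."""
    return any(marker in smiles for marker in STEREO_MARKERS)


# Count how many entries in a small example list carry stereo information.
example_smiles = ["C[C@@H](N)C(=O)O", "CC(N)C(=O)O", "F/C=C/F"]
n_with_stereo = sum(has_stereo_markers(s) for s in example_smiles)
```

A count of zero over a training set would be consistent with the statement above that the data contains no stereochemistry, although a proper check would parse each molecule with a cheminformatics toolkit.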
Hi,
Thank you for providing the source code for MolBERT, which is great work!
I have two questions about elements.txt.
Many thanks in advance!! =)