Evaluation with different training poses? #18
Thanks for your interest in our work. We have not tried training the model with more data, but we speculate that more data would help because the current training set is still limited. For dataset filtering and processing, we directly used the dataset from the previous work, which filtered out data points with a binding pose RMSD greater than 1 Å, used mmseqs2 to cluster the data at 30% sequence identity, and then randomly drew 100,000 pairs for training. You can refer to that work for more details about data filtering.
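The three steps described above (RMSD filter, sequence clustering, random draw) can be sketched as follows. This is a minimal illustration and not the previous work's actual code: the function names, the `(protein_id, ligand_id, rmsd)` tuple layout, and the choice to keep only pairs whose protein falls in a given set of training clusters are all assumptions. The cluster file is assumed to be a two-column, tab-separated table of representative and member, which is the layout mmseqs2 writes for its cluster TSV output.

```python
import random


def parse_cluster_tsv(lines):
    """Map each member id to its cluster representative.

    Assumes two tab-separated columns per line: representative, member.
    """
    member_to_rep = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split('\t')
        member_to_rep[member] = rep
    return member_to_rep


def filter_and_sample(pairs, member_to_rep, train_clusters,
                      rmsd_thr=1.0, n_pairs=100_000, seed=0):
    """Keep low-RMSD pairs from the training clusters, then draw at most n_pairs.

    pairs: iterable of (protein_id, ligand_id, rmsd) tuples (hypothetical layout).
    train_clusters: set of cluster representatives assigned to training (assumption).
    """
    kept = [
        p for p in pairs
        if p[2] <= rmsd_thr and member_to_rep.get(p[0]) in train_clusters
    ]
    if len(kept) <= n_pairs:
        return kept
    # Seeded generator so the random draw is reproducible.
    return random.Random(seed).sample(kept, n_pairs)
```

Scaling the training set then amounts to raising `n_pairs`, as long as enough pairs survive the RMSD filter.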
Thank you for the detailed reply! However, the CrossDocked dataset provides receptor/gninatypes files, while the crossdocked 10 data provided here comes with pocket and docked SDF files. Could the data preprocessing code be made available here? :) I am fairly new to drug design!
You can refer to the CrossDocked dataset for the details of the dataset. We found a file:

    import os
    import shutil
    import pickle
    import argparse
    from tqdm.auto import tqdm

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--source', type=str, default='./data/crossdocked')
        parser.add_argument('--dest', type=str, required=True)
        parser.add_argument('--rmsd_thr', type=float, default=1.0)
        args = parser.parse_args()

        # Fails if the destination already exists, so earlier output is never overwritten.
        os.makedirs(args.dest, exist_ok=False)
        types_path = os.path.join(args.source, 'types/it2_tt_completeset_train0.types')

        index = []
        with open(types_path, 'r') as types_file:
            for ln in tqdm(types_file):
                _, _, rmsd, protein_fn, ligand_fn, _ = ln.split()
                rmsd = float(rmsd)
                # Skip poses whose RMSD exceeds the threshold.
                if rmsd > args.rmsd_thr:
                    continue

                # The trailing number in the ligand filename is the pose index.
                ligand_id = int(ligand_fn[ligand_fn.rfind('_')+1:ligand_fn.rfind('.')])

                # Map the .gninatypes names back to the raw .pdb / .sdf files.
                protein_fn = protein_fn[:protein_fn.rfind('_')] + '.pdb'
                ligand_raw_fn = ligand_fn[:ligand_fn.rfind('_')] + '.sdf'
                protein_path = os.path.join(args.source, protein_fn)
                ligand_raw_path = os.path.join(args.source, ligand_raw_fn)
                if not (os.path.exists(protein_path) and os.path.exists(ligand_raw_path)):
                    continue

                # The raw SDF holds several poses; pick the one at ligand_id.
                with open(ligand_raw_path, 'r') as ligand_file:
                    ligand_sdf = ligand_file.read().split('$$$$\n')[ligand_id]
                ligand_save_fn = ligand_fn[:ligand_fn.rfind('.')] + '.sdf'  # include ligand id

                protein_dest = os.path.join(args.dest, protein_fn)
                ligand_dest = os.path.join(args.dest, ligand_save_fn)
                os.makedirs(os.path.dirname(protein_dest), exist_ok=True)
                os.makedirs(os.path.dirname(ligand_dest), exist_ok=True)
                shutil.copyfile(protein_path, protein_dest)
                with open(ligand_dest, 'w') as out_file:
                    out_file.write(ligand_sdf)

                index.append((protein_fn, ligand_save_fn, rmsd))

        index_path = os.path.join(args.dest, 'index.pkl')
        with open(index_path, 'wb') as f:
            pickle.dump(index, f)

        print('Done. %d protein-ligand pairs in total.' % len(index))

Besides, you can also refer to the data processing part of a previous method, LiGAN.
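To see how the script maps `.gninatypes` entries back to `.pdb` and `.sdf` files, the filename handling can be pulled out and tried on a single line. This is a sketch that mirrors the string slicing in the script above. The helper name `parse_types_line` and the sample line are hypothetical and only follow the six-field layout the script expects; they are not taken from the real `.types` file.

```python
def parse_types_line(line, rmsd_thr=1.0):
    """Return the derived filenames for one .types line, or None if filtered out."""
    _, _, rmsd, protein_fn, ligand_fn, _ = line.split()
    rmsd = float(rmsd)
    if rmsd > rmsd_thr:
        return None
    # Drop the extension, then read the trailing pose index.
    ligand_stem = ligand_fn[:ligand_fn.rfind('.')]
    ligand_id = int(ligand_stem[ligand_stem.rfind('_') + 1:])
    return {
        'rmsd': rmsd,
        'protein_pdb': protein_fn[:protein_fn.rfind('_')] + '.pdb',
        'ligand_raw_sdf': ligand_fn[:ligand_fn.rfind('_')] + '.sdf',
        'ligand_id': ligand_id,
        'ligand_save_sdf': ligand_stem + '.sdf',
    }


# Hypothetical example line (made-up names, same field layout).
sample = ('1 5.0 0.75 POCKET_X/1abc_A_rec_0.gninatypes '
          'POCKET_X/1abc_A_rec_1abc_lig_tt_min_3.gninatypes #-7.5')
parsed = parse_types_line(sample)
```

Here `parsed['ligand_id']` is the position of the pose inside the raw multi-pose SDF file, which is why the script splits that file on `$$$$` and indexes into the result.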
Wonderful work!
I found that the model was trained using 100k poses. Is there a significant gain when more poses are included in the training set?
I am trying to scale up the data to see what happens. Is the code with the filtering logic for CrossDocked available?