Inputs during training and inference #5

zchwang · 2024-06-27T06:55:00Z

Hi,

Excellent work! I am reading ProtSSN and trying to use it, but I have a few questions:

1. The input in downstream tasks is the entire protein, while the training set uses CATH domains. How does this difference affect the model's performance?
2. The model's inputs during training are crystal structures, but AF2 or ESM2 predicted structures are used for inference. How much bias does this introduce?
3. If I want to use ProtSSN for downstream tasks, do I just use the code provided in README to extract embeddings?

Congratulations again on your work!

Best regards

tyang816 · 2024-06-28T07:24:31Z

Hi, Wang,

1. Good question! From a biological perspective, CATH domains already cover a sufficient range of protein structural patterns. From a computational perspective, though, this is indeed a gap, and we plan to run additional experiments to test it.
2. We don't know how much error this introduces in computational (dry-lab) experiments, because crystal structures are unavailable for most proteins. We are currently running wet-lab validation, and so far both predicted structures and crystal structures seem to work well. We may be able to answer this question in future iterations of our work.
3. I have added new code for fine-tuning ProtSSN on any downstream task; you can see it here. You can provide a CSV file with labels, together with PDB files, for training. The dataset format can be found here.
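
For anyone preparing such a dataset, a quick sanity check before training might look like the sketch below. It is only an illustration: the column names `name` and `label`, the `<name>.pdb` file-naming convention, and the helper `find_missing_structures` are assumptions made for this example, not the repository's documented format, so adapt them to the format linked above.

```python
import csv
from pathlib import Path


def find_missing_structures(csv_path, pdb_dir, name_col="name", label_col="label"):
    """Return the names of CSV rows that have no matching PDB file.

    Hypothetical layout (an assumption, not ProtSSN's documented format):
    one row per protein, a name column matching a file <name>.pdb in
    pdb_dir, and a label column holding the training target.
    """
    pdb_dir = Path(pdb_dir)
    missing = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Fail early on rows that carry no label at all.
            if not row[label_col].strip():
                raise ValueError(f"row {row[name_col]!r} has an empty label")
            # Record every protein whose structure file is absent.
            if not (pdb_dir / f"{row[name_col]}.pdb").is_file():
                missing.append(row[name_col])
    return missing
```

Calling `find_missing_structures("train.csv", "pdbs/")` (both paths are placeholders) returns a list of proteins to fix or drop before starting a fine-tuning run.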

Thank you for your interest, and you are welcome to follow our latest work, ProSST. 😊
