
LM embeddings for complex did not have the right length for the protein #38

liyue9129 opened this issue Nov 7, 2024 · 7 comments

liyue9129 commented Nov 7, 2024

Hi!
Great work!
I have encountered several issues during the docking process, and I am trying to determine whether the issue originates from the protein structure or from the ESM embeddings used, and to pinpoint its exact nature:

`LM embeddings for complex did not have the right length for the protein`

[screenshot of the code where the error is raised]

Thank you!
Best wishes!

patjiang commented Nov 8, 2024

Hello! Rather than a screenshot of the code itself, could you also provide the associated stderr output? That is, if the code generates a print statement or a .out file, could you please show that as well?

From what I've seen working with this package, this should not be an endemic issue; rather, it can happen occasionally in the "pseudo-dynamics" timesteps of the model. That being said, I believe this issue is more likely due to the structure than to the ESM embeddings. I hope this helps!

liyue9129 (Author) commented

Thank you so much for your help!
The associated stderr output is as follows:
[screenshots of the stderr output]

Due to computer issues in the past few days that resulted in data loss, I apologize for not being able to provide the specific PDB complex file.
However, during my previous debugging I found that the problem originated from using the ESM embeddings to generate the protein features, as shown in the red box in the figure below: the first and second chains are fine, but the third and fourth chains encode only 37 and 32 amino acids, respectively.

[screenshot of the debugging output; the red box marks the per-chain ESM embedding lengths]

Best wishes!

patjiang commented

Hello,

Thank you for providing the associated output; for context, do the third and fourth chains also expect 559 residues?

Also, what args are you passing to the top-level command?

I hope to continue to provide support!

liyue9129 (Author) commented Nov 16, 2024

Hi!

Yes, the 3rd and 4th chains also have 559 amino acids.

The hyperparameters are as follows and are consistent with the README:
```python
import subprocess

# data_name and name are set earlier in my script.
command = [
    "python", "run_single_protein_inference.py",
    f"/public/home/user/complex_preparation/posebusters/{data_name}/{name}/{name}_protein.pdb",
    f"/public/home/user/complex_preparation/posebusters/{data_name}/{name}/{name}_ligand_smiles_to_csv.csv",
    "--savings_per_complex", "40",
    "--inference_steps", "20",
    "--header", f"{name}",
    "--device", "0",
    "--python", "/public/home/user/miniconda3/envs/dynamicbind/bin/python",
    "--relax_python", "/public/home/user/miniconda3/envs/relax/bin/python",
    "--result", f"/public/home/user/DLDock/DynamicBind/test/{data_name}",
]
subprocess.run(command, check=True)
```

It may take up to a month for me to provide the specific PDB complex file.

Best wishes!

patjiang commented

Hello,

Don't worry about providing the PDB files; I have done the stack trace by hand for you:

Top-level call: run_single_protein_inference.py. This runs and functions fine until inference.py is called (here).

Then, within inference.py, this line calls pdbBind, which leads to a call to the init here, which eventually calls inference_preprocessing here, which leads to the call to extract_receptor_structure here.

That call to extract_receptor_structure is what leads to your original problem, here.

To put this in context: the point of the stack trace is to see exactly where the lm_embeddings generation goes wrong. From the trace, it seems that the issue comes from the input to the extract_receptor_structure function, which is likely lm_embeddings_chains_all. For reference, this is generated in these lines here.
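For intuition, here is a minimal sketch of the kind of per-chain length check that raises this error (the variable names `parsed_chains` and `lm_embedding_chains` are assumptions for illustration, not the exact DynamicBind source):

```python
# Hypothetical sketch of the consistency check inside extract_receptor_structure:
# every parsed chain must have exactly one ESM2 embedding vector per residue.
for i, chain_residues in enumerate(parsed_chains):    # residues parsed from the PDB
    n_residues = len(chain_residues)
    n_embedded = lm_embedding_chains[i].shape[0]      # per-residue ESM2 vectors
    if n_residues != n_embedded:
        raise ValueError(
            "LM embeddings for complex did not have the right length for the protein: "
            f"chain {i} has {n_residues} residues but {n_embedded} embedding vectors."
        )
```

In your case, chains 3 and 4 would fail a check like this (559 residues vs. 37 and 32 embeddings).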

Since there is no hts option in your top-level command, I would next look into the path where the ESM embeddings are generated, as lm_embeddings_chains_all behaves differently depending on hts.

If the ESM2 outputs are still available, could you possibly provide the related files within data/esm2_output?

Otherwise, I would look into the outputs produced by the line `for embeddings_path in embeddings_paths: lm_embeddings_chains.append(torch.load(embeddings_path)['representations'][33])`.
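For example, here is a quick way to inspect the per-chain lengths in those saved ESM2 outputs; the `['representations'][33]` layout follows ESM's extract.py output for esm2_t33_650M, while the glob path is an assumption:

```python
import glob

import torch

# Print the per-residue embedding length of each saved ESM2 output file;
# a truncated or mis-parsed chain will show up with the wrong length.
for embeddings_path in sorted(glob.glob("data/esm2_output/*.pt")):
    rep = torch.load(embeddings_path)["representations"][33]
    print(embeddings_path, rep.shape)  # shape is (num_residues, embedding_dim)
```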

If none of the original environment is available, I would suggest that the `--truncation_seq_length` flag in esm/scripts/extract.py could be your issue, and you could try to replicate the issue with a large protein from the PDB, such as this one.
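As a sanity check, you could re-run the embedding extraction with a larger truncation length and see whether the chain lengths change. A sketch in the same subprocess style as your command above; the FASTA/output paths and the 4096 value are illustrative:

```python
import subprocess

# Hypothetical re-run of ESM's extract.py with a larger truncation length,
# to test whether sequence truncation is what shortens chains 3 and 4.
subprocess.run([
    "python", "esm/scripts/extract.py",
    "esm2_t33_650M_UR50D",
    "data/prepared_for_esm.fasta",   # one FASTA record per chain (illustrative path)
    "data/esm2_output",
    "--repr_layers", "33",
    "--include", "per_tok",
    "--truncation_seq_length", "4096",
], check=True)
```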

Best of luck!

liyue9129 (Author) commented

Hi!
It is so nice of you to provide such a clear stack trace!

I've tried to reproduce the issue and provide the ESM2 outputs.
To do so, I modified the path where preprocessed data is saved in run_single_protein_inference.py, and found that the error no longer occurred.

I therefore believe that when running many protein-ligand docking tasks, saving all preprocessed data in a single shared folder can lead to confusion and inconsistencies between jobs.

Thus, I suggest adding a path prefix, `data_pre_dir = f'{args.results}/{args.header}'`, to all data paths used in run_single_protein_inference.py, e.g. `f"{data_pre_dir}/data"`.
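For illustration, a minimal sketch of the suggested change; the exact variable and folder names inside run_single_protein_inference.py may differ:

```python
import os

# Hypothetical sketch: give each job its own preprocessing directory so that
# concurrent docking runs do not overwrite one another's cached data.
data_pre_dir = f"{args.results}/{args.header}"
os.makedirs(f"{data_pre_dir}/data", exist_ok=True)

# ...then point every preprocessed-data path at f"{data_pre_dir}/data"
# instead of the shared "data" folder.
```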

I apologize for taking up your valuable time.

Best wishes!

patjiang commented

Yeah, that makes sense! I wish you the best in your future use; have a good day!
