Only output one prediction for a batch of input #120

Open
chunqiuxia-x opened this issue Dec 23, 2024 · 3 comments

Comments

@chunqiuxia-x

Hi everyone,

We updated the code from release v0.3.2 to v0.4.0. When we set the input path to a directory containing several .yaml files, we only get a prediction for one of the input .yaml files. We notice that MSA files are generated for all inputs and that several GPUs are used for modelling, so we suspect that all of the complexes are modelled but only one prediction is saved.

@jwohlwend
Owner

Is it possible that some went OOM? Can you paste the logs here?

@chunqiuxia-x
Author

We didn't see any information about an OOM. The log is as follows:

Checking input data.
Running predictions for 3 structures
Processing input data.
0%| | 0/3 [00:00<?, ?it/s]Generating MSA for /job/data/in/8ew6.yaml with 2 protein entities.
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:11 remaining: 00:00]
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:09 remaining: 00:00]
33%|███████████████████████████████████████████ | 1/3 [00:22<00:45, 22.73s/it]Generating MSA for /job/data/in/8hs2.yaml with 3 protein entities.
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:02 remaining: 00:00]
67%|██████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:30<00:13, 13.68s/it]Generating MSA for /job/data/in/ligand.yaml with 1 protein entities.
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [elapsed: 00:02 remaining: 00:00]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:32<00:00, 10.96s/it]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A800 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Checking input data.
Running predictions for 3 structures
Checking input data.
Running predictions for 3 structures
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:37<00:00, 0.01it/s]Number of failed examples: 0
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:37<00:00, 0.01it/s]

@chunqiuxia-x
Author

Hi @jwohlwend, I have found the reason why this issue occurs. When --devices is set to a value larger than 1, the main process may finish before the subprocesses once the task on local_rank 0 has completed. If the program is running in a Docker container, the container stops as soon as the main process exits, killing the other ranks before they can write their outputs. To resolve this, you can add a synchronization barrier after trainer.predict() in main.py, as shown below.

    # Compute predictions
    trainer.predict(
        model_module,
        datamodule=data_module,
        return_predictions=False,
    )

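    # Wait for all ranks to finish writing their predictions before rank 0
    # (the main process, and therefore the container) is allowed to exit.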
    trainer.strategy.barrier()
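
trainer.strategy.barrier() makes rank 0 wait until every rank has reached the same point, so the container stays alive until all ranks have written their predictions. For reference, the same guard can also be expressed with the raw torch.distributed API; this is only a minimal sketch, assuming the NCCL process group shown in the log above has been initialized, and is not code that ships in main.py:

    import torch.distributed as dist

    # Illustrative equivalent of trainer.strategy.barrier(): block every rank,
    # including rank 0, until all ranks reach this point, so the main process
    # does not exit before the other ranks finish writing their outputs.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()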
