Only output one prediction for a batch of input #120

Open
chunqiuxia-x opened this issue Dec 23, 2024 · 3 comments

Comments

@chunqiuxia-x

Hi everyone,

We updated the code from release v0.3.2 to v0.4.0. When we set the input path to a directory containing several .yaml files, we only get a prediction for one of the input .yaml files. We notice that MSA files are generated for all inputs and that several GPUs are used for modelling, so we suspect that all of the complexes are modelled but only one prediction is saved.

@jwohlwend
Owner

Is it possible that some went OOM? Can you paste the logs here?

@chunqiuxia-x
Author

We didn't see any information about an OOM. The log is as follows:

Checking input data.
Running predictions for 3 structures
Processing input data.
0%| | 0/3 [00:00<?, ?it/s]Generating MSA for /job/data/in/8ew6.yaml with 2 protein entities.
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:11 remaining: 00:00]
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [elapsed: 00:09 remaining: 00:00]
33%|███████████████████████████████████████████ | 1/3 [00:22<00:45, 22.73s/it]Generating MSA for /job/data/in/8hs2.yaml with 3 protein entities.
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:02 remaining: 00:00]
67%|██████████████████████████████████████████████████████████████████████████████████████ | 2/3 [00:30<00:13, 13.68s/it]Generating MSA for /job/data/in/ligand.yaml with 1 protein entities.
COMPLETE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [elapsed: 00:02 remaining: 00:00]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:32<00:00, 10.96s/it]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A800 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Checking input data.
Running predictions for 3 structures
Checking input data.
Running predictions for 3 structures
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:37<00:00, 0.01it/s]Number of failed examples: 0
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:37<00:00, 0.01it/s]

@chunqiuxia-x
Author

Hi @jwohlwend, I have found the reason why this issue occurs. When --devices is set to a value larger than 1, the main process may finish before the subprocesses once the task on local_rank 0 has completed. If the program is running in a Docker container, the container stops as soon as the main process exits, killing the other ranks before they can write their outputs. To resolve this, you can add a synchronization barrier after trainer.predict() in main.py, as shown below.

    # Compute predictions
    trainer.predict(
        model_module,
        datamodule=data_module,
        return_predictions=False,
    )

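    # Wait for all ranks to finish writing their predictions before rank 0
    # (the main process, and therefore the container) is allowed to exit.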
    trainer.strategy.barrier()
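
trainer.strategy.barrier() makes rank 0 wait until every rank has reached the same point, so the container stays alive until all ranks have written their predictions. For reference, the same guard can also be expressed with the raw torch.distributed API; this is only a minimal sketch, assuming the NCCL process group shown in the log above has been initialized, and is not code that ships in main.py:

    import torch.distributed as dist

    # Illustrative equivalent of trainer.strategy.barrier(): block every rank,
    # including rank 0, until all ranks reach this point, so the main process
    # does not exit before the other ranks finish writing their outputs.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()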
