Hi, when I run the latest v1.3 code for fine-tuning, training fails every time the program tries to save a checkpoint, as shown below. I never hit this issue with the previous version. Do you know why? After searching related issues, it looks like host RAM is running out (OOM). Do you know how to solve it?
[2024-10-28 06:14:36,059] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598579 closing signal SIGTERM
[2024-10-28 06:14:36,060] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598580 closing signal SIGTERM
[2024-10-28 06:14:36,060] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598581 closing signal SIGTERM
[2024-10-28 06:14:36,061] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598583 closing signal SIGTERM
[2024-10-28 06:14:36,061] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598584 closing signal SIGTERM
[2024-10-28 06:14:36,062] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598586 closing signal SIGTERM
[2024-10-28 06:14:36,062] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598587 closing signal SIGTERM
[2024-10-28 06:14:37,244] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 598577) of binary: /opt/conda/bin/python
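Exit code -9 means the rank-0 process received SIGKILL, which on Linux is usually the kernel OOM killer reclaiming host RAM rather than a GPU out-of-memory error. A minimal way to confirm this (a sketch only: it assumes psutil is installed, and save_checkpoint() plus the output path are placeholders for whatever the training script actually calls) is to log resident memory right around the save:

```python
# Sketch: log host RAM immediately before and after the checkpoint save to check
# whether resident memory spikes at that point. Assumes psutil is installed;
# save_checkpoint() is a placeholder for the script's real checkpointing call.
import os
import psutil

def log_host_memory(tag: str) -> None:
    """Print this process's RSS and system-wide available RAM in GiB."""
    proc = psutil.Process(os.getpid())
    rss_gib = proc.memory_info().rss / 2**30
    avail_gib = psutil.virtual_memory().available / 2**30
    print(f"[{tag}] rss={rss_gib:.1f} GiB, available={avail_gib:.1f} GiB", flush=True)

log_host_memory("before checkpoint save")
# save_checkpoint(...)  # placeholder for the actual checkpoint call in the script
log_host_memory("after checkpoint save")
```

If the process disappears between the two prints and `dmesg` on the node shows an "Out of memory: Killed process" line for the rank-0 PID, then it is host (CPU) RAM, not GPU memory, that ran out during the save.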
@LinB203 Sorry, I didn't see your quick reply! The monitored memory looks normal; maybe wandb didn't capture the steps where it broke. I don't have much data (less than 1M), and training only takes a few hours. So you have never run into this kind of issue?
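One common cause of a host-RAM spike only at save time is that every rank materializes a full, unsharded state dict in CPU memory at once. The sketch below shows one way to avoid that, assuming the run wraps the model in PyTorch FSDP; this is an assumption, not what the v1.3 code necessarily does (if it uses DeepSpeed or plain DDP, this does not apply as written). It gathers the full state dict on rank 0 only, offloaded to CPU, so the other ranks never hold a full copy:

```python
# Sketch, under the assumption that the model is wrapped in PyTorch FSDP.
# FullStateDictConfig(offload_to_cpu=True, rank0_only=True) makes only rank 0
# materialize the unsharded state dict, in CPU memory, at save time.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_full_checkpoint(model: FSDP, path: str) -> None:
    """Save an unsharded checkpoint from rank 0 without replicating it on every rank."""
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = model.state_dict()  # non-zero ranks receive an empty dict
    if dist.get_rank() == 0:
        torch.save(state_dict, path)
```

Whether this is applicable depends on how the v1.3 training loop actually saves checkpoints, so treat it as a direction to investigate rather than a confirmed fix.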