Hi, when I run the latest v1.3 code for fine-tuning, training fails every time the program tries to save a checkpoint, as shown below. I never hit this issue with the previous version. Do you know why? After searching related issues, it looks like host RAM is running out (OOM). Do you know how to solve it?
[2024-10-28 06:14:36,059] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598579 closing signal SIGTERM
[2024-10-28 06:14:36,060] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598580 closing signal SIGTERM
[2024-10-28 06:14:36,060] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598581 closing signal SIGTERM
[2024-10-28 06:14:36,061] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598583 closing signal SIGTERM
[2024-10-28 06:14:36,061] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598584 closing signal SIGTERM
[2024-10-28 06:14:36,062] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598586 closing signal SIGTERM
[2024-10-28 06:14:36,062] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598587 closing signal SIGTERM
[2024-10-28 06:14:37,244] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 598577) of binary: /opt/conda/bin/python
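Exit code -9 means the rank-0 process received SIGKILL, which on Linux is usually the kernel OOM killer reclaiming host RAM rather than a GPU out-of-memory error. A minimal way to confirm this (a sketch only: it assumes psutil is installed, and save_checkpoint() plus the output path are placeholders for whatever the training script actually calls) is to log resident memory right around the save:

```python
# Sketch: log host RAM immediately before and after the checkpoint save to check
# whether resident memory spikes at that point. Assumes psutil is installed;
# save_checkpoint() is a placeholder for the script's real checkpointing call.
import os
import psutil

def log_host_memory(tag: str) -> None:
    """Print this process's RSS and system-wide available RAM in GiB."""
    proc = psutil.Process(os.getpid())
    rss_gib = proc.memory_info().rss / 2**30
    avail_gib = psutil.virtual_memory().available / 2**30
    print(f"[{tag}] rss={rss_gib:.1f} GiB, available={avail_gib:.1f} GiB", flush=True)

log_host_memory("before checkpoint save")
# save_checkpoint(...)  # placeholder for the actual checkpoint call in the script
log_host_memory("after checkpoint save")
```

If the process disappears between the two prints and `dmesg` on the node shows an "Out of memory: Killed process" line for the rank-0 PID, then it is host (CPU) RAM, not GPU memory, that ran out during the save.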
@LinB203 Sorry, I didn't see your quick reply! The monitored memory looks normal; maybe wandb didn't capture the steps where it broke. I don't have much data (less than 1M), and training only takes a few hours. So you have never run into this kind of issue?
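One common cause of a host-RAM spike only at save time is that every rank materializes a full, unsharded state dict in CPU memory at once. The sketch below shows one way to avoid that, assuming the run wraps the model in PyTorch FSDP; this is an assumption, not what the v1.3 code necessarily does (if it uses DeepSpeed or plain DDP, this does not apply as written). It gathers the full state dict on rank 0 only, offloaded to CPU, so the other ranks never hold a full copy:

```python
# Sketch, under the assumption that the model is wrapped in PyTorch FSDP.
# FullStateDictConfig(offload_to_cpu=True, rank0_only=True) makes only rank 0
# materialize the unsharded state dict, in CPU memory, at save time.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_full_checkpoint(model: FSDP, path: str) -> None:
    """Save an unsharded checkpoint from rank 0 without replicating it on every rank."""
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = model.state_dict()  # non-zero ranks receive an empty dict
    if dist.get_rank() == 0:
        torch.save(state_dict, path)
```

Whether this is applicable depends on how the v1.3 training loop actually saves checkpoints, so treat it as a direction to investigate rather than a confirmed fix.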