
"closing signal SIGTERM", seems like RAM OOM? #512

Open
colian opened this issue Oct 28, 2024 · 2 comments

Comments

@colian

colian commented Oct 28, 2024

Hi, when I run the latest v1.3 code for fine-tuning, training fails every time the program tries to save a checkpoint, as shown below. I never hit this issue with the previous version. Do you know why? From searching related issues, it looks like the host RAM is running out (OOM). Do you know how to solve it?

[2024-10-28 06:14:36,059] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598579 closing signal SIGTERM
[2024-10-28 06:14:36,060] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598580 closing signal SIGTERM
[2024-10-28 06:14:36,060] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598581 closing signal SIGTERM
[2024-10-28 06:14:36,061] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598583 closing signal SIGTERM
[2024-10-28 06:14:36,061] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598584 closing signal SIGTERM
[2024-10-28 06:14:36,062] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598586 closing signal SIGTERM
[2024-10-28 06:14:36,062] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 598587 closing signal SIGTERM
[2024-10-28 06:14:37,244] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 598577) of binary: /opt/conda/bin/python
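(Exit code -9 means the process received SIGKILL, which on Linux is most often the kernel OOM killer reclaiming host RAM, consistent with the "RAM OOM" guess; the kernel log around the failure time would confirm it. As a rough illustration, a minimal sketch, assuming psutil is installed and the checkpoint is written with torch.save, that logs host RAM on both sides of the save so the spike during serialization shows up in stdout:

```python
# Minimal sketch, not part of this repo: wrap the checkpoint save and log
# host RAM before and after, so an OOM spike during serialization is visible
# even when the experiment tracker misses the final step.
# Requires `pip install psutil`.
import psutil
import torch

def save_checkpoint_with_ram_log(state_dict, path):
    gb = 1 << 30
    before = psutil.virtual_memory()
    print(f"[RAM] before save: {before.used / gb:.1f}/{before.total / gb:.1f} GiB")
    torch.save(state_dict, path)
    after = psutil.virtual_memory()
    print(f"[RAM] after save:  {after.used / gb:.1f}/{after.total / gb:.1f} GiB")
```
)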

@LinB203
Member

LinB203 commented Oct 28, 2024

Can you post some memory-monitoring info? Do you have a lot of data? How long had you been training, and is there any sign of a memory leak?
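(One way to gather this information independently of wandb is a small background sampler that prints host RAM every few seconds, so the last reading before a kill survives in the training log. A hypothetical sketch, again assuming psutil; the helper name and interval are illustrative:

```python
# Hypothetical helper, not from this repo: sample host RAM in a daemon
# thread and print it periodically, so the last reading before an OOM
# kill is still present in stdout even if the logger drops that step.
import threading
import time
import psutil

def start_ram_sampler(interval_s: float = 5.0) -> threading.Thread:
    def _loop():
        gb = 1 << 30
        while True:
            vm = psutil.virtual_memory()
            print(f"[RAM] used {vm.used / gb:.1f} GiB / {vm.total / gb:.1f} GiB "
                  f"({vm.percent:.0f}%)", flush=True)
            time.sleep(interval_s)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
```
)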

@colian
Author

colian commented Oct 30, 2024

@LinB203 Sorry, I didn't see your quick reply! The monitored memory looks normal; maybe wandb didn't capture the step where it broke. I don't have much data (less than 1M). The training time is only a few hours. So you have never run into this kind of issue?
