checkpoint saving error #35

Open
X-MAXXIX opened this issue Oct 1, 2024 · 1 comment

X-MAXXIX commented Oct 1, 2024

Thank you for developing this. I'm getting this error when saving checkpoints; the log is attached below. The training process also seems to break roughly every 20 hours of running, for unknown reasons. Is there anything that can be done to improve this? Thanks.

Epoch 0: 16%|█▌ | 7395/45743 [13:13:43<68:36:00, 6.44s/it, train_loss: 0.084, avg_loss: 0.080]
[rank3]: Traceback (most recent call last):
[rank3]: File "/workspace/naifu/trainer.py", line 58, in <module>
[rank3]: main()
[rank3]: File "/workspace/naifu/trainer.py", line 54, in main
[rank3]: Trainer(fabric, config).train_loop()
[rank3]: File "/workspace/naifu/common/trainer.py", line 316, in train_loop
[rank3]: self.on_post_training_batch()
[rank3]: File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
[rank3]: self.perform_sampling(is_last=is_last)
[rank3]: File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
[rank3]: os.makedirs(sampling_cfg.save_dir, exist_ok=True)
[rank3]: File "/usr/lib/python3.10/os.py", line 215, in makedirs
[rank3]: makedirs(head, exist_ok=exist_ok)
[rank3]: File "/usr/lib/python3.10/os.py", line 225, in makedirs
[rank3]: mkdir(name, mode)
[rank3]: OSError: [Errno 5] Input/output error: '/app/naifu555'
[rank: 3] Child process with PID 170 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/naifu/trainer.py", line 58, in <module>
[rank0]: main()
[rank0]: File "/workspace/naifu/trainer.py", line 54, in main
[rank0]: Trainer(fabric, config).train_loop()
[rank0]: File "/workspace/naifu/common/trainer.py", line 316, in train_loop
[rank0]: self.on_post_training_batch()
[rank0]: File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
[rank0]: self.perform_sampling(is_last=is_last)
[rank0]: File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
[rank0]: os.makedirs(sampling_cfg.save_dir, exist_ok=True)
[rank0]: File "/usr/lib/python3.10/os.py", line 215, in makedirs
[rank0]: makedirs(head, exist_ok=exist_ok)
[rank0]: File "/usr/lib/python3.10/os.py", line 225, in makedirs
[rank0]: mkdir(name, mode)
[rank0]: OSError: [Errno 5] Input/output error: '/app/naifu555'

Owner

Mikubill commented Oct 1, 2024

You may need to specify an existing path for sample storage:

save_dir: "samples"

Or disable sampling completely by setting sample.enabled = False:
enabled: false
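
If you keep sampling on, one way to catch a bad save_dir before hours of training is a small preflight check. This is only a rough sketch, not part of naifu: the helper name check_save_dir is hypothetical, and it assumes you call it yourself with the configured save_dir before starting the trainer. It tries to create the directory and write a throwaway file, which is where the mkdir call in the traceback above failed.

```python
import os
import tempfile


def check_save_dir(save_dir: str) -> bool:
    """Return True if save_dir exists (or can be created) and accepts a write.

    Hypothetical helper for illustration; naifu does not ship this function.
    """
    try:
        # Same call that raised OSError in the traceback.
        os.makedirs(save_dir, exist_ok=True)
        # Create and auto-delete a small probe file to confirm the
        # directory is actually usable, not just present.
        with tempfile.NamedTemporaryFile(dir=save_dir, prefix=".probe_"):
            pass
        return True
    except OSError as exc:
        print(f"sample dir {save_dir!r} is not usable: {exc}")
        return False
```

Calling check_save_dir("samples") once at startup and aborting on False would surface a broken mount or bad path immediately. It cannot guard against storage that fails later in the run.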
