Thank you for developing this. I'm getting the error below when saving checkpoints; the log is attached. The training process also seems to break after roughly every 20 hours of running, for unknown reasons. Is there anything that can be done to improve this? Thanks!
Epoch 0: 16%|█▌ | 7395/45743 [13:13:43<68:36:00, 6.44s/it, train_loss: 0.084, avg_loss: 0.080]
[rank3]: Traceback (most recent call last):
[rank3]: File "/workspace/naifu/trainer.py", line 58, in <module>
[rank3]: main()
[rank3]: File "/workspace/naifu/trainer.py", line 54, in main
[rank3]: Trainer(fabric, config).train_loop()
[rank3]: File "/workspace/naifu/common/trainer.py", line 316, in train_loop
[rank3]: self.on_post_training_batch()
[rank3]: File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
[rank3]: self.perform_sampling(is_last=is_last)
[rank3]: File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
[rank3]: os.makedirs(sampling_cfg.save_dir, exist_ok=True)
[rank3]: File "/usr/lib/python3.10/os.py", line 215, in makedirs
[rank3]: makedirs(head, exist_ok=exist_ok)
[rank3]: File "/usr/lib/python3.10/os.py", line 225, in makedirs
[rank3]: mkdir(name, mode)
[rank3]: OSError: [Errno 5] Input/output error: '/app/naifu555'
[rank: 3] Child process with PID 170 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/naifu/trainer.py", line 58, in <module>
[rank0]: main()
[rank0]: File "/workspace/naifu/trainer.py", line 54, in main
[rank0]: Trainer(fabric, config).train_loop()
[rank0]: File "/workspace/naifu/common/trainer.py", line 316, in train_loop
[rank0]: self.on_post_training_batch()
[rank0]: File "/workspace/naifu/common/trainer.py", line 47, in on_post_training_batch
[rank0]: self.perform_sampling(is_last=is_last)
[rank0]: File "/workspace/naifu/common/trainer.py", line 167, in perform_sampling
[rank0]: os.makedirs(sampling_cfg.save_dir, exist_ok=True)
[rank0]: File "/usr/lib/python3.10/os.py", line 215, in makedirs
[rank0]: makedirs(head, exist_ok=exist_ok)
[rank0]: File "/usr/lib/python3.10/os.py", line 225, in makedirs
[rank0]: mkdir(name, mode)
[rank0]: OSError: [Errno 5] Input/output error: '/app/naifu555'
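
For context, both tracebacks end in the `os.makedirs(sampling_cfg.save_dir, exist_ok=True)` call inside `perform_sampling`, so a single failed `mkdir` takes down every rank. One possible stopgap, sketched here under the assumption that the `Errno 5` is transient (for example a mounted volume that is briefly unavailable), is to retry the directory creation a few times before giving up. `makedirs_with_retry` is a hypothetical helper name and is not part of naifu; a persistent I/O error on the underlying storage would still need to be fixed at the filesystem level.

```python
import errno
import os
import time


def makedirs_with_retry(path, attempts=3, delay_s=1.0):
    """Create `path`, retrying when mkdir fails with EIO (Errno 5).

    Hypothetical helper, not part of naifu. Returns True if the
    directory could be created (or already existed), False if every
    attempt failed with EIO.
    """
    for attempt in range(attempts):
        try:
            os.makedirs(path, exist_ok=True)
            return True
        except OSError as exc:
            # Only EIO is treated as possibly transient; any other error
            # (permissions, read-only filesystem, ...) is re-raised.
            if exc.errno != errno.EIO:
                raise
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False
```

A caller could skip the sampling step when the helper returns `False` rather than letting the exception end the whole run, at the cost of missing sample images for that step.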