Optimizing Training Efficiency by Running Checkpoint Saving in a Separate Thread #312
binhphamthanh started this conversation in Ideas
Hello @SWivid and @lpscr,
In my experience with the training process, the steps themselves run quite smoothly, but the `save_checkpoint` phase takes a significant amount of time, which affects the overall speed and duration of training.

Reviewing the code in trainer.py, specifically around line 293, I'm considering whether we could run this part in a separate thread. That way, the `save_checkpoint` process could operate independently and the training loop would not have to pause. I haven't delved deeply into the code yet, but I'd like to know your thoughts on whether this approach would be feasible; a rough sketch of the idea is below.
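A minimal sketch, assuming a standard PyTorch setup where a checkpoint is a `state_dict` written with `torch.save`; the helper name `save_checkpoint_async` and the `model`/`optimizer`/`path` arguments are illustrative, not the actual trainer.py API:

```python
import threading

import torch


def _to_cpu(obj):
    """Recursively detach tensors and copy them to CPU memory."""
    if torch.is_tensor(obj):
        return obj.detach().cpu().clone()
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj


def save_checkpoint_async(model, optimizer, step, path):
    """Snapshot state on the calling thread, write it to disk on another."""
    # The CPU copy happens here, on the training thread, so the writer
    # thread never races with parameter updates from the next step.
    state = {
        "model_state_dict": _to_cpu(model.state_dict()),
        "optimizer_state_dict": _to_cpu(optimizer.state_dict()),
        "step": step,
    }

    # Only the slow disk write runs in the background.
    writer = threading.Thread(target=torch.save, args=(state, path))
    writer.start()
    return writer  # join() before the next save (or at shutdown) to be safe
```

The key design point is that only the disk write, which releases the GIL and usually dominates the cost, moves off the training thread; the CPU-side snapshot still happens synchronously, so the saved state cannot be corrupted by the next optimizer step.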
Thank you for your insights!
Replies: 1 comment

Generally, saving with a larger interval will reduce the overhead, which is fine.
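To put a rough number on that suggestion, here is a self-contained toy comparison; the per-step and per-save costs are made-up stand-ins, not measurements from this project:

```python
import time


def fake_train_step():
    time.sleep(0.001)  # stand-in for one optimizer update


def fake_save_checkpoint():
    time.sleep(0.5)  # stand-in for one slow checkpoint write


def run(total_steps, save_interval):
    """Return wall-clock time for a loop that saves every `save_interval` steps."""
    start = time.time()
    for step in range(1, total_steps + 1):
        fake_train_step()
        if step % save_interval == 0:
            fake_save_checkpoint()
    return time.time() - start


print(f"interval 100:  {run(2000, 100):.1f}s")   # 20 saves -> roughly 12 s total
print(f"interval 1000: {run(2000, 1000):.1f}s")  # 2 saves  -> roughly 3 s total
```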