Hi, I recently tested the training script with `--checkpointing_epoch=1` and `gradient_accumulation_steps=1`. My dataset has 10k samples and the batch size is 5, so one epoch is 10,000 / 5 = 2,000 steps, and the script should be saving a checkpoint every 2000 steps. However, the checkpoint folder created is `checkpoint-4000`, and it is saved when the progress bar shows 4000 steps.
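For reference, here is the arithmetic behind the expected step count as a rough sketch. The helper `steps_per_epoch` is a hypothetical name used only for illustration; it is not taken from the training script and may not match how the script actually counts steps.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, grad_accum_steps: int = 1) -> int:
    """Expected optimizer steps in one epoch (hypothetical helper, not from the script)."""
    # Number of batches the dataloader yields, counting a final partial batch.
    batches = math.ceil(num_samples / batch_size)
    # One optimizer step is taken every `grad_accum_steps` batches.
    return math.ceil(batches / grad_accum_steps)


# Values reported above: 10k samples, batch size 5, gradient_accumulation_steps=1.
expected = steps_per_epoch(num_samples=10_000, batch_size=5, grad_accum_steps=1)
observed = 4000  # step number in the checkpoint folder name (checkpoint-4000)

print(expected)             # 2000
print(observed / expected)  # 2.0
```

With these numbers the first checkpoint lands at twice the expected step count, which is the discrepancy being reported.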