Logging behavior since GA fix #2004
Comments
Hey, just putting some quick thoughts down before I look into this in more detail tomorrow. When I was comparing results before and after for this PR, I noticed that the results after are better. Do you have wandb charts to compare?
Note: The results above are from
I think the training is fine, but the logged values are somehow wrong for SFT. I don't have a chart, but I have some values.
If I divide the values of the GA=8 run by 4, I get something very close to the GA=2 run, both at the start and at the end of training.
Can confirm this appears to be a strictly visual issue: eval (and testing afterwards) shows the model is learning as expected. I was using a GA of 4 and started each run with loss values in the 5-6 range, which, when divided by 4, matches my usual training runs. (SFT, Llama 8B)
Can confirm this. The actual loss should be divided by GA.
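In the meantime, the normalization described above amounts to dividing the logged value by the GA factor. A minimal sketch, assuming the logged loss is the sum over the accumulation micro-batches (the helper name and the numbers are illustrative, not axolotl's actual logging code):

```python
def normalize_logged_loss(logged_loss: float, ga_steps: int) -> float:
    """Rescale a loss summed over gradient-accumulation micro-batches
    back to a per-micro-batch average, so runs with different GA
    settings can be compared directly."""
    return logged_loss / ga_steps

# Illustrative values only: a GA=8 run logging ~7.9 corresponds to a
# per-micro-batch loss of ~0.99 after rescaling.
print(normalize_logged_loss(7.9071, 8))  # -> 0.9883875
```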
I ran some non-packing tests and couldn't reproduce this. Can someone provide an example config?
Edit: Added packing tests.
@NanoCode012 I think it's more about comparing across different tuners. For example, if I use another package such as Unsloth, the loss it reports is axolotl's loss divided by the number of GA steps, despite everything else being identical. As such, as others have mentioned, the loss in axolotl is not correct. I have a feeling that axolotl's loss is not being divided by the number of GA steps.
Updating transformers to 4.46.2 and liger to 0.4.0 fixed it for me.
@ccdv-ai, could you share how the logs look?
@jackswl, I'm running a few SFT trl tests for comparison, but would you perhaps have a comparison against Unsloth?
Sorry this took a while. This is the comparison between trl and axolotl SFT (trl runs have …). However, you can see that increasing the GA does not multiply the loss in axolotl. trl's loss also stays in roughly the same range when varying mbs and GA.
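The property being checked in that comparison can be stated as a small sanity check (hypothetical numbers and labels, not the actual runs above): for a fixed effective batch size, the logged loss should stay roughly constant as micro-batch size and GA are traded off, whereas a summed log would scale with GA.

```python
# Hypothetical logged losses at the same training step, for illustration only.
runs = {
    ("mbs=4", "ga=2"): 0.99,
    ("mbs=2", "ga=4"): 1.01,
    ("mbs=1", "ga=8"): 0.98,
}

# With correct (averaged) logging these agree to within noise; if losses were
# summed over GA micro-batches, the ga=8 run would sit roughly 4x above ga=2.
values = list(runs.values())
assert max(values) / min(values) < 1.2, "logged loss should not scale with GA"
```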
Please check that this issue hasn't been reported before.
Expected Behavior
Since the GA fix (#1980), logging no longer averages the loss and grad-norm values over accumulation steps; they are summed instead, which makes comparisons between different GA values difficult.
i.e., for 8 accumulation steps (a sketch illustrating the difference follows the example):
{'loss': 7.9071, 'grad_norm': 6.211667537689209, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
should be
{'loss': 0.988, 'grad_norm': 0.776, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
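To make the factor concrete, here is a minimal sketch of the difference between summing and averaging the per-micro-batch losses over one accumulation window (toy numbers, not axolotl's logging code):

```python
# Toy per-micro-batch losses for one optimizer step with GA = 8.
micro_batch_losses = [0.97, 1.01, 0.99, 0.98, 1.00, 0.99, 0.98, 0.99]

summed = sum(micro_batch_losses)             # ~7.9, the behaviour reported here
averaged = summed / len(micro_batch_losses)  # ~0.99, the expected logged value

print(f"summed={summed:.4f}  averaged={averaged:.4f}")
```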
Current behaviour
Loss and grad norm are summed over accumulation steps instead of averaged.
Steps to reproduce
Any training process
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
main
Acknowledgements