Logging behavior since GA fix #2004

Closed
ccdv-ai opened this issue Oct 30, 2024 · 10 comments
@ccdv-ai

ccdv-ai commented Oct 30, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Since the GA fix (#1980), logging does not average loss and grad norm values over accumulation steps; they are summed instead, which makes comparison between different values of GA difficult.
e.g., for 8 accumulation steps
{'loss': 7.9071, 'grad_norm': 6.211667537689209, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
should be
{'loss': 0.988, 'grad_norm': 0.776, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
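
A minimal sketch, added for illustration and not axolotl's actual logging code, of the difference between summing and averaging micro-batch losses over accumulation steps; the per-micro-batch values below are made up:

    # Illustrative sketch, not axolotl's logging code: with gradient accumulation,
    # each optimizer step sees `ga_steps` micro-batch losses. Logging the mean over
    # those micro-batches (rather than the sum) keeps runs with different GA comparable.
    ga_steps = 8
    micro_losses = [0.95, 1.01, 0.97, 1.02, 0.99, 0.98, 1.00, 0.98]  # made-up values

    summed = sum(micro_losses)      # what gets logged when losses are summed (~7.9)
    averaged = summed / ga_steps    # what should be logged (~0.99)
    print({"loss_summed": round(summed, 4), "loss_averaged": round(averaged, 4)})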

Current behaviour

Loss and grad norm are summed over accumulation steps instead of averaged.

Steps to reproduce

Any training process

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
ccdv-ai added the "bug" label on Oct 30, 2024
@NanoCode012
Collaborator

Hey, just putting some quick thoughts before I look into this in more detail tomorrow.

When I was comparing results before and after for this PR, I noticed that the results after are better. Do you have wandb charts to compare?

*-pre: before.

[wandb comparison charts]

Note: the results above are from completion training; I didn't compare SFT.

@ccdv-ai
Author

ccdv-ai commented Oct 30, 2024

I think the training is fine, but the logged values are somehow wrong for SFT. I don't have a chart but I have some values.
I use packed training with Qwen 2.5, batch of 262,144 tokens:

  • starting loss pre patch (GA=2) : 1.3564
  • starting loss post patch (GA=2) : 2.715
  • starting loss post patch (GA=8) : 10.848

If I divide the values of GA=8 by 4, I get something very close to GA=2 at the start and at the end of training.
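
A quick arithmetic check on these numbers (added here for clarity, not part of the original comment) is consistent with the logged loss scaling linearly with GA:

    # If the logged loss is summed over accumulation steps, it scales linearly with GA.
    loss_pre_ga2 = 1.3564   # pre-patch, GA=2 (averaged)
    loss_post_ga2 = 2.715   # post-patch, GA=2
    loss_post_ga8 = 10.848  # post-patch, GA=8

    print(loss_post_ga8 / 4)   # ~2.712, close to the post-patch GA=2 value
    print(loss_post_ga2 / 2)   # ~1.3575, close to the pre-patch GA=2 value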

@Gryphe

Gryphe commented Oct 31, 2024

Can confirm this appears to be a strictly visual issue: eval (and testing afterwards) shows the model is learning accordingly. I was using a GA of 4 and started each run with loss values in the 5-6 range, which, when divided, matches my usual training runs. (SFT, Llama 8B)

@jackswl

jackswl commented Nov 3, 2024

Can confirm this. The actual loss should be divided by GA.

@NanoCode012
Collaborator

NanoCode012 commented Nov 4, 2024

I ran some non-packing tests and couldn't reproduce this. Can someone provide an example config?

*-pre runs are from commit 1d6a5e2bd638778a42d757ff0cb600f918eb1c31.

[comparison charts: non-packing runs]

Edit: Added packing tests.

[comparison charts: packing runs]

@jackswl

jackswl commented Nov 4, 2024

@NanoCode012 I think it's more about comparing between different trainers. For example, if I use another package such as Unsloth, the loss is the axolotl loss divided by the number of GA steps, despite everything else being identical. As such, like others have mentioned, the loss logged in axolotl is not correct.

I have a feeling that the loss in axolotl is not divided by the number of GA steps.

@ccdv-ai
Author

ccdv-ai commented Nov 8, 2024

Updating transformers to 4.46.2 and liger to 0.4.0 fixed it for me.
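
A minimal sketch for checking installed versions against the ones mentioned above, assuming the PyPI package names are transformers and liger-kernel:

    # Print installed versions of the packages reported to fix the logging issue.
    # Package names on PyPI are assumed to be `transformers` and `liger-kernel`.
    from importlib.metadata import PackageNotFoundError, version

    for pkg, fixed_in in [("transformers", "4.46.2"), ("liger-kernel", "0.4.0")]:
        try:
            print(f"{pkg}: installed {version(pkg)}, reported fix uses {fixed_in}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")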

@NanoCode012
Collaborator

@ccdv-ai, could you share how the logs look?

@NanoCode012
Collaborator

@jackswl, I'm running a few SFT TRL tests for comparison, but would you perhaps have a comparison against Unsloth?

@NanoCode012
Collaborator

NanoCode012 commented Nov 19, 2024

@jackswl @ccdv-ai

Sorry this took a while. Here is the comparison between TRL and axolotl SFT (TRL runs have *-trl in their names). I tried to keep as many hyperparameters the same as possible, but there are still some differences in the handling of prompt masking, etc.

However, you can see that increasing GA does not multiply the loss in axolotl. TRL's loss also stays in the same range when varying micro batch size and GA.

[comparison chart: TRL vs axolotl SFT]
