Logging behavior since GA fix #2004

Closed
ccdv-ai opened this issue Oct 30, 2024 · 10 comments
@ccdv-ai

ccdv-ai commented Oct 30, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Since the GA fix (#1980), logging does not average loss and grad norm values over accumulation steps; they are summed instead, which makes comparison between different values of GA difficult.
e.g., for 8 accumulation steps
{'loss': 7.9071, 'grad_norm': 6.211667537689209, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
should be
{'loss': 0.988, 'grad_norm': 0.776, 'learning_rate': 4.524625433624047e-05, 'epoch': 0.52}
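
A minimal sketch, added for illustration and not axolotl's actual logging code, of the difference between summing and averaging micro-batch losses over accumulation steps; the per-micro-batch values below are made up:

    # Illustrative sketch, not axolotl's logging code: with gradient accumulation,
    # each optimizer step sees `ga_steps` micro-batch losses. Logging the mean over
    # those micro-batches (rather than the sum) keeps runs with different GA comparable.
    ga_steps = 8
    micro_losses = [0.95, 1.01, 0.97, 1.02, 0.99, 0.98, 1.00, 0.98]  # made-up values

    summed = sum(micro_losses)      # what gets logged when losses are summed (~7.9)
    averaged = summed / ga_steps    # what should be logged (~0.99)
    print({"loss_summed": round(summed, 4), "loss_averaged": round(averaged, 4)})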

Current behaviour

Loss and grad norm are summed over accumulation steps instead of averaged.

Steps to reproduce

Any training process

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
ccdv-ai added the "bug" label on Oct 30, 2024
@NanoCode012
Collaborator

Hey, just putting some quick thoughts before I look into this in more detail tomorrow.

When I was comparing results before and after for this PR, I noticed that the results after are better. Do you have wandb charts to compare?

*-pre: before.

[wandb comparison charts]

Note: the results above are from completion training; I didn't compare SFT.

@ccdv-ai
Author

ccdv-ai commented Oct 30, 2024

I think the training is fine, but the logged values are somehow wrong for SFT. I don't have a chart but I have some values.
I use packed training with Qwen 2.5, batch of 262,144 tokens:

  • starting loss pre patch (GA=2) : 1.3564
  • starting loss post patch (GA=2) : 2.715
  • starting loss post patch (GA=8) : 10.848

If I divide the values of GA=8 by 4, I get something very close to GA=2 at the start and at the end of training.
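
A quick arithmetic check on these numbers (added here for clarity, not part of the original comment) is consistent with the logged loss scaling linearly with GA:

    # If the logged loss is summed over accumulation steps, it scales linearly with GA.
    loss_pre_ga2 = 1.3564   # pre-patch, GA=2 (averaged)
    loss_post_ga2 = 2.715   # post-patch, GA=2
    loss_post_ga8 = 10.848  # post-patch, GA=8

    print(loss_post_ga8 / 4)   # ~2.712, close to the post-patch GA=2 value
    print(loss_post_ga2 / 2)   # ~1.3575, close to the pre-patch GA=2 value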

@Gryphe

Gryphe commented Oct 31, 2024

Can confirm this appears to be a strictly visual issue: eval (and testing afterwards) shows the model is learning accordingly. I was using a GA of 4 and started each run with loss values in the 5-6 range, which, when divided, matches my usual training runs. (SFT, Llama 8B)

@jackswl

jackswl commented Nov 3, 2024

Can confirm this. The actual loss should be divided by GA.

@NanoCode012
Collaborator

NanoCode012 commented Nov 4, 2024

I ran some non-packing tests and couldn't reproduce this. Can someone provide an example config?

*-pre runs are from commit 1d6a5e2bd638778a42d757ff0cb600f918eb1c31.

[comparison charts: non-packing runs]

Edit: Added packing tests.

[comparison charts: packing runs]

@jackswl

jackswl commented Nov 4, 2024

@NanoCode012 I think it's more about comparing between different trainers. For example, if I use another package such as Unsloth, the loss is the axolotl loss divided by the number of GA steps, despite everything else being identical. As such, like others have mentioned, the loss logged in axolotl is not correct.

I have a feeling that the loss in axolotl is not divided by the number of GA steps.

@ccdv-ai
Author

ccdv-ai commented Nov 8, 2024

Updating transformers to 4.46.2 and liger to 0.4.0 fixed it for me.
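
A minimal sketch for checking installed versions against the ones mentioned above, assuming the PyPI package names are transformers and liger-kernel:

    # Print installed versions of the packages reported to fix the logging issue.
    # Package names on PyPI are assumed to be `transformers` and `liger-kernel`.
    from importlib.metadata import PackageNotFoundError, version

    for pkg, fixed_in in [("transformers", "4.46.2"), ("liger-kernel", "0.4.0")]:
        try:
            print(f"{pkg}: installed {version(pkg)}, reported fix uses {fixed_in}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")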

@NanoCode012
Collaborator

@ccdv-ai, could you share how the logs look?

@NanoCode012
Collaborator

@jackswl, I'm running a few SFT TRL tests for comparison, but would you perhaps have a comparison against Unsloth?

@NanoCode012
Collaborator

NanoCode012 commented Nov 19, 2024

@jackswl @ccdv-ai

Sorry this took a while. Here is the comparison between TRL and axolotl SFT (TRL runs have *-trl in their names). I tried to keep as many hyperparameters the same as possible, but there are still some differences in the handling of prompt masking, etc.

However, you can see that increasing GA does not multiply the loss in axolotl. TRL's loss also stays in the same range when varying micro batch size and GA.

[comparison chart: TRL vs axolotl SFT]
