
LORA training broken on Mistral Nemo. Massive loss values immediately. #2039

Closed · 6 of 8 tasks
Nero10578 opened this issue Nov 12, 2024 · 9 comments · Fixed by #2064
Assignees: bursteratom
Labels: bug (Something isn't working)

Comments

@Nero10578
Contributor

Nero10578 commented Nov 12, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

The loss value should be around 1.x even at the beginning. This worked fine in older commits of axolotl, but sadly I can't pinpoint which commit broke it; I only know that it is broken in the recent one.

Current behaviour

The loss value immediately starts at 23-24. Just before training Nemo, I was training Qwen2.5 32B Instruct, which works perfectly fine.

train/loss is really high (see attached screenshot).

eval/loss somehow looks alright (see attached screenshot).

grad/norm also looks really high (see attached screenshot).

Steps to reproduce

Train Mistral Nemo 12B Instruct using LoRA+. The loss value is immediately wrong. Originally I had RSLoRA and Liger kernels enabled, which worked for Qwen, but the loss was still broken here. So I disabled them so that the config matches what worked for Mistral a few weeks ago, except for using chat_templates. That is the example config I am showing here, and it is still broken, so something is fundamentally broken with Mistral Nemo training now.

In order to get Mistral to work with the new chat_templates, I also had to modify the chat template in the model's tokenizer_config.json so that it doesn't throw errors when the order of the conversation isn't exactly what Mistral expects. Previously, with the sharegpt/fastchat method, this was never a problem. The chat template in the tokenizer config is changed to this:

{%- for message in messages %}{%- if message['role'] == 'system' -%}{{- message['content'] -}}{%- else -%}{%- if message['role'] == 'user' -%}{{-'[INST] ' + message['content'].rstrip() + ' [/INST]'-}}{%- else -%}{{-'' + message['content'] + '</s>' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{-''-}}{%- endif -%}

Then just run preprocess with the --debug option to tokenize the dataset first, and then run the training; a quick sanity check of the modified template is sketched below.
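
A minimal sketch for sanity-checking the modified template before preprocessing, assuming the model path from the config below; the sample conversation is made up for illustration:

# Sketch: render a short sharegpt-style conversation with the modified template
# stored in tokenizer_config.json. The messages below are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/home/user/models/Mistral-Nemo-Instruct-2407"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]

# With the modified template this should print the rendered prompt instead of
# raising a role-ordering error the way the stricter stock Mistral template can.
print(tokenizer.apply_chat_template(messages, tokenize=False))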

Config yaml

base_model: /home/user/models/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM

train_on_inputs: false
group_by_length: false
load_in_8bit:
load_in_4bit: false
strict: false
sequence_len: 8192
bf16: auto
flash_attention: true

shuffle_merged_datasets: true

#Data
datasets:
  - path: /home/user/datasets/conversations-escaped.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value

warmup_steps: 10
dataset_prepared_path: ./lora_last_run_prepared

# Iterations
num_epochs: 1
saves_per_epoch: 8
saves_total_limit: 8

# Evaluation
val_set_size: 0.0025
eval_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 8
eval_table_size:

# LoRA
output_dir: ./lora_out
adapter: lora
lora_model_dir:
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

save_safetensors: true

loraplus_lr_ratio: 16

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 16
micro_batch_size: 1
gradient_checkpointing: unsloth

# wandb
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: mistral-nemo
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: nemo-8192
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training

# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00002

# Misc
auto_resume_from_checkpoints: true
logging_steps: 1
weight_decay: 0.0

special_tokens:
  pad_token: <pad>

# Multi-GPU
deepspeed:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

d356740

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
Nero10578 added the bug label on Nov 12, 2024
@Nero10578
Contributor Author

Nero10578 commented Nov 12, 2024

I guess it is related to issue #2004, but somehow Qwen behaves perfectly fine. I also just tested Llama 3.1 8B and it works perfectly fine.

bursteratom self-assigned this on Nov 12, 2024
@winglian
Collaborator

I believe this will be resolved with transformers 4.46.2, but that version doesn't work with gradient accumulation and FSDP

@bursteratom
Collaborator

@winglian we can wait till 4.47 comes out, right? Hopefully by then they will have incorporated your fix.

@Nero10578
Contributor Author

Is this broken functionally or does it still train normally?

@winglian
Collaborator

I've seen reports that even when the loss values are scaled, it doesn't learn properly, IIRC.

@Nero10578
Contributor Author

Will just wait for the fix then. :( Thanks.

@e-p-armstrong

I've also had massive loss values recently when training old Mistral v0.2 models with high gradient accumulation. It looks like it's not actually my fault but rather an issue with transformers? Do I understand this correctly? So the solution might be to roll back versions?

@winglian
Collaborator

There's an upstream transformers fix waiting to be merged, but in the meantime people have reported that 4.46.2 resolves the issue. I'll merge #2064 as a workaround for now.
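
For reference, a minimal sketch of a pre-flight version check along those lines, assuming the 4.46.2 threshold mentioned above:

# Sketch: refuse to start a run on a transformers version reported to produce
# the inflated LoRA loss values discussed in this issue.
from packaging import version
import transformers

MIN_OK = "4.46.2"  # threshold reported above; adjust once the upstream fix lands
if version.parse(transformers.__version__) < version.parse(MIN_OK):
    raise RuntimeError(
        f"transformers {transformers.__version__} is affected; "
        f"upgrade to {MIN_OK} or newer before training."
    )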

@Nero10578
Contributor Author

There's an upstream transformers fix waiting to be merged, but in the meantime people have reported that 4.46.2 resolves the issue. I'll merge #2064 as a workaround for now.

Thanks!
