
LORA training broken on Mistral Nemo. Massive loss values immediately. #2039

Closed · 6 of 8 tasks
Nero10578 opened this issue Nov 12, 2024 · 9 comments · Fixed by #2064
Assignees: bursteratom
Labels: bug (Something isn't working)

Comments

@Nero10578
Contributor

Nero10578 commented Nov 12, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

The loss value should be around 1.x even at the beginning. This worked fine in older commits of axolotl, but sadly I can't pinpoint which commit broke it; I only know that it is broken in the recent one.

Current behaviour

The loss value immediately starts at 23-24. Just before training Nemo, I was training Qwen2.5 32B Instruct, which works perfectly fine.

train/loss is really high (see attached screenshot).

eval/loss somehow looks alright (see attached screenshot).

grad/norm also looks really high (see attached screenshot).

Steps to reproduce

Train Mistral Nemo 12B Instruct using LoRA+. The loss value is immediately wrong. Originally I had RSLoRA and Liger kernels enabled, which worked for Qwen, but the loss was still broken here. So I disabled them so that the config matches what worked for Mistral a few weeks ago, except for using chat_templates. That is the example config I am showing here, and it is still broken, so something is fundamentally broken with Mistral Nemo training now.

In order to get Mistral to work with the new chat_templates, I also had to modify the chat template in the model's tokenizer_config.json so that it doesn't throw errors when the order of the conversation isn't exactly what Mistral expects. Previously, with the sharegpt/fastchat method, this was never a problem. The chat template in the tokenizer config is changed to this:

{%- for message in messages %}{%- if message['role'] == 'system' -%}{{- message['content'] -}}{%- else -%}{%- if message['role'] == 'user' -%}{{-'[INST] ' + message['content'].rstrip() + ' [/INST]'-}}{%- else -%}{{-'' + message['content'] + '</s>' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{-''-}}{%- endif -%}

Then just run preprocess with the --debug option to tokenize the dataset first, and then run the training; a quick sanity check of the modified template is sketched below.
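
A minimal sketch for sanity-checking the modified template before preprocessing, assuming the model path from the config below; the sample conversation is made up for illustration:

# Sketch: render a short sharegpt-style conversation with the modified template
# stored in tokenizer_config.json. The messages below are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/home/user/models/Mistral-Nemo-Instruct-2407"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]

# With the modified template this should print the rendered prompt instead of
# raising a role-ordering error the way the stricter stock Mistral template can.
print(tokenizer.apply_chat_template(messages, tokenize=False))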

Config yaml

base_model: /home/user/models/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM

train_on_inputs: false
group_by_length: false
load_in_8bit:
load_in_4bit: false
strict: false
sequence_len: 8192
bf16: auto
flash_attention: true

shuffle_merged_datasets: true

#Data
datasets:
  - path: /home/user/datasets/conversations-escaped.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value

warmup_steps: 10
dataset_prepared_path: ./lora_last_run_prepared

# Iterations
num_epochs: 1
saves_per_epoch: 8
saves_total_limit: 8

# Evaluation
val_set_size: 0.0025
eval_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 8
eval_table_size:

# LoRA
output_dir: ./lora_out
adapter: lora
lora_model_dir:
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

save_safetensors: true

loraplus_lr_ratio: 16

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 16
micro_batch_size: 1
gradient_checkpointing: unsloth

# wandb
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: mistral-nemo
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: nemo-8192
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training

# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00002

# Misc
auto_resume_from_checkpoints: true
logging_steps: 1
weight_decay: 0.0

special_tokens:
  pad_token: <pad>

# Multi-GPU
deepspeed:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

d356740

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
Nero10578 added the bug label on Nov 12, 2024
@Nero10578
Contributor Author

Nero10578 commented Nov 12, 2024

I guess it is related to issue #2004, but somehow Qwen behaves perfectly fine. I also just tested Llama 3.1 8B and it works perfectly fine.

bursteratom self-assigned this on Nov 12, 2024
@winglian
Collaborator

I believe this will be resolved with transformers 4.46.2, but that version doesn't work with gradient accumulation and FSDP

@bursteratom
Collaborator

@winglian we can wait till 4.47 comes out, right? Hopefully by then they will have incorporated your fix.

@Nero10578
Contributor Author

Is this broken functionally or does it still train normally?

@winglian
Collaborator

I've seen reports that even when the loss values are scaled, it doesn't learn properly, IIRC.

@Nero10578
Contributor Author

Will just wait for the fix then. :( Thanks.

@e-p-armstrong

I've also had massive loss values recently when training old Mistral v0.2 models with high gradient accumulation. It looks like it's not actually my fault but rather an issue with transformers? Do I understand this correctly? So the solution might be to roll back versions?

@winglian
Collaborator

There's an upstream transformers fix waiting to be merged, but in the meantime people have reported that 4.46.2 resolves the issue. I'll merge #2064 as a workaround for now.
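
For reference, a minimal sketch of a pre-flight version check along those lines, assuming the 4.46.2 threshold mentioned above:

# Sketch: refuse to start a run on a transformers version reported to produce
# the inflated LoRA loss values discussed in this issue.
from packaging import version
import transformers

MIN_OK = "4.46.2"  # threshold reported above; adjust once the upstream fix lands
if version.parse(transformers.__version__) < version.parse(MIN_OK):
    raise RuntimeError(
        f"transformers {transformers.__version__} is affected; "
        f"upgrade to {MIN_OK} or newer before training."
    )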

@Nero10578
Contributor Author

There's an upstream transformers fix waiting to be merged, but in the meantime people have reported that 4.46.2 resolves the issue. I'll merge #2064 as a workaround for now.

Thanks!
