
ORPO, DPO don't work with Mixtral-8x22B (FSDP + QLORA & bigstral-ds-zero3) #1534

Open

0-hero opened this issue Apr 18, 2024 · 1 comment
Labels: bug
0-hero commented Apr 18, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

DPO/ORPO training should run successfully

Current behaviour

Models tested (neither has any issues with inference)

  • mistral-community/Mixtral-8x22B-v0.1
  • 0-hero/Matter-0.2-8x22B (FSDP + QLORA finetune of the above)

Machines tested (tried each type from multiple providers)

  • 8xA100 (80GB)
  • 8xH100 (SXM)

Images tested

  • winglian/axolotl-runpod:main-latest (as of 18 Apr 2024)
  • winglian/axolotl-runpod:main-py3.11-cu121-2.1.2
  • winglian/axolotl-runpod:main-py3.10-cu118-2.1.2
  • runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04

Ran tests with all the combinations mentioned above

Both issues mentioned below happen for both ORPO & DPO

Issue 1 - FSDP + QLORA

#1494

Issue 2 - bigstral-ds-zero3

This happens at some point within the first 20 steps. Reducing both of the values below to 1 did not help; the issue persists.

gradient_accumulation_steps: 1
micro_batch_size: 1

Training hangs and eventually stops with an NCCL timeout (huggingface/accelerate#314)
GPU utilization also drops once it hangs; example below
[Screenshot, 18 Apr 2024 9:44 AM: GPU utilization after the hang]
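
Not part of the original report, but a debugging aid that may help localize the hang: NCCL and PyTorch expose standard environment variables for verbose logging and for turning silent stalls into hard errors. The values below are only a suggestion and can be exported before launching either run.

# Debugging sketch (assumption, not from the report): surface per-rank NCCL
# activity and fail fast on collective errors instead of hanging until timeout.
export NCCL_DEBUG=INFO               # verbose NCCL logs from every rank
export NCCL_DEBUG_SUBSYS=INIT,COLL   # limit log volume to init + collectives
export NCCL_ASYNC_ERROR_HANDLING=1   # older torch releases; newer ones read TORCH_NCCL_ASYNC_ERROR_HANDLING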

Steps to reproduce

Start training with either of the configs below
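
The exact launch command was not included in the report; a typical axolotl invocation for these runs would look like the sketch below (the config filenames are placeholders for the two configs pasted further down).

# Placeholder filenames; both runs are launched the same way via accelerate.
accelerate launch -m axolotl.cli.train fsdp-qlora-dpo.yml         # FSDP + QLORA config
accelerate launch -m axolotl.cli.train bigstral-zero3-orpo.yml    # bigstral-ds-zero3 config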

FSDP + QLORA config

base_model: mistral-community/Mixtral-8x22B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

rl: dpo
datasets:
  - path: argilla/ultrafeedback-binarized-preferences-cleaned
    split: train
    type: chatml.ultra

dpo_beta: 0.1

chat_template: chatml
default_system_message: You are a helpful assistant

dataset_prepared_path: data
val_set_size: 0
output_dir: output

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
  - embed_tokens
  - lm_head

gradient_accumulation_steps: 8
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_transformer_layer_cls_to_wrap: MixtralSparseMoeBlock
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP

special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|begin_func|>"
  - "<|end_func|>"
  - "<|begin_func_response|>"
  - "<|end_func_response|>"
  - "<|im_start|>"
  - "<|im_end|>"

bigstral-ds-zero3 config

base_model: 0-hero/Matter-0.2-8x22B
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

unfrozen_parameters:
  - ^lm_head.weight$
  - ^model.embed_tokens.weight$
  - model.layers.4[4-9]+.block_sparse_moe.gate
  - model.layers.4[4-9]+.block_sparse_moe.experts
  - model.layers.5[0-5]+.block_sparse_moe.gate
  - model.layers.5[0-5]+.block_sparse_moe.experts

model_config:
  output_router_logits: true

rl: orpo
datasets:
  - path: mlabonne/orpo-mix-40k
    split: train
    type: orpo.chat_template

chat_template: chatml
default_system_message: You are a helpful assistant

dataset_prepared_path: data
val_set_size: 0
output_dir: output

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

gradient_accumulation_steps: 8
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
save_total_limit: 1
save_steps:
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|begin_func|>"
  - "<|end_func|>"
  - "<|begin_func_response|>"
  - "<|end_func_response|>"
  - "<|im_start|>"
  - "<|im_end|>"

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10, 3.11

axolotl branch-commit

main/0eadfc8

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
0-hero added the bug label on Apr 18, 2024

0-hero commented Apr 18, 2024

@winglian raised as a new issue, as mentioned in the other discussion
