
ORPO, DPO don't work with Mixtral-8x22B (FSDP + QLORA & bigstral-ds-zero3) #1534

Open

0-hero opened this issue Apr 18, 2024 · 1 comment
Labels: bug
0-hero commented Apr 18, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

DPO/ORPO training should run successfully

Current behaviour

Models tested (neither has any issues with inference)

  • mistral-community/Mixtral-8x22B-v0.1
  • 0-hero/Matter-0.2-8x22B (FSDP + QLORA finetune of the above)

Machines tested (tried each type from multiple providers)

  • 8xA100 (80GB)
  • 8xH100 (SXM)

Images tested

  • winglian/axolotl-runpod:main-latest (as of 18 Apr 2024)
  • winglian/axolotl-runpod:main-py3.11-cu121-2.1.2
  • winglian/axolotl-runpod:main-py3.10-cu118-2.1.2
  • runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04

Ran tests with all the combinations mentioned above

Both issues mentioned below happen for both ORPO & DPO

Issue 1 - FSDP + QLORA

#1494

Issue 2 - bigstral-ds-zero3

This happens at some point within the first 20 steps. Reducing both of the values below to 1 did not help; the issue persists.

gradient_accumulation_steps: 1
micro_batch_size: 1

Training hangs and eventually stops with an NCCL timeout (huggingface/accelerate#314)
GPU utilization also drops once it hangs; example below
[Screenshot, 18 Apr 2024 9:44 AM: GPU utilization after the hang]
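
Not part of the original report, but a debugging aid that may help localize the hang: NCCL and PyTorch expose standard environment variables for verbose logging and for turning silent stalls into hard errors. The values below are only a suggestion and can be exported before launching either run.

# Debugging sketch (assumption, not from the report): surface per-rank NCCL
# activity and fail fast on collective errors instead of hanging until timeout.
export NCCL_DEBUG=INFO               # verbose NCCL logs from every rank
export NCCL_DEBUG_SUBSYS=INIT,COLL   # limit log volume to init + collectives
export NCCL_ASYNC_ERROR_HANDLING=1   # older torch releases; newer ones read TORCH_NCCL_ASYNC_ERROR_HANDLING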

Steps to reproduce

Start training with either of the configs below
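
The exact launch command was not included in the report; a typical axolotl invocation for these runs would look like the sketch below (the config filenames are placeholders for the two configs pasted further down).

# Placeholder filenames; both runs are launched the same way via accelerate.
accelerate launch -m axolotl.cli.train fsdp-qlora-dpo.yml         # FSDP + QLORA config
accelerate launch -m axolotl.cli.train bigstral-zero3-orpo.yml    # bigstral-ds-zero3 config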

FSDP + QLORA config

base_model: mistral-community/Mixtral-8x22B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

rl: dpo
datasets:
  - path: argilla/ultrafeedback-binarized-preferences-cleaned
    split: train
    type: chatml.ultra

dpo_beta: 0.1

chat_template: chatml
default_system_message: You are a helpful assistant

dataset_prepared_path: data
val_set_size: 0
output_dir: output

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
  - embed_tokens
  - lm_head

gradient_accumulation_steps: 8
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_transformer_layer_cls_to_wrap: MixtralSparseMoeBlock
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP

special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|begin_func|>"
  - "<|end_func|>"
  - "<|begin_func_response|>"
  - "<|end_func_response|>"
  - "<|im_start|>"
  - "<|im_end|>"

bigstral-ds-zero3 config

base_model: 0-hero/Matter-0.2-8x22B
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

unfrozen_parameters:
  - ^lm_head.weight$
  - ^model.embed_tokens.weight$
  - model.layers.4[4-9]+.block_sparse_moe.gate
  - model.layers.4[4-9]+.block_sparse_moe.experts
  - model.layers.5[0-5]+.block_sparse_moe.gate
  - model.layers.5[0-5]+.block_sparse_moe.experts

model_config:
  output_router_logits: true

rl: orpo
datasets:
  - path: mlabonne/orpo-mix-40k
    split: train
    type: orpo.chat_template

chat_template: chatml
default_system_message: You are a helpful assistant

dataset_prepared_path: data
val_set_size: 0
output_dir: output

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

gradient_accumulation_steps: 8
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
save_total_limit: 1
save_steps:
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|begin_func|>"
  - "<|end_func|>"
  - "<|begin_func_response|>"
  - "<|end_func_response|>"
  - "<|im_start|>"
  - "<|im_end|>"

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10, 3.11

axolotl branch-commit

main/0eadfc8

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
0-hero added the bug label on Apr 18, 2024

0-hero commented Apr 18, 2024

@winglian raised as a new issue, as mentioned in the other discussion
