(OOM) FSDP+QLora 2*RTX3090 (24G per card) finetuning on 70b Llama2 #1522

Open

yaohwang opened this issue Apr 16, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@yaohwang

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Expecting no OOM.

With #1494 fixed, I have verified that 7b Llama now works with FSDP+QLora on axolotl.

However, Answer.AI's FSDP+QLora did work with 70b Llama when I tested it (on the same 2*RTX3090), so I expect this to work with axolotl's FSDP+QLora too.

Current behaviour

Both ranks hit CUDA OOM while moving the 4-bit model onto its GPU; the traceback (identical on both ranks) and both error messages:

```
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 87, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 799, in load_model
    model.to(f"cuda:{cfg.local_rank}")
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 5 more times]
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 318, in to
    new_param = Params4bit(super().to(device=device, dtype=dtype, non_blocking=non_blocking),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 1 has a total capacity of 23.69 GiB of which 6.94 MiB is free. Process 156255 has 23.68 GiB memory in use. Of the allocated memory 22.54 GiB is allocated by PyTorch, and 20.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 49.00 MiB is free. Process 156254 has 23.63 GiB memory in use. Of the allocated memory 22.43 GiB is allocated by PyTorch, and 67.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
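The error text itself points at allocator fragmentation via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of setting it before launch (the 128 MiB value is an arbitrary example, and this only mitigates fragmentation, not a genuine capacity shortfall):

```sh
# Sketch: tune the CUDA caching allocator before launching (value is illustrative)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
accelerate launch -m axolotl.cli.train examples/llama-2/qlora-fsdp.yml
```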

Steps to reproduce

accelerate launch -m axolotl.cli.train examples/llama-2/qlora-fsdp.yml

with examples/llama-2/qlora-fsdp.yml changed to base_model: NousResearch/Llama-2-70b-chat-hf and batch size 1 (see the sketch below).
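For clarity, the changes described above amount to roughly the following on top of examples/llama-2/qlora-fsdp.yml (a sketch showing only the changed keys):

```yaml
# Changed keys only (sketch); everything else as in examples/llama-2/qlora-fsdp.yml
base_model: NousResearch/Llama-2-70b-chat-hf
micro_batch_size: 1   # "batch size 1" from the report
```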

Config yaml

ref: examples/llama-2/qlora-fsdp.yml

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main/4d6490b

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@yaohwang yaohwang added the bug Something isn't working label Apr 16, 2024
@winglian
Collaborator

try changing these settings

micro_batch_size: 1
optimizer: paged_adamw_8bit

@yaohwang
Author

try changing these settings

micro_batch_size: 1
optimizer: paged_adamw_8bit

Thanks for your help, but I still get the same error. Here is the full config:

```yaml
base_model: NousResearch/Llama-2-70b-chat-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: yahma/alpaca-cleaned
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.00001

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
special_tokens:
```
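For what it's worth, the FSDP block above keeps sharded parameters on the GPUs (fsdp_offload_params: false). One common way to trade GPU memory for system RAM is CPU offload; a hedged sketch of that variant (an assumption, not a verified fix for 70b on this setup):

```yaml
# Sketch: same fsdp/fsdp_config as above, but with parameter offload enabled
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true   # assumption: offload sharded params to CPU RAM
  fsdp_cpu_ram_efficient_loading: true
  # remaining fsdp_config keys unchanged from the config above
```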

@yaohwang yaohwang reopened this Apr 17, 2024
@winglian
Collaborator

How much CPU memory do you have? Keep in mind that offloading 70B Llama-2 requires 128GB of system/CPU RAM.

@yaohwang
Author

How much CPU memory do you have? Keep in mind that offloading 70B Llama-2 requires 128GB of system/CPU RAM.

Yeah, that's it: 128GB RAM and 2x 24G RTX3090. I tested 70b Llama2 on https://github.com/AnswerDotAI/fsdp_qlora before using axolotl, with the same environment, and it worked.

So I'm expecting axolotl's FSDP+QLORA to make the same thing work.

And thanks, man, you are doing a great job!
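For anyone reproducing this, the host and GPU memory described above can be double-checked with standard tools (a hypothetical session; output will vary by machine):

```sh
# Expect ~128GB total system RAM for this setup
free -h
# Expect two ~24GB RTX 3090s
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```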

@orgmast5

@yaohwang @winglian any updates?
