8-bit precision error with fine tuning of gemma #1355

Closed
smreddy05 opened this issue Feb 22, 2024 · 9 comments
@smreddy05

I am trying to fine-tune gemma-7b on 4 A100 80 GB GPUs using 4-bit quantization:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "google/gemma-7b"

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

print("initiating model download")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=access_token,
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=15,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=100,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    seed=42,
    eval_steps=100,
    lr_scheduler_type="cosine",
    evaluation_strategy="epoch",
    disable_tqdm=False,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    run_name=run_name,  # disable tqdm since with packing values are incorrect
)

from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=generate_prompt,  # this will apply the create_prompt mapping to all training and test datasets
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()

This is throwing: "ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}`"

The same script works for other models like llama2.

Versions used:
transformers: 4.38.1
trl: 0.7.11

@younesbelkada (Contributor)

Hi @smreddy05
Thanks for the issue!
Can you try out the solution proposed here: #1348 (comment)?
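
For context, the error message itself points at loading the quantized model entirely on the current process's device rather than sharding it with device_map="auto". A minimal sketch of that change, reusing model_id, bnb_config and access_token from the script above (#1348 may recommend a slightly different variant):

```python
import torch
from transformers import AutoModelForCausalLM

# Place the whole quantized model on the GPU assigned to this process,
# instead of sharding it across GPUs with device_map="auto".
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
    token=access_token,
)
```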

@smreddy05 (Author)

@younesbelkada thanks for your suggestion; I am now hitting a new issue:
"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.86 GiB. GPU 0 has a total capacity of 79.15 GiB of which 5.08 GiB is free. Process 73494 has 74.06 GiB memory in use. Of the allocated memory 69.76 GiB is allocated by PyTorch, and 2.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

@younesbelkada (Contributor)

@smreddy05
now you're facing a CUDA OOM issue; can you try to use Flash Attention 2 or decrease the max_seq_len / batch_size?
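
Flash Attention 2 is already enabled in the script above via attn_implementation="flash_attention_2", so the remaining levers are the packing length and the batch size. An illustrative, untuned change to the former:

```python
# Shorter packed sequences cut activation memory; 1024 is only an illustrative value.
max_seq_length = 1024  # was 2048 in the original script
```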

@smreddy05 (Author)

Hey @younesbelkada, I was using Flash Attention from the moment I first faced the 8-bit precision error, and I tried reducing batch_size, but I am still hitting the same issue, while the same code works for llama2. Not sure what's wrong with this. I will give it a try with previous versions of trl and accelerate.
Also, I am using 4-bit quantization but the error talks about 8-bit precision. Am I missing something here? Can you please share your thoughts on this? I really appreciate your help with this.

@younesbelkada (Contributor)

I suspect the reason why it worked for llama-2 is that llama has 6.74B parameters

[Screenshot: llama-2 parameter count]

Whereas gemma-7b has in reality ~8.5B parameters

[Screenshot: gemma-7b parameter count]

You can also use gradient accumulation with a very small batch size, as sketched below. For the error you are getting, you need to update accelerate: `pip install -U accelerate`.
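
A sketch of the "very small batch size plus gradient accumulation" suggestion applied to the TrainingArguments above. The values are illustrative; on 4 GPUs the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × 4, i.e. 1 × 16 × 4 = 64, the same as the original 8 × 2 × 4:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,           # as in the original script
    per_device_train_batch_size=1,   # was 8: one packed sequence per GPU per step
    gradient_accumulation_steps=16,  # was 2: keeps the effective batch size at 64 on 4 GPUs
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    bf16=True,
    # ... remaining arguments unchanged from the original script ...
)
```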

@smreddy05 (Author) commented Feb 28, 2024

@younesbelkada sorry for not being clear, I was referring to the llama2-70B model. As of now I am on accelerate 0.27.2 and trl 0.7.10, and I was already using gradient_accumulation_steps=2.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions bot closed this as completed Apr 1, 2024
@VIS-WA commented Apr 25, 2024

Hi @smreddy05! Were you able to find a solution to fix the OutOfMemoryError?
I have encountered a similar error where I am able to fine-tune llama2 13B but not gemma 7B (although I was using the Trainer from the Transformers 4.41 library). This error occurs only when evaluation is enabled (do_eval=True); setting it to False makes everything work like a charm.
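
If the OOM only shows up when evaluation runs, two standard TrainingArguments knobs (not Gemma-specific, and only a hedged suggestion) are a smaller eval batch and eval_accumulation_steps, which periodically moves accumulated predictions off the GPU:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=output_dir,         # as in the original script
    evaluation_strategy="epoch",
    per_device_eval_batch_size=1,  # evaluate one packed sequence per GPU at a time
    eval_accumulation_steps=4,     # offload accumulated logits to CPU every few eval steps
    # ... training arguments unchanged from the original script ...
)
```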

@smreddy05 (Author)

@VIS-WA, sorry, I haven't spent time on this. But if we set do_eval=False then we cannot run any evaluation on the validation set, which makes it tricky to judge how good the fine-tuned model is.
