Fine Tuning with LoRA failed during train step #34

Open

arjun-mavonic opened this issue May 9, 2024 · 1 comment

arjun-mavonic commented May 9, 2024

Below is the notebook linked from your blog post (https://huggingface.co/blog/personal-copilot):
https://colab.research.google.com/drive/1Tz9KKgacppA4S6H4eo_sw43qEaC9lFLs?usp=sharing

!git pull
!python train.py \
    --model_name_or_path "bigcode/starcoder" \
    --dataset_name "smangrul/hf-stack-v1" \
    --subset "data" \
    --data_column "content" \
    --splits "train" \
    --seq_length 2048 \
    --max_steps 2000 \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --num_warmup_steps 30 \
    --eval_freq 100 \
    --save_freq 100 \
    --log_freq 25 \
    --num_workers 4 \
    --bf16 \
    --no_fp16 \
    --output_dir "peft-lora-starcoder15B-v2-personal-copilot-A100-40GB-colab" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_peft_lora \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_flash_attn \
    --use_4bit_qunatization \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16"

I am stuck at this step.

Below is the error:

Already up to date.
2024-05-09 20:44:58.617684: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-09 20:44:58.617733: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-09 20:44:58.619695: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-09 20:44:58.630452: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-09 20:45:00.111432: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 494, in <module>
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 348, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--subset', 'data', '--data_column', 'content', '--seq_length', '2048', '--batch_size', '4', '--num_warmup_steps', '30', '--eval_freq', '100', '--save_freq', '100', '--log_freq', '25', '--num_workers', '4', '--no_fp16', '--use_4bit_qunatization']
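This ValueError comes from transformers' HfArgumentParser: it only accepts flags that map to fields on the script's argument dataclasses, and parse_args_into_dataclasses() raises on anything left over. That suggests train.py has renamed or dropped several options since the blog post was written (note that --use_4bit_qunatization also looks misspelled, "qunatization", so double-check the flag name the current train.py actually defines). A minimal sketch of the parser behavior, using a hypothetical one-field dataclass rather than the real train.py arguments:

from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    # hypothetical dataclass; the real train.py defines many more fields
    model_name_or_path: str = field(default="bigcode/starcoder")

parser = HfArgumentParser(ModelArguments)

# A flag with a matching dataclass field parses fine:
(model_args,) = parser.parse_args_into_dataclasses(
    ["--model_name_or_path", "bigcode/starcoder"]
)

# A flag without one is left over, and the parser raises:
# ValueError: Some specified arguments are not used by the
# HfArgumentParser: ['--subset', 'data']
parser.parse_args_into_dataclasses(["--subset", "data"])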
@lydonchandra

You can probably remove those args:

!git pull
!python train.py \
    --model_name_or_path "bigcode/starcoder" \
    --dataset_name "smangrul/hf-stack-v1" \
    --splits "train" \
    --max_steps 2000 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --bf16 \
    --output_dir "peft-lora-starcoder15B-v2-personal-copilot-A100-40GB-colab" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_peft_lora \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_flash_attn \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16"
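(Note: with the 4-bit flag dropped, --use_nested_quant and --bnb_4bit_compute_dtype are likely no-ops, since nested quantization and the compute dtype only apply when 4-bit loading is enabled. If you want QLoRA-style 4-bit loading back, check train.py for the 4-bit flag it currently defines.)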

though there are still some errors after that:

Already up to date.
2024-12-01 21:38:55.640910: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-12-01 21:38:55.658391: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-01 21:38:55.679221: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-01 21:38:55.685518: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-01 21:38:55.700479: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-01 21:38:56.771930: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
tokenizer_config.json: 100% 677/677 [00:00<00:00, 4.44MB/s]
vocab.json: 100% 777k/777k [00:00<00:00, 4.14MB/s]
merges.txt: 100% 442k/442k [00:00<00:00, 6.52MB/s]
tokenizer.json: 100% 2.06M/2.06M [00:00<00:00, 15.6MB/s]
special_tokens_map.json: 100% 532/532 [00:00<00:00, 3.78MB/s]
README.md: 100% 478/478 [00:00<00:00, 2.88MB/s]
(…)-00000-of-00001-31e9377455e783e7.parquet: 100% 30.6M/30.6M [00:00<00:00, 84.3MB/s]
Generating train split: 100% 5905/5905 [00:00<00:00, 13936.15 examples/s]
Size of the train set: 5314. Size of the validation set: 591
0% 0/400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 495, in <module>
    main(model_args, data_args, training_args)
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 440, in main
    train_dataset, eval_dataset = create_datasets(
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 262, in create_datasets
    chars_per_token = chars_token_ratio(train_data, tokenizer, args.dataset_text_field)
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 138, in chars_token_ratio
    total_characters += len(example[data_column])
KeyError: 'text'
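The KeyError looks like a column-name mismatch rather than a training failure: the traceback shows chars_token_ratio reading example[args.dataset_text_field], which evidently defaults to "text", while smangrul/hf-stack-v1 stores its code under a "content" column (that is what the removed --data_column "content" flag used to configure). Assuming the current train.py exposes --dataset_text_field (the traceback suggests it does), adding --dataset_text_field "content" to the command should get past this step. A minimal sketch of the mismatch, with a hypothetical example row:

# hypothetical row shaped like smangrul/hf-stack-v1
example = {"content": "def hello():\n    pass\n"}

data_column = "text"        # the script's apparent default for dataset_text_field
# example[data_column]      # -> KeyError: 'text' (no such column)

data_column = "content"     # matches the dataset; pass --dataset_text_field "content"
total_characters = len(example[data_column])  # what chars_token_ratio needs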
