Fine Tuning with LoRA failed during train step #34

Open

arjun-mavonic opened this issue May 9, 2024 · 1 comment

arjun-mavonic commented May 9, 2024

Below is the notebook linked from your blog post (https://huggingface.co/blog/personal-copilot):
https://colab.research.google.com/drive/1Tz9KKgacppA4S6H4eo_sw43qEaC9lFLs?usp=sharing

!git pull
!python train.py \
    --model_name_or_path "bigcode/starcoder" \
    --dataset_name "smangrul/hf-stack-v1" \
    --subset "data" \
    --data_column "content" \
    --splits "train" \
    --seq_length 2048 \
    --max_steps 2000 \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --num_warmup_steps 30 \
    --eval_freq 100 \
    --save_freq 100 \
    --log_freq 25 \
    --num_workers 4 \
    --bf16 \
    --no_fp16 \
    --output_dir "peft-lora-starcoder15B-v2-personal-copilot-A100-40GB-colab" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_peft_lora \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_flash_attn \
    --use_4bit_qunatization \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16"

I am stuck at this step.

Below is the error:

Already up to date.
2024-05-09 20:44:58.617684: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-09 20:44:58.617733: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-09 20:44:58.619695: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-09 20:44:58.630452: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-09 20:45:00.111432: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 494, in <module>
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 348, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--subset', 'data', '--data_column', 'content', '--seq_length', '2048', '--batch_size', '4', '--num_warmup_steps', '30', '--eval_freq', '100', '--save_freq', '100', '--log_freq', '25', '--num_workers', '4', '--no_fp16', '--use_4bit_qunatization']
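This ValueError comes from transformers' HfArgumentParser: it only accepts flags that map to fields on the script's argument dataclasses, and parse_args_into_dataclasses() raises on anything left over. That suggests train.py has renamed or dropped several options since the blog post was written (note that --use_4bit_qunatization also looks misspelled, "qunatization", so double-check the flag name the current train.py actually defines). A minimal sketch of the parser behavior, using a hypothetical one-field dataclass rather than the real train.py arguments:

from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    # hypothetical dataclass; the real train.py defines many more fields
    model_name_or_path: str = field(default="bigcode/starcoder")

parser = HfArgumentParser(ModelArguments)

# A flag with a matching dataclass field parses fine:
(model_args,) = parser.parse_args_into_dataclasses(
    ["--model_name_or_path", "bigcode/starcoder"]
)

# A flag without one is left over, and the parser raises:
# ValueError: Some specified arguments are not used by the
# HfArgumentParser: ['--subset', 'data']
parser.parse_args_into_dataclasses(["--subset", "data"])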
@lydonchandra

You can probably remove those args:

!git pull
!python train.py \
    --model_name_or_path "bigcode/starcoder" \
    --dataset_name "smangrul/hf-stack-v1" \
    --splits "train" \
    --max_steps 2000 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --bf16 \
    --output_dir "peft-lora-starcoder15B-v2-personal-copilot-A100-40GB-colab" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_peft_lora \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_flash_attn \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16"
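(Note: with the 4-bit flag dropped, --use_nested_quant and --bnb_4bit_compute_dtype are likely no-ops, since nested quantization and the compute dtype only apply when 4-bit loading is enabled. If you want QLoRA-style 4-bit loading back, check train.py for the 4-bit flag it currently defines.)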

though there are still some errors after that:

Already up to date.
2024-12-01 21:38:55.640910: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-12-01 21:38:55.658391: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-01 21:38:55.679221: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-01 21:38:55.685518: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-01 21:38:55.700479: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-01 21:38:56.771930: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
tokenizer_config.json: 100% 677/677 [00:00<00:00, 4.44MB/s]
vocab.json: 100% 777k/777k [00:00<00:00, 4.14MB/s]
merges.txt: 100% 442k/442k [00:00<00:00, 6.52MB/s]
tokenizer.json: 100% 2.06M/2.06M [00:00<00:00, 15.6MB/s]
special_tokens_map.json: 100% 532/532 [00:00<00:00, 3.78MB/s]
README.md: 100% 478/478 [00:00<00:00, 2.88MB/s]
(…)-00000-of-00001-31e9377455e783e7.parquet: 100% 30.6M/30.6M [00:00<00:00, 84.3MB/s]
Generating train split: 100% 5905/5905 [00:00<00:00, 13936.15 examples/s]
Size of the train set: 5314. Size of the validation set: 591
0% 0/400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 495, in <module>
    main(model_args, data_args, training_args)
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 440, in main
    train_dataset, eval_dataset = create_datasets(
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 262, in create_datasets
    chars_per_token = chars_token_ratio(train_data, tokenizer, args.dataset_text_field)
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 138, in chars_token_ratio
    total_characters += len(example[data_column])
KeyError: 'text'
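The KeyError looks like a column-name mismatch rather than a training failure: the traceback shows chars_token_ratio reading example[args.dataset_text_field], which evidently defaults to "text", while smangrul/hf-stack-v1 stores its code under a "content" column (that is what the removed --data_column "content" flag used to configure). Assuming the current train.py exposes --dataset_text_field (the traceback suggests it does), adding --dataset_text_field "content" to the command should get past this step. A minimal sketch of the mismatch, with a hypothetical example row:

# hypothetical row shaped like smangrul/hf-stack-v1
example = {"content": "def hello():\n    pass\n"}

data_column = "text"        # the script's apparent default for dataset_text_field
# example[data_column]      # -> KeyError: 'text' (no such column)

data_column = "content"     # matches the dataset; pass --dataset_text_field "content"
total_characters = len(example[data_column])  # what chars_token_ratio needs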
