This repo gathers the work done by the BBO team @ GERAD on the Hyperparameter Optimization (HPO) project, in the context of the Alliance with Huawei-Canada, between September 2023 and February 2024.
The PDF document `approach.pdf` in the `docs` folder thoroughly describes the theory behind our work, as well as our approach. It is recommended to read it first in order to fully understand what is undertaken here.
In the following, the reader is assumed to be familiar with the theory developed in this document, especially with the Transformers architecture [Vas+17], the concept of instruction-tuning, the LoRA fine-tuning method [Hu+21], 4 widespread NLP tests (MMLU, BBH, DROP and HumanEval) and the family of LLaMA language models [Touv+23].
We perform the fine-tuning of the 7B-parameter variant of LLaMA 2 on a 53k-sized dataset of instructions with the LoRA fine-tuning method. Recall that the LoRA method relies on the following quasi-equality:

$$W \simeq W_0 + \frac{\alpha}{r} BA,$$

with $W_0 \in \mathbb{R}^{d \times k}$ the frozen pretrained weight matrix, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ the trainable low-rank matrices, $\alpha$ a scaling factor and $r \ll \min(d, k)$.
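As a minimal PyTorch sketch of this reparametrization (dimensions and values are placeholders; our pipeline relies on the PEFT implementation, not on this hand-rolled version):

```python
import torch

# Hypothetical dimensions for illustration; r is the LoRA rank.
d, k, r = 4096, 4096, 8
alpha = 16  # LoRA scaling factor

W0 = torch.randn(d, k)   # frozen pretrained weight matrix
A = torch.randn(r, k)    # trainable, Gaussian-initialized
B = torch.zeros(d, r)    # trainable, zero-initialized (so the update starts at 0)

# The adapted weight: only A and B (r * (d + k) parameters in total)
# are trained, instead of the full d * k matrix.
W = W0 + (alpha / r) * (B @ A)
```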
We seek to optimize the choice of 4 hyperparameters (HPs) within this context: the LoRA rank $r$, the scaling factor $\alpha$, the LoRA dropout rate and the learning rate.
The HuggingFace API is central to our experiment: it implements the language models as well as the training and test procedures. See the Transformers and PEFT libraries especially.
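For orientation, here is a minimal sketch of how the 4 HPs map onto these libraries. The values are placeholders; this is not our actual training script (which is `train/train_with_LoRa_mixed_data.py`):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Requires Meta AI authorization on the HuggingFace Hub (see below).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Three of the four HPs live in the LoRA configuration...
lora_config = LoraConfig(
    r=8,               # LoRA rank
    lora_alpha=16,     # LoRA scaling factor
    lora_dropout=0.1,  # LoRA dropout
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# ...and the fourth is the learning rate of the training run.
training_args = TrainingArguments(output_dir="out", learning_rate=1e-4)
```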
Global requirements are:
- Python >= 3.9
- All dependencies are listed in the `requirements.txt` file. Run `python3.9 -m pip install -r requirements.txt` to install all libraries at once.
- If you wish to use a LLaMA model through the HuggingFace Transformers API, you will need to be authorized by Meta AI. Follow the instructions on the HuggingFace webpage dedicated to the model you want to use (for the 7B version we used, see here).
- Plan for a significant amount of GPU memory. At GERAD, only the `atlas` server was able to run our experiments. You will need the A100 GPUs with 80 GB of memory (40 GB will not be enough). :white_check_mark: you CAN run an optimization with fewer than 4 GPUs, but expect the computation to be slower. :x: you CANNOT run an optimization with any of the 40 GB GPUs: the size of the model and the data will cause an out-of-memory error.
- NOMAD 4
- `bbo` contains all files needed to reproduce the experiment described in experiment 1;
- `bbo2` contains all files needed to reproduce the experiment described in experiment 2;
- `bbo3` contains all files needed to reproduce the experiment described in experiment 3;
- `blind_eval` contains the data and scripts we used to generate text answers from our models in order to conduct the survey for human evaluation;
- `data` contains the data used for training and validation;
- `eval` contains scripts useful to run the evaluation of the model on a dataset;
- `nni` contains the files needed to reproduce an experiment that is described in more detail in that folder;
- `plot` contains a script to draw a parallel plot from a statistics file;
- `train` contains the scripts used to run the training phase of our pipeline.
The pipeline is usually broken down into 2 files:
`bb.py`:
- reads the encoded values of the 4 HPs given by NOMAD;
- if a history file is provided, checks whether a very close point has already been evaluated and, if so, returns the associated blackbox value;
- calls `eval.py` to perform the training and evaluation phases;
- reads the result of the evaluation from a file and returns it to NOMAD.
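For context, a minimal sketch of how such a blackbox typically talks to NOMAD: NOMAD writes the evaluation point to a file whose path is passed as the first argument, and reads the objective value from standard output. The file names and the call to `eval.py` below are hypothetical simplifications of the actual scripts:

```python
import subprocess
import sys

def main() -> None:
    # NOMAD passes the evaluation point as a file containing the
    # encoded HP values separated by spaces.
    with open(sys.argv[1]) as f:
        x = [float(v) for v in f.read().split()]

    # (History lookup to avoid re-evaluating a very close point omitted.)

    # Delegate training + evaluation to eval.py, which is assumed
    # here to write its result to a known file.
    subprocess.run([sys.executable, "eval.py"] + [str(v) for v in x], check=True)
    with open("result.txt") as f:
        objective = float(f.read().strip())

    # NOMAD reads the objective value from stdout.
    print(objective)

if __name__ == "__main__":
    main()
```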
`eval.py`:
- reads the encoded values of the 4 HPs given by NOMAD and translates them into actual values (see the encoding for each experiment);
- chooses the GPUs that will be used for computation (variable `cuda_visible_devices` at the beginning of the file). Before launching an experiment, check which GPUs are available with `nvidia-smi`; a script like nvidia-htop can be useful if you need to see who is running processes on the GPUs (and know how you can share the resources);
- runs the following command to perform training and validation with the appropriate HPs. Elements that should be adapted to your local setup or customized appear between angle brackets, as `<the element>`.
```bash
source <venv_path>; export HF_HOME=<hf_path>; export WANDB_MODE=offline;
CUDA_VISIBLE_DEVICES=<gpus> torchrun -r 2 \
    --log_dir <log_path> --nproc_per_node <nb_gpus> train/train_with_LoRa_mixed_data.py \
    --model_name_or_path <pt_model> --data_path_train <training_data> --do_eval <eval> \
    --data_path_eval <eval_data> --bf16 True --output_dir <checkpoints_path> \
    --num_train_epochs <epochs> --per_device_train_batch_size <training_batch_size> \
    --per_device_eval_batch_size <eval_batch_size> --gradient_accumulation_steps 8 \
    --evaluation_strategy <eval_strategy> --save_strategy "steps" --save_steps 2000 \
    --save_total_limit 1 --learning_rate <lr> --weight_decay 0. --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --lora_rank <rank> \
    --lora_dropout <dropout> --lora_alpha <alpha>; deactivate
```
This command line will:
- load the Python virtual environment in `venv_path` (which should point to the `activate` file);
- tell the HuggingFace library where the language models have been downloaded (`hf_path`);
- run the fine-tuning through the `train/train_with_LoRa_mixed_data.py` script, which will:
  - use the GPUs listed in `gpus` (in the format `0,1,2,3`, for instance); `nb_gpus` should equal the number of these GPUs;
  - use the pretrained model denoted by `pt_model` (use the name as displayed on the HuggingFace Hub);
  - take the hyperparameter values from `rank`, `alpha`, `dropout` and `lr`;
  - use the training data given in `training_data`;
  - run `epochs` training epochs with a batch size of `training_batch_size`;
  - if `eval` is set to `True`:
    - evaluate the model on the dataset given in `eval_data` with a batch size of `eval_batch_size`;
    - run this evaluation periodically, depending on the value of `eval_strategy` (`"steps"`: at every training step, far too frequent to be useful; `"epoch"`: at every training epoch);
- store the outputs as follows:
  - the logs from PyTorch are saved in `log_path`;
  - the LoRA weights output by the training (the checkpoints) are saved in `checkpoints_path`.
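Note that with gradient accumulation the effective training batch size is `per_device_train_batch_size × nb_gpus × gradient_accumulation_steps`: with the fixed `--gradient_accumulation_steps 8` above, a per-device batch size of 2 on 4 GPUs, for instance, yields an effective batch size of 64.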
Feel free to change some parameters as you wish.
Giving `$python3 bb.py` as the blackbox to NOMAD will suffice to run this pipeline. As NOMAD does not natively handle some of the types we used to define the possible values of each HP, we encoded them for NOMAD and translated them back in the Python script. Each encoding is described in the appropriate section below.
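As an illustration only, such a decoding step could look like the following hypothetical helper. The mappings shown here are placeholders; the actual encodings are those described per experiment below:

```python
def decode_hps(x: list[float]) -> dict:
    """Translate NOMAD's encoded variables into actual HP values.

    Hypothetical mappings for illustration; the real ones are
    described per experiment below.
    """
    return {
        "rank": 2 ** int(x[0]),        # integer exponent -> power-of-two rank
        "dropout": int(x[1]) / 100.0,  # integer percentage -> dropout in [0, 1]
        "alpha": int(x[2]),            # used as-is ("no need to encode")
        "lr": 10.0 ** x[3],            # real exponent -> log-scaled learning rate
    }
```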
Each experiment sets a specific objective function and uses specific sets of values for each variable.
This experiment uses the MMLU score of the language model as the objective function to maximize. It uses the whole Alpaca dataset. Possible and initial values as well as encodings for each HP are as follows:
| HP | Possible values | Initial value | NOMAD type | NOMAD encoding |
|---|---|---|---|---|
| rank | | | int | |
| dropout | | | int | |
| alpha | | | int | no need to encode |
| learning rate | | | float | |
It runs 3 training epochs for each of the 50 blackbox evaluations.
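For reference, a minimal NOMAD 4 parameter file for such a run could look like the sketch below. The input types, bounds and starting point are placeholders, not the values we used; note that NOMAD minimizes, so for this experiment the blackbox would return the negated MMLU score.

```text
DIMENSION      4                   # rank, dropout, alpha, learning rate (encoded)
BB_EXE         "$python3 bb.py"    # the '$' tells NOMAD to run a system command
BB_OUTPUT_TYPE OBJ                 # a single objective, which NOMAD minimizes
BB_INPUT_TYPE  (I I I R)           # placeholder: three integers and one real
X0             (3 10 16 -4)        # placeholder starting point
LOWER_BOUND    (0 0 1 -6)          # placeholder bounds
UPPER_BOUND    (6 50 64 -2)        # placeholder bounds
MAX_BB_EVAL    50                  # 50 blackbox evaluations, as stated above
```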
This experiment uses the loss of the model on the validation dataset as the objective function to minimize. It uses the training and validation datasets in `data`. Possible and default values, as well as encodings, are the same as in experiment 1.
An experiment 2b has been run; it is referred to as experiment 2 in the paper. It extends the set of possible rank values.
The objective function is the same as in experiment 2, and training and validation are performed on the same datasets. This experiment was run after the publication of the paper, in order to see how NOMAD behaves when constrained to low ranks. It has also been noticed that the LoRA paper mentions that very low ranks can already perform well.
| HP | Possible values | Initial value | NOMAD type | NOMAD encoding |
|---|---|---|---|---|
| rank | | | int | |
| dropout | | | int | |
| alpha | | | int | no need to encode |
| learning rate | | | float | |
- Work on the learning rate and the number of epochs. It has been observed that low learning rates with 3 training epochs lead to failed training and thus poor performance. One could either
  - keep 2-3 epochs for a reasonable computing time and seek to refine the bounds on the learning rate,
  - or add the number of training epochs as a variable and let NOMAD optimize it.
- Rethink generalization to downstream tasks. It was initially hoped that fine-tuning on our 54k-sized training set would yield good performance on downstream tasks, since the literature reports that such generalization is often observed. This did not happen. Several directions:
  - give up on the universality of the model and focus on a specific family of tasks with appropriate data; or keep aiming at universality, but with more diverse data;
- choose a more appropriate objective function than the validation loss.
- compute the model's score on a subset of the MMLU, BBH, DROP and HumanEval datasets (for instance) and use this score as the objective function, keeping the rest of the datasets for testing,
- see the implementation of MMLU, BBH, DROP and HumanEval tests in InstructEval,
- review the literature on evaluating large language models to increase the number of metrics we use, and then maybe fully use MMLU or another test as the objective function to maximize, but test on other benchmarks.
For any questions about the theory or the code presented in this repo, you may contact:
[Vas+17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Curran Associates Inc.

[Hu+21] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685.

[Touv+23] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. Singh Koura, M-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.