This directory includes several complex conda environments. You should be able to use the various .yml
files to create new conda environments. If you don't already use conda, you need to install conda from here.
conda env create --name <envName> --file <envName>.yml
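For example, to build and activate the main CUDA environment described below (a sketch; swap in whichever environment file you need):
conda env create --name qagHfCuda --file qagHfCuda.yml
conda activate qagHfCuda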
The two main environments in this folder are for the SoC GPUs and for a local (conda) Python installation. The CUDA
version is for the SoC GPUs. Note that you may have to install some CUDA toolkit software before the environment installs correctly.
Here is an explanation of each saved environment:
- qagHfCuda.yml - the main environment to use for this repository and thesis. Use it for training LLaMA 2 with the trainer, data formatter, data processor, and other scripts in the src/ directory.
- qagHf.yml - the non-CUDA, local version of the environment. It should allow for data processing and inference with the final model.
- fastT5.yml - used for generating ONNX models of the Potsawee T5 QAG model. Creating the models seemed to work, but opening them and running inference never worked.
- qagLmqg.yml && qagLmqgCuda.yml - used to test the lmqg python package for model training. This did not work with LLaMA 2.
- qagT5.yml && qagT5Cuda.yml - the main environments used during the attempts to train T5 for QAG. They never fully worked, but showed promise. They also include packages for Optimum ONNX generation, which did work.
Otherwise, start a conda environment from scratch with:
conda create -n aqg
conda activate aqg
conda install python=3.11.5
Install the Python packages that you need.
For Optimum and T5:
pip install datasets evaluate fastt5 huggingface kaggle pandas numpy onnx onnxruntime optimum tokenizers torch transformers nltk
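As an optional sanity check that the ONNX/Optimum stack imported correctly (a minimal sketch, not part of the original workflow):
python -c "import onnx, onnxruntime, optimum, transformers; print('imports ok')"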
For LLaMA 2 training:
pip install datasets evaluate huggingface numpy pandas transformers tokenizers torch
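To confirm the install and check whether PyTorch can see a GPU (a simple, optional check):
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"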
It's up to you to figure out how to install the CUDA toolkit on your machine.
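Once it is installed, these standard commands confirm that the toolkit and driver are visible:
nvcc --version
nvidia-smi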
To fine-tune a new model, set up the proper config for the model type you are training. There are three types:
- AE
- QG
- E2E
Then run trainer.py. The run stats should be available in the pbe_qag team on wandb.ai. Each model type is its own "project."
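If runs do not appear on wandb.ai, you may need to install and authenticate the wandb client first (this assumes the trainer uses the standard wandb integration):
pip install wandb
wandb login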
To run AE and QG training back-to-back, just start with type in qag.ini set to AE and run:
python trainer.py; python trainer.py
Then, once the first training has started, simply change type to QG and save the file. This will run AE training first and, when it completes, will run another training, but this time QG, as that is what qag.ini now specifies.
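If you would rather not edit qag.ini by hand mid-run, a scripted equivalent might look like this (a sketch that assumes qag.ini contains a line of the form type = AE and that GNU sed is available; adjust the pattern to match your file):
python trainer.py                              # AE training, as currently configured
sed -i 's/^type *= *AE/type = QG/' qag.ini     # flip the type for the second run
python trainer.py                              # QG training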
To run the model, specify whether you want pipeline or end-to-end generation in the config file, qag.ini. A value of AE or QG will result in pipeline generation; E2E will enable end-to-end generation. Then just run generator.py.
python generator.py
You will be in an inference loop where you can enter a verse reference for generation or press enter for a random verse.
The configuration for the project is in qag.ini. This file determines the current model source, data source, and inference prompt used. Data and logs are kept in the data folder.
The resources folder is for miscellaneous project resources.