This repo contains the report for Assignment 2 of the course NLP702 - Advanced NLP. In this assignment, we instruction-finetune the GPT-2 model and analyze its performance before and after finetuning. The assignment requires 5,000 data samples, which can be generated by ChatGPT or by the GPT-2 model itself.
Run the following commands to pull the stanford_alpaca submodule, which is used to generate the dataset:
git pull --recurse-submodules
git submodule update --init --recursive
There is already a subset of 5k samples of Alpaca data in the folder [./modules/stanford_alpaca/data]. If you want to generate a new subset, run
cd ./modules/stanford_alpaca
python -m generate_instruction generate_instruction_alpaca
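To sanity-check a generated subset, a sketch like the following can be used. The `instruction`/`input`/`output` field names follow the standard Alpaca data format; the records and the output file name here are made-up placeholders, not the repo's actual files.

```python
import json
import random

# Hypothetical records mimicking the Alpaca format: each sample has
# "instruction", "input", and "output" fields (the standard Alpaca
# schema); the contents below are placeholders.
data = [
    {"instruction": f"Task {i}", "input": "", "output": f"Answer {i}"}
    for i in range(20000)
]

random.seed(0)                      # reproducible subset
subset = random.sample(data, 5000)  # draw the 5k samples the assignment needs

# Persist the subset as a single JSON array.
with open("alpaca_5k_subset.json", "w") as f:
    json.dump(subset, f, indent=2)

print(len(subset))  # 5000
```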
Run
cd ./modules/stanford_alpaca
python -m generate_instruction generate_instruction_following_data
With this option, you need to pay for the ChatGPT API.
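The OpenAI client typically reads your key from the standard `OPENAI_API_KEY` environment variable, so before running the generation script you would set something like the following (the placeholder must be replaced with your real key):

```shell
# Assumed setup: the OpenAI client reads the key from this standard
# environment variable; replace the placeholder with your real key.
export OPENAI_API_KEY=<your_openai_api_key>
```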
In progress...
For this option, you need to prepare your own my_seed_tasks.jsonl and my_prompt.txt files.
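As a rough guide to what my_seed_tasks.jsonl should contain, the sketch below writes and reads one entry. The field names follow the schema used by Alpaca's own seed_tasks.jsonl (one JSON object per line); the concrete task content is invented for illustration.

```python
import json

# One illustrative seed task; field names follow Alpaca's
# seed_tasks.jsonl schema, but the task itself is made up.
seed_task = {
    "id": "seed_task_0",
    "name": "capital_lookup",
    "instruction": "Name the capital city of the given country.",
    "instances": [{"input": "France", "output": "Paris"}],
    "is_classification": False,
}

# Write my_seed_tasks.jsonl: one JSON object per line, not a JSON array.
with open("my_seed_tasks.jsonl", "w") as f:
    f.write(json.dumps(seed_task) + "\n")

# Read it back line by line (JSONL format).
tasks = [json.loads(line) for line in open("my_seed_tasks.jsonl")]
print(tasks[0]["name"])  # capital_lookup
```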
Run
cd ./modules/stanford_alpaca
python -m generate_instruction generate_instruction_following_data_gpt2
Note: the data currently generated by GPT-2 is of poor quality, so we cannot use it for finetuning. The script above is only used to produce one example of generated text; you can see the result in the file result.txt.
First, you need to download the model to your local disk (in case you are using a cluster). Then cd into the modules/stanford_alpaca folder and set the following environment variables:
export MODEL_PATH=<your_model_path>
export OUTPUT_DIR=<output_path>
export NUM_GPU=<number_of_available_gpus>
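Before launching the script, the exports above can be sanity-checked from Python. This is only a convenience sketch: the variable names match the exports, but the values set here are placeholders standing in for your real paths.

```python
import os

# Placeholder values standing in for the real exports above.
os.environ.setdefault("MODEL_PATH", "/path/to/gpt2")
os.environ.setdefault("OUTPUT_DIR", "/path/to/output")
os.environ.setdefault("NUM_GPU", "1")

# Fail early with a clear message if any variable is missing or empty.
for var in ("MODEL_PATH", "OUTPUT_DIR", "NUM_GPU"):
    value = os.environ.get(var)
    if not value:
        raise SystemExit(f"{var} is not set")
    print(f"{var}={value}")
```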
Run the following command to start finetuning:
bash finetune_gpt.sh