
instruct_finetune_gpt2

This repo is the report for Assignment 2 of the course NLP702 - Advanced NLP. In this assignment, we instruct-finetune the GPT-2 model and analyze its performance before and after the finetuning process. The assignment requires 5000 data samples, which can be generated by ChatGPT or by the GPT-2 model itself.
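For the before/after analysis, one simple approach is to prompt both the base model and the finetuned checkpoint with the same instruction and compare the generations. A minimal sketch (not part of the repo; the finetuned checkpoint path is a placeholder):

# A minimal sketch of the before/after comparison: generate a response to the same
# instruction with the base GPT-2 model and with the finetuned checkpoint (placeholder path).
from transformers import pipeline

prompt = "Instruction: Give three tips for staying healthy.\nResponse:"

for checkpoint in ["gpt2", "<path_to_finetuned_checkpoint>"]:
    generator = pipeline("text-generation", model=checkpoint)
    output = generator(prompt, max_new_tokens=64, do_sample=True, top_p=0.9)
    print(checkpoint, "->", output[0]["generated_text"])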

Setup

Run the following commands to pull the stanford_alpaca submodule, which is used to generate the dataset:

git pull --recurse-submodules
git submodule update --init --recursive

Get dataset

Subset of alpaca data

A subset of 5k samples of the alpaca data is already available in the folder ./modules/stanford_alpaca/data. If you want to generate a new subset, run

cd ./modules/stanford_alpaca
python -m generate_instruction generate_instruction_alpaca
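A minimal sketch for loading and inspecting the subset, assuming it keeps the standard alpaca JSON format, i.e. a list of records with instruction, input, and output fields (the exact file name written by the script is a placeholder here):

# Assumption: the subset keeps the standard alpaca JSON format, a list of
# {"instruction", "input", "output"} records; the file name below is a placeholder.
import json

with open("./modules/stanford_alpaca/data/<subset_file>.json") as f:
    data = json.load(f)

print(len(data), "samples")
print(data[0]["instruction"])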

Generate new data with ChatGPT

Run

cd ./modules/stanford_alpaca
python -m generate_instruction generate_instruction_following_data

With this option, you need to pay for the OpenAI API that powers ChatGPT.
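Before running it, make sure your API key is available to the script. A minimal pre-flight check (assuming the generation script uses the OpenAI Python client, which reads the key from the OPENAI_API_KEY environment variable by default):

# Assumption: the generation script relies on the OpenAI Python client, which by default
# reads the key from the OPENAI_API_KEY environment variable.
import os

if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("Set OPENAI_API_KEY before running generate_instruction_following_data.")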

Generate new data with GPT2

In progress...

For this option, you need to prepare your own my_seed_tasks.jsonl and my_prompt.txt files.
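A minimal sketch of writing one my_seed_tasks.jsonl entry, assuming it follows the same schema as seed_tasks.jsonl in stanford_alpaca (id, name, instruction, instances, is_classification):

# Assumption: my_seed_tasks.jsonl uses the same schema as stanford_alpaca's seed_tasks.jsonl.
import json

seed_task = {
    "id": "seed_task_0",
    "name": "example_task",
    "instruction": "Summarize the following paragraph in one sentence.",
    "instances": [{"input": "<paragraph>", "output": "<one-sentence summary>"}],
    "is_classification": False,
}

with open("my_seed_tasks.jsonl", "w") as f:
    f.write(json.dumps(seed_task) + "\n")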

Run

cd ./modules/stanford_alpaca
python -m generate_instruction generate_instruction_following_data_gpt2

Note: the data generated by GPT-2 is currently of very poor quality and cannot be used for finetuning. The script above is only used to obtain one example of generated text; you can see the result in the file result.txt.
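For reference, this kind of raw GPT-2 generation can be reproduced with a few lines of transformers code; this is only an illustrative sketch, not the actual generate_instruction_following_data_gpt2 implementation:

# Illustrative sketch only (not the actual script): prompting the base GPT-2 model with a
# seed prompt shows why the raw outputs are too noisy to be used as instruction data.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Come up with a list of diverse task instructions:\n1."
result = generator(prompt, max_new_tokens=100, do_sample=True, top_p=0.95)
print(result[0]["generated_text"])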

Finetune GPT

First, you need to download the model to your local disk (in case you are running on a cluster).
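One way to do this is with the Hugging Face transformers library, saving the weights into the directory you will later export as MODEL_PATH (a sketch, not a required step of the repo):

# A minimal sketch for caching GPT-2 locally; save it into the directory that will be
# exported as MODEL_PATH below.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.save_pretrained("<your_model_path>")
tokenizer.save_pretrained("<your_model_path>")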

Then cd into the modules/stanford_alpaca folder and set the following environment variables.

export MODEL_PATH=<your_model_path>
export OUTPUT_DIR=<output_path>
export NUM_GPU=<number_of_available_gpus>

Run the following command to start finetuning:

bash finetune_gpt.sh
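Conceptually, the finetuning is supervised causal-LM training of GPT-2 on the instruction/response pairs. The sketch below illustrates that idea with the Hugging Face Trainer; it is a hypothetical outline under the stated assumptions, not the contents of finetune_gpt.sh:

# Hypothetical outline of instruction finetuning GPT-2 with the Hugging Face Trainer;
# the actual run is launched through finetune_gpt.sh.
import json
import os
import torch
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained(os.environ["MODEL_PATH"])
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained(os.environ["MODEL_PATH"])

# Assumption: the 5k subset is a list of {"instruction", "input", "output"} records.
with open("./data/<subset_file>.json") as f:
    records = json.load(f)

texts = [f"Instruction: {r['instruction']}\nInput: {r['input']}\nResponse: {r['output']}"
         for r in records]
encodings = [tokenizer(t, truncation=True, max_length=512) for t in texts]

class InstructionDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(encodings)
    def __getitem__(self, i):
        return {k: torch.tensor(v) for k, v in encodings[i].items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir=os.environ["OUTPUT_DIR"],
                           num_train_epochs=3, per_device_train_batch_size=4),
    train_dataset=InstructionDataset(),
    # mlm=False makes the collator build causal-LM labels from the padded input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
trainer.save_model(os.environ["OUTPUT_DIR"])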