diff --git a/README.md b/README.md
index 00ca765..083500a 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,140 @@
-# LLM_Judge_ku
\ No newline at end of file
+# LLM Judge
+
+This package provides Vicuna-Japanese questions and prompts for evaluating your models with LLM-as-a-judge.
+To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
+
+## Contents
+- [Install](#install)
+- [Review Pre-Generated Model Answers and Judgments](#review-pre-generated-model-answers-and-judgments)
+
+## Install
+```
+git clone https://github.com/hitoshizuku7/LLM_Judge_ku.git
+cd LLM_Judge_ku
+pip install -e .
+pip install openai anthropic ray
+cd fastchat/llm_judge
+```
+
+### Evaluate a model on jp-bench (Vicuna-Japanese)
+
+#### Step 1. Generate model answers to jp-bench questions
+```
+python gen_model_answer.py \
+--base_model [MODEL-PATH] \
+--lora_model [LORA-PATH] \
+--model-id [MODEL-ID] \
+--with_prompt \
+--gpus [GPU_Num] \
+--max_new_tokens [NUM of NEW TOKENS] \
+--benchmark jp_bench
+```
+Arguments:
+ - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
+ - `[LORA-PATH]` is the path to the LoRA weights, if needed.
+ - `[MODEL-ID]` is a name you give to the model.
+ - `[GPU_Num]` specifies which GPU(s) to use.
+
+e.g.,
+```
+python gen_model_answer.py \
+--model-path rinna/japanese-gpt-neox-3.6b-instruction-ppo \
+--model-id rinna-3.6b-ppo \
+--with_prompt \
+--gpus 0 \
+--max_new_tokens 2048 \
+--benchmark jp_bench
+```
+The answers will be saved to `data/jp_bench/model_answer/[MODEL-ID].jsonl`.
+
+You can also specify `--num-gpus-per-model` for model parallelism (needed for large 65B models) and `--num-gpus-total` to parallelize answer generation across multiple GPUs.
+
+#### Step 2. Generate GPT-4 judgments
+There are several ways to use GPT-4 as a judge, such as pairwise win-rate comparison and single-answer grading.
+As in MT-bench, we recommend single-answer grading as the default mode.
+This mode asks GPT-4 to grade the model's answer directly, without a pairwise comparison.
+For each turn, GPT-4 gives a score out of 10; we then compute the average score over all turns.
+
+```
+OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
+--bench-name "jp_bench" \
+--mode [pairwise-all, single, pairwise-baseline] \
+--model-list [LIST-OF-MODEL-ID] \
+--parallel [num-concurrent-api-call]
+```
+
+e.g.,
+```
+OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
+--bench-name "jp_bench" \
+--mode single \
+--model-list rinna-3.6b rinna-3.6b-ppo \
+--parallel 2
+```
+The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_single.jsonl`.
+
+#### Step 3. Show jp-bench scores
+
+- Show the scores for selected models
+  ```
+  python show_result.py \
+  --bench-name "jp_bench" \
+  --mode single \
+  --model-list rinna-3.6b rinna-3.6b-ppo
+  ```
+- Show all scores
+  ```
+  python show_result.py
+  ```
+
+---
+
+### Other grading options
+Besides score-based single-answer grading, we also support two additional grading options based on win rates (a sketch of how such win rates can be tallied is shown after this list):
+- `pairwise-baseline`: run pairwise comparison against a baseline model.
+- `pairwise-all`: run pairwise comparison between all model pairs on all questions.
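+
+For both pairwise modes, a model's win rate is simply the fraction of its comparisons that the judge decides in its favor, with ties counted as half a win. `show_result.py` aggregates this for you; the snippet below is only a minimal sketch of the idea, and it assumes, purely for illustration, that each judgment record is a JSON line with `model_1`, `model_2`, and `winner` fields (the actual field names in the judgment file may differ).
+
+```python
+import json
+from collections import defaultdict
+
+def win_rates(judgment_path):
+    """Tally hypothetical pairwise judgment records into per-model win rates."""
+    wins = defaultdict(float)   # wins per model (ties count as 0.5)
+    games = defaultdict(int)    # comparisons per model
+    with open(judgment_path, encoding="utf-8") as f:
+        for line in f:
+            record = json.loads(line)  # assumed shape: {"model_1": ..., "model_2": ..., "winner": ...}
+            m1, m2, winner = record["model_1"], record["model_2"], record["winner"]
+            games[m1] += 1
+            games[m2] += 1
+            if winner == "model_1":
+                wins[m1] += 1
+            elif winner == "model_2":
+                wins[m2] += 1
+            else:  # tie or inconclusive judgment: split the point
+                wins[m1] += 0.5
+                wins[m2] += 0.5
+    return {model: wins[model] / games[model] for model in games}
+
+if __name__ == "__main__":
+    print(win_rates("data/jp_bench/model_judgment/gpt-4_pair.jsonl"))
+```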
+
+#### Option 2: Pairwise comparison against a baseline (default: gpt-3.5-turbo)
+
+- Generate GPT-4 judgments
+```
+OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
+--bench-name "jp_bench" \
+--mode pairwise-baseline \
+--model-list rinna-3.6b rinna-3.6b-ppo \
+--parallel 2
+```
+The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_pair.jsonl`.
+
+- Show results
+```
+python show_result.py \
+--bench-name "jp_bench" \
+--mode pairwise-baseline
+```
+
+#### Option 3: Run GPT-4 judge with all pair comparisons
+
+Another option is to run pairwise comparisons over all possible pairs.
+This can become expensive as the number of models grows, but it gives you more comprehensive information.
+
+```
+OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
+--bench-name "jp_bench" \
+--mode pairwise-all \
+--model-list [LIST-OF-MODEL-ID] \
+--parallel [num-concurrent-api-call]
+```
+
+```
+python show_result.py \
+--bench-name "jp_bench" \
+--mode pairwise-all
+```
+
+## Sample Outputs
+```
+Question:
+```
diff --git a/fastchat/llm_judge/README.md b/fastchat/llm_judge/README.md
deleted file mode 100644
index ebd0c7d..0000000
--- a/fastchat/llm_judge/README.md
+++ /dev/null
@@ -1,174 +0,0 @@
-# LLM Judge
-
-In this package, you can use Vicuna-Japanese questions and prompts to evaluate your models with LLM-as-a-judge.
-To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
-
-## Contents
-- [Install](#install)
-- [Review Pre-Generated Model Answers and Judgments](#review-pre-generated-model-answers-and-judgments)
-## Install
-```
-git clone https://github.com/hitoshizuku7/LLM_Judge_ku.git
-cd LLM_Judge_ku
-pip install -e .
-pip install openai anthropic ray
-cd fastchat/llm_judge
-```
-
-
-### Evaluate a model on jp-bench (Vicuna-Japanese)
-
-#### Step 1. Generate model answers to jp-bench questions
-```
-python gen_model_answer.py \
---base_model [MODEL-PATH] \
---lora_model [LORA-PATH] \
---model-id [MODEL-ID] \
---with_prompt \
---gpus [GPU_Num] \
---max_new_tokens [NUM of NEW TOKENS] \
---benchmark jp_bench
-```
-Arguments:
- - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
 - - `[LORA-PATH]` is the path to the lora weights if needed.
 - - `[MODEL-ID]` is a name you give to the model.
 - - `[GPU_Num]` denotes which GPU you decide to use
-
-
-e.g.,
-```
-python gen_model_answer.py \
---model-path rinna/japanese-gpt-neox-3.6b-instruction-ppo \
---model-id rinna-3.6b-ppo \
---with_prompt \
---gpus 0 \
---max_new_tokens 2048 \
---benchmark jp_bench
-```
-The answers will be saved to `data/jp_bench/model_answer/[MODEL-ID].jsonl`.
-
-You can also specify `--num-gpus-per-model` for model parallelism (needed for large 65B models) and `--num-gpus-total` to parallelize answer generation with multiple GPUs.
-
-#### Step 2. Generate GPT-4 judgments
-There are several options to use GPT-4 as a judge, such as pairwise winrate and single-answer grading.
-In MT-bench, we recommond single-answer grading as the default mode.
-This mode asks GPT-4 to grade and give a score to model's answer directly without pairwise comparison.
-For each turn, GPT-4 will give a score on a scale of 10. We then compute the average score on all turns.
-
-```
-OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
---bench-name "jp_bench" \
---mode [pairwise-all, single, pairwise-baseline] \
---model-list [LIST-OF-MODEL-ID] \
---parallel [num-concurrent-api-call]
-```
-
-e.g.,
-```
-OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
---bench-name "jp_bench" \
---mode single \
---model-list rinna-3.6b rinna-3.6b-ppo \
---parallel 2
-```
-The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_single.jsonl`
-
-#### Step 3. Show jp-bench scores
-
-- Show the scores for selected models
- ```
- python show_result.py \
- --bench-name "jp_bench" \
- --mode single \
- --model-listrinna-3.6b rinna-3.6b-ppo
- ```
-- Show all scores
- ```
- python show_result.py
- ```
-
----
-
-### Other grading options
-Besides score-based single-answer grading, we also support two additional grading options based on win rates:
-- `pariwise-baseline`: run pairwise comparison against a baseline model.
-- `pairwise-all`: run pairwise comparison between all model pairs on all questions.
-
-#### Option 2: pairwise comparison against a baseline (default: gpt-3.5-turbo)
-
-- Generate GPT-4 judgments
-```
-OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
---bench-name "jp_bench" \
---mode pairwise-baseline \
---model-list rinna-3.6b rinna-3.6b-ppo \
---parallel 2
-```
-The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_pair.jsonl`
-
-- Show results
-```
-python show_result.py \
---bench-name "jp_bench" \
---mode pairwise-baseline
-```
-
-#### Option 3: Run GPT-4 judge with all pair comparisons
-
-Another option is to run pairwise comparisons on all possible pairs.
-This could be more expensive when #models increases, but it gives you a more comprehensive information.
-
-```
-OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
---bench-name "jp_bench" \
---mode pairwise-all \
---model-list [LIST-OF-MODEL-ID] \
---parallel [num-concurrent-api-call]
-```
-
-```
-python show_result.py \
---bench-name "jp_bench" \
---mode pairwise-all
-```
-
-
-## Agreement Computation
-We released 3.3K human annotations for model responses generated by 6 models in response to 80 MT-bench questions. The dataset is available at [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
-You can use this data to compute the agreement between human and GPT-4.
-
-### Download data
-
-```
-wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/human_judgments.json
-wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/gpt4_pair_judgments.json
-```
-
-### Compute the agreement between human and GPT-4
-
-```
-python compute_agreement.py --judges gpt4-pair human --votefiles human_judgments.json gpt4_pair_judgments.json
-```
-
-## Release Plan
-Our current release contains:
-- The MT-bench questions, prompts, pre-generated answers, and pre-generated judgments.
-- The 3K expert-level human annotations.
-
-The next release will include:
-- 30K arena conversations with human votes
-
-## Citation
-
-If you find the repository helpful for your study, please consider citing the following [paper](https://arxiv.org/abs/2306.05685): "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena":
-```
-@misc{zheng2023judging,
-  title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
-  author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
-  year={2023},
-  eprint={2306.05685},
-  archivePrefix={arXiv},
-  primaryClass={cs.CL}
-}
-```