# LLM Judge

In this package, you can use Vicuna-Japanese questions and prompts to evaluate your models with LLM-as-a-judge.
To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.

## Contents
- [Install](#install)
- [Evaluate a model on jp-bench](#evaluate-a-model-on-jp-bench-vicuna-japanese)
- [Other grading options](#other-grading-options)
- [Sample Outputs](#sample-outputs)
## Install
```
git clone https://github.com/hitoshizuku7/LLM_Judge_ku.git
cd LLM_Judge_ku
pip install -e .
pip install openai anthropic ray
cd fastchat/llm_judge
```


### Evaluate a model on jp-bench (Vicuna-Japanese)

#### Step 1. Generate model answers to jp-bench questions
```
python gen_model_answer.py \
--base_model [MODEL-PATH] \
--lora_model [LORA-PATH] \
--model-id [MODEL-ID] \
--with_prompt \
--gpus [GPU_Num] \
--max_new_tokens [NUM of NEW TOKENS] \
--benchmark jp_bench
```
Arguments:
- `[MODEL-PATH]` is the path to the model weights, which can be a local folder or a Hugging Face repo ID.
- `[LORA-PATH]` is the path to the LoRA weights, if needed.
- `[MODEL-ID]` is a name you give to the model.
- `[GPU_Num]` specifies which GPU(s) to use.
- `[NUM of NEW TOKENS]` is the maximum number of new tokens to generate.


e.g.,
```
python gen_model_answer.py \
--model-path rinna/japanese-gpt-neox-3.6b-instruction-ppo \
--model-id rinna-3.6b-ppo \
--with_prompt \
--gpus 0 \
--max_new_tokens 2048 \
--benchmark jp_bench
```
The answers will be saved to `data/jp_bench/model_answer/[MODEL-ID].jsonl`.
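If you want to sanity-check the generated answers, the file is plain JSONL, so a few lines of Python are enough to inspect it (a minimal sketch; the exact fields in each record depend on the generation script, so print one record to see them):
```
import json

# Answer file produced by Step 1 (replace rinna-3.6b-ppo with your [MODEL-ID]).
answer_file = "data/jp_bench/model_answer/rinna-3.6b-ppo.jsonl"

with open(answer_file, encoding="utf-8") as f:
    answers = [json.loads(line) for line in f]

print(f"Loaded {len(answers)} answers")
print(answers[0])  # print one record to see the exact field names
```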

You can also specify `--num-gpus-per-model` for model parallelism (required for large models such as 65B) and `--num-gpus-total` to parallelize answer generation across multiple GPUs.

#### Step 2. Generate GPT-4 judgments
There are several ways to use GPT-4 as a judge, such as pairwise win-rate comparison and single-answer grading.
As in MT-bench, we recommend single-answer grading as the default mode.
This mode asks GPT-4 to grade the model's answer directly and assign it a score, without any pairwise comparison.
For each turn, GPT-4 gives a score on a scale of 10. We then compute the average score over all turns.

```
OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
--bench-name "jp_bench" \
--mode [pairwise-all, single, pairwise-baseline] \
--model-list [LIST-OF-MODEL-ID] \
--parallel [num-concurrent-api-call]
```

e.g.,
```
OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
--bench-name "jp_bench" \
--mode single \
--model-list rinna-3.6b rinna-3.6b-ppo \
--parallel 2
```
The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_single.jsonl`.
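Internally, the single-mode score for a model is simply the mean of the per-turn scores GPT-4 assigned. `show_result.py` (Step 3) computes this for you; the sketch below only illustrates the idea and assumes each judgment record has `model` and `score` fields, which is the layout FastChat uses:
```
import json
from collections import defaultdict

# Minimal sketch of the single-mode averaging; use show_result.py for the
# official numbers. Field names ("model", "score") are assumed here.
judgment_file = "data/jp_bench/model_judgment/gpt-4_single.jsonl"

scores = defaultdict(list)
with open(judgment_file, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if record["score"] >= 0:  # negative scores typically mark failed judgments
            scores[record["model"]].append(record["score"])

for model, model_scores in sorted(scores.items()):
    print(f"{model}: {sum(model_scores) / len(model_scores):.2f}")
```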

#### Step 3. Show jp-bench scores

- Show the scores for selected models
```
python show_result.py \
--bench-name "jp_bench" \
--mode single \
--model-list rinna-3.6b rinna-3.6b-ppo
```
- Show all scores
```
python show_result.py
```

---

### Other grading options
Besides score-based single-answer grading, we also support two additional grading options based on win rates:
- `pairwise-baseline`: run a pairwise comparison against a baseline model.
- `pairwise-all`: run pairwise comparisons between all model pairs on all questions.

#### Option 2: Pairwise comparison against a baseline (default: gpt-3.5-turbo)

- Generate GPT-4 judgments
```
OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
--bench-name "jp_bench" \
--mode pairwise-baseline \
--model-list rinna-3.6b rinna-3.6b-ppo \
--parallel 2
```
The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_pair.jsonl`.

- Show results
```
python show_result.py \
--bench-name "jp_bench" \
--mode pairwise-baseline
```

#### Option 3: Run GPT-4 judge with all pair comparisons

Another option is to run pairwise comparisons on all possible pairs.
This becomes more expensive as the number of models increases, but it gives you more comprehensive information.
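As a rough back-of-the-envelope estimate (a sketch assuming one judge call per model pair, per question, per turn; position-swapped comparisons would double it):
```
def estimate_pairwise_all_calls(num_models, num_questions, turns_per_question=1):
    """Rough number of judge API calls for pairwise-all:
    one comparison per model pair, per question, per turn."""
    num_pairs = num_models * (num_models - 1) // 2
    return num_pairs * num_questions * turns_per_question

# e.g., 6 models on an 80-question, single-turn benchmark:
print(estimate_pairwise_all_calls(6, 80))  # 15 pairs * 80 questions = 1200 calls
```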

```
OPENAI_API_KEY=[YOUR-KEY] python -B gen_judgment.py \
--bench-name "jp_bench" \
--mode pairwise-all \
--model-list [LIST-OF-MODEL-ID] \
--parallel [num-concurrent-api-call]
```

```
python show_result.py \
--bench-name "jp_bench" \
--mode pairwise-all
```


## Sample Outputs
```
Question:
```