This repo contains the code, data, and models for the TMLR 2024 paper "TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks".
- [12/2] TIGERScore now supports running with llama.cpp; see the llama.cpp quantization (CPU support) section below for details.
We present 🐯 TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation over a wide spectrum of text generation tasks.
Existing automatic metrics are lagging and suffer from issues such as 1) dependency on references, 2) limitation to specific domains, and 3) lack of attribution. In contrast, TIGERScore is driven by natural language instructions and provides detailed error analysis to pinpoint the mistakes in the generated text.
Specifically, TIGERScore takes an instruction, an associated input context, and a hypothesis output that might contain errors. TIGERScore then evaluates the hypothesis output and lists the errors it finds, each consisting of an error location, aspect, explanation, and penalty score (score reduction, starting from 0). The sum of the reductions is taken as the overall rating of the output.
Experiments show that TIGERScore surpasses existing baseline metrics in correlation with human ratings on all 6 held-in tasks and 1 held-out task, achieving the highest overall performance. We hope the emergence of TIGERScore can promote research in the LLM community as a powerful, interpretable, and easy-to-use metric.
| Datasets |
|---|
| 📏 MetricInstruct |

| Models 🐯 |
|---|
| 🦙 TIGERScore-7B |
| 🦙 TIGERScore-13B |
| 🦙 TIGERScore-7B-GGUF |
| 🦙 TIGERScore-13B-GGUF |
| TIGERScore-Yi-6B |

| Other Resources |
|---|
| 🤗 TIGERScore Collections |
| 🤗 Huggingface Demo |
To directly use the tigerscore pipeline, you first need to install it as a Python package:

```bash
pip install git+https://github.com/TIGER-AI-Lab/TIGERScore.git
```
Please check that `torch.cuda.is_available()` returns `True` on your local machine.
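For example, a quick check from a Python shell:

```python
import torch

# TIGERScore loads the model on GPU by default, so CUDA must be visible to PyTorch.
print(torch.cuda.is_available())   # should print True
print(torch.cuda.device_count())   # number of visible GPUs
```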
In addition, to use TIGERScore with vLLM (detailed below), you need to manually install vLLM following the vLLM documentation.
- If your CUDA version is 12.1:

```bash
pip install vllm
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```
- If your CUDA version is 11.8:

```bash
# Replace `cp39` with your Python version (e.g., `cp38`, `cp310`, `cp311`).
pip install https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
```
If you want to use the training scripts, install the dependencies by running the following command:

```bash
pip install -r requirements.txt
```
After installation, you are ready to score text generations with the following example Python code (see `tigerscore_example_usage.ipynb` for more use cases):
```python
# gpu device setup
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# example
instruction = "Write an apology letter."
input_context = "Reason: You canceled a plan at the last minute due to illness."
hypo_output = "Hey [Recipient],\n\nI'm really sorry for ditching our plan. I suddenly got an opportunity for a vacation so I took it. I know this might have messed up your plans and I regret that.\n\nDespite being under the weather, I would rather go for an adventure. I hope you can understand my perspective and I hope this incident doesn't change anything between us.\n\nWe can reschedule our plan for another time. Sorry again for the trouble.\n\nPeace out,\n[Your Name]\n\n---"

# Load and evaluate examples with any of the backend options in 3 lines of code
from tigerscore import TIGERScorer
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B")  # on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", quantized=True)  # 4-bit quantization on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", use_vllm=True)  # vLLM on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B-GGUF", use_llamacpp=True)  # 4-bit quantization on CPU
results = scorer.score([instruction], [hypo_output], [input_context])

# print the results, a list of JSON outputs containing the automatically parsed errors
print(results)
```
The result is a list of dicts, each containing a structured error analysis:
```json
[
    {
        "num_errors": 3,
        "score": -12.0,
        "errors": {
            "error_0": {
                "location": "\"I'm really glad for ditching our plan.\"",
                "aspect": "Inappropriate language or tone",
                "explanation": "The phrase \"ditching our plan\" is informal and disrespectful. It should be replaced with a more respectful and apologetic phrase like \"cancelling our plan\".",
                "severity": "Major",
                "score_reduction": "4.0"
            },
            "error_1": {
                "location": "\"I suddenly got an opportunity for a vacation so I took it.\"",
                "aspect": "Lack of apology or remorse",
                "explanation": "This sentence shows no remorse for cancelling the plan at the last minute. It should be replaced with a sentence that expresses regret for the inconvenience caused.",
                "severity": "Major",
                "score_reduction": "4.0"
            },
            "error_2": {
                "location": "\"I would rather go for an adventure.\"",
                "aspect": "Incorrect reason for cancellation",
                "explanation": "This sentence implies that the reason for cancelling the plan was to go on an adventure, which is incorrect. The correct reason was illness. This sentence should be replaced with a sentence that correctly states the reason for cancellation.",
                "severity": "Major",
                "score_reduction": "4.0"
            }
        },
        "raw_output": "..."
    }
]
```
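As noted above, the overall rating is the (negative) sum of the per-error score reductions, so you can recompute it from the parsed fields shown in the output. A minimal sanity-check sketch:

```python
# Recompute the overall rating of the first result from its per-error reductions.
result = results[0]
total_reduction = sum(float(e["score_reduction"]) for e in result["errors"].values())
print(result["num_errors"], "errors")
print("reported score:", result["score"], "| -sum of reductions:", -total_reduction)
```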
```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", use_vllm=True)  # vLLM on GPU
```

TIGERScore supports vLLM fast inference. On a single A6000 (48GB) GPU, it takes only 0.2s-0.3s for TIGERScore-13B to score each instance.
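Since `score()` takes lists, you can also pass a whole batch in one call and let the vLLM backend handle the batched generation. A small sketch reusing the example inputs above:

```python
# Batch scoring with the vLLM backend; the example inputs are simply repeated here.
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", use_vllm=True)
instructions = [instruction] * 4
hypo_outputs = [hypo_output] * 4
input_contexts = [input_context] * 4
results = scorer.score(instructions, hypo_outputs, input_contexts)
print([r["score"] for r in results])
```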
```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", quantized=True)  # 4-bit quantization on GPU
```

By setting the initialization parameter `quantized=True`, the model is loaded in 4-bit with the Hugging Face `load_in_4bit=True` option.

Please note that although quantization reduces the memory requirement by a large margin (you can run TIGERScore on a GPU with roughly 20+ GB of memory), inference might be slower than with the original bfloat16 version. The trade-off is up to you.
```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B-GGUF", use_llamacpp=True)
```

We also provide llama.cpp (GGUF) versions of TIGERScore-7B/13B. With the GGUF versions we provide, you can run TIGERScore on pure CPU devices. It generally takes about 20s for TIGERScore-13B to score each instance.
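Usage on CPU is otherwise identical to the GPU example above; a minimal sketch:

```python
# Pure-CPU scoring through llama.cpp with the GGUF checkpoint; no CUDA required.
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B-GGUF", use_llamacpp=True)
results = scorer.score([instruction], [hypo_output], [input_context])
print(results[0]["score"])
```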
Dataset preprocessing scripts and intermediate results can be found here.
The folder `xgptscore` contains all the templates we used to query ChatGPT or GPT-4 to identify errors in the hypothesis outputs for the different tasks that TIGERScore covers. We call this API query method XGPTScore, an eXplainable Scoring method that queries GPT models.
The overall pipeline of XGPTScore is:
- We define a query template that asks the GPT models to identify errors in the hypothesis output based on the task instruction, source text, and reference text.
- We manually construct various evaluation aspects to focus on for different tasks (`./constants.py`).
- Then, by applying the templates and specifying the aspects to focus on, the GPT models are required to return the identified errors in a predefined format (such as JSON).

Check `xgptscore/README.md` for more details, including how to use our query templates with a single function `xgptscore()`.
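To give a flavor of the idea (this is only an illustration, not the repo's actual template or the `xgptscore()` API; the prompt wording, aspect list, and model choice below are our own placeholders), such a query might look roughly like:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder inputs; in the real pipeline these come from the task data and ./constants.py.
instruction = "Summarize the article in one paragraph."
source_text, reference_text, hypothesis = "...", "...", "..."
aspects = ["Factual consistency", "Fluency"]

prompt = (
    f"Task instruction: {instruction}\n"
    f"Source: {source_text}\nReference: {reference_text}\nHypothesis: {hypothesis}\n"
    f"Focusing on the aspects {aspects}, list every error in the hypothesis as JSON with "
    'fields "location", "aspect", "explanation", "severity", and "score_reduction".'
)
response = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}]
)
identified_errors = json.loads(response.choices[0].message.content)
```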
MetricInstruct consists of data from two sampling channels: the real-world channel and the synthetic channel.
- The real-world channel data is generated by the script `generate_distill_data.sh`.
- The synthetic channel data is generated by the script `generate_synthesis_distill_data.sh`.

The overall purpose of the two-channel data collection is to cover as many error types as possible in the training data so that our model generalizes better.
After collecting these data, we apply a series of heuristics to filter out bad data and to augment the data:

- Drop items that are too long, too short, badly formatted, etc. (pattern matching; see the illustrative sketch after this list).
- Prompt GPT-4 to drop items with unreasonable error analysis contents (`check_data.sh`).
- Our evaluation aspects might be limited because they are manually defined and fixed. Therefore, we generate high-quality outputs with free-form error aspects using `generate_inst_synthetic_data.sh` as a supplement to the synthetic channel.
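As an illustration of the first, pattern-matching step (the field names and thresholds here are hypothetical; the actual filtering logic lives in the scripts above):

```python
import re

def keep_item(analysis_text: str, min_chars: int = 50, max_chars: int = 4000) -> bool:
    """Illustrative filter: drop analyses that are too short, too long, or badly formatted."""
    if not (min_chars <= len(analysis_text) <= max_chars):
        return False
    # Require the expected structured fields to appear in the raw analysis text.
    required_fields = ("location", "aspect", "explanation", "score_reduction")
    return all(re.search(rf"\b{field}\b", analysis_text) for field in required_fields)

# filtered = [ex for ex in raw_items if keep_item(ex["error_analysis"])]  # hypothetical fields
```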
You can load our preprocessed data used to finetune TIGERScore-V1 directly from Hugging Face 🤗:
```python
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MetricInstruct")
```
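You can then inspect the splits and a single example as usual (the split name below is an assumption; check the dataset card):

```python
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MetricInstruct")
print(dataset)              # available splits and their columns
print(dataset["train"][0])  # assumes a "train" split; adjust if named differently
```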
We provide our training and testing scripts in the folder `finetune`, where we use:

- 🧮 `finetune_llama.sh` to finetune the model.
- `format_distill_data.sh` to transform the data into the format used for finetuning, that is, a single instruction and input context paired with an output.
- `test_llama_vllm.sh` to test the finetuned model and compute correlations as its performance.
- `eval_baseline.sh` to reproduce the baseline experiment results; see `./tigerscore/common/README.md` to install the environment.

Please check these scripts for more details of our training and testing process.
Please cite our paper if you find our data, model, or code useful.
```bibtex
@article{Jiang2023TIGERScoreTB,
  title={TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks},
  author={Dongfu Jiang and Yishan Li and Ge Zhang and Wenhao Huang and Bill Yuchen Lin and Wenhu Chen},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.00752},
  url={https://api.semanticscholar.org/CorpusID:263334281}
}
```