This is the official code repository for the paper *The Trickle-down Impact of Reward (In-)consistency on RLHF*.
The four contrast instructions benchmark datasets can be found under `data/contrast_instructions`.
Dataset | Task | Num. Examples | Link to file |
---|---|---|---|
StackExchange | Question Answering | 1188 | `stack_contrast.json` |
WMT | Machine Translation | 612 | `wmt_contrast.json` |
Twitter | Paraphrase Identification | 289 | `para_contrast.json` |
RealSumm | Summarization | 36 | `sum_contrast.json` |
Each example in the json file looks like this (example from WMT):

```json
{
    "query": "这一切,身在海外的华人华侨感受更为深刻。",
    "retrieval": "身在上海,是一种亲历才懂的情感。",
    "query_response_k": "All this, the overseas Chinese living overseas feel more deeply.",
    "query_response_j": "All of this, the overseas Chinese feel even more deeply.",
    "retrieval_response_k": "Being in Shanghai is a kind of emotion that you know.",
    "retrieval_response_j": "Being in Shanghai is a kind of emotion that can only be understood through experience."
}
```
`query` and `retrieval` correspond to the inputs of the two instructions. `*_response_k` is the human-preferred response for `query` and `retrieval`, respectively. `*_response_j` is a less preferred response that is NOT used in our reward consistency metrics.
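For reference, here is a minimal sketch of how such a file could be read. It assumes each file stores a list of examples in the format shown above; adjust the path and parsing if the released files are organized differently.

```python
import json

# Assumption: each contrast file holds a JSON list of examples shaped like the snippet above.
with open("data/contrast_instructions/wmt_contrast.json") as f:
    examples = json.load(f)

for ex in examples:
    # The two instruction inputs and their human-preferred responses.
    query, retrieval = ex["query"], ex["retrieval"]
    preferred_for_query = ex["query_response_k"]
    preferred_for_retrieval = ex["retrieval_response_k"]
    # ex["query_response_j"] and ex["retrieval_response_j"] are the less preferred
    # responses and are not used in the reward consistency metrics.
```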
We release the human preference data + splits for WMT, Twitter, and RealSumm under `data/human_preferences`. The StackExchange dataset can be found on Hugging Face datasets -- `HuggingFaceH4/stack-exchange-preferences`.
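If helpful, the StackExchange preference data can be pulled directly with the `datasets` library. This is only a sketch for loading the public corpus; it does not reproduce any splits specific to this repository.

```python
from datasets import load_dataset

# Loads the public StackExchange preference corpus from the Hugging Face Hub.
# Note: this is the raw dataset; it does not encode the splits used in our experiments.
stack_prefs = load_dataset("HuggingFaceH4/stack-exchange-preferences")
print(stack_prefs)  # inspect the available splits and features
```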
WIP: we are still cleaning and organizing the code for release. Please reach out to lshen30[at]jhu.edu and sihaoc[at]cis.upenn.edu with any questions.
```bibtex
@article{shen2023trickle,
  title={The Trickle-down Impact of Reward (In-)consistency on RLHF},
  author={Lingfeng Shen and Sihao Chen and Linfeng Song and Lifeng Jin and Baolin Peng and Haitao Mi and Daniel Khashabi and Dong Yu},
  year={2023},
  journal={arXiv preprint arXiv:2309.16155},
  url={https://arxiv.org/pdf/2309.16155.pdf}
}
```