Automatic Evaluation Metric described in the paper Difficulty-Aware Machine Translation Evaluation (ACL 2021).
@inproceedings{zhan-etal-2021-difficulty,
title = "Difficulty-Aware Machine Translation Evaluation",
author = "Zhan, Runzhe and Liu, Xuebo and Wong, Derek F. and Chao, Lidia S.",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
year = "2021",
publisher = "Association for Computational Linguistics",
pages = "26--32"
}
This repo wouldn't be possible without the awesome BERTScore, bert, fairseq, and transformers.
- Python version >= 3.6
- PyTorch version >= 1.0.0
Install it from source:
git clone https://github.com/NLP2CT/Difficulty-Aware-MT-Evaluation
cd Difficulty-Aware-MT-Evaluation
pip install --editable .
This package preserves the original functionality of BERTScore (v0.3.7) and can coexist with an existing BERTScore installation.
Parameters | Descriptions |
---|---|
--with_diff | Score multiple MT systems using difficulty information. Without this flag, the original BERTScore implementation is used. |
--cand_list | Paths of system output files. |
--save_path | Path of the result file. |
Please refer to BERTScore for other parameters.
To reproduce the WMT19 En-De Top-6 scoring results, use the provided example files and the CLI tool as follows:
WMT19_DATA_PATH=example_data/wmt19-ende
mkdir -p wmt2019_res/en-de/topk
da-bert-score --with_diff --batch_size 256 --lang de --ref ${WMT19_DATA_PATH}/ref/newstest2019-ende-ref.de \
--cand_list ${WMT19_DATA_PATH}/sys/en-de/newstest2019.Facebook_FAIR.6862.en-de ${WMT19_DATA_PATH}/sys/en-de/newstest2019.Microsoft-WMT19-sentence_document.6974.en-de ${WMT19_DATA_PATH}/sys/en-de/newstest2019.Microsoft-WMT19-document-level.6808.en-de ${WMT19_DATA_PATH}/sys/en-de/newstest2019.MSRA.MADL.6926.en-de ${WMT19_DATA_PATH}/sys/en-de/newstest2019.UCAM.6731.en-de ${WMT19_DATA_PATH}/sys/en-de/newstest2019.NEU.6763.en-de \
--save_path wmt2019_res/en-de/topk
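When the list of system outputs is long, the same invocation can be assembled from Python. The following is a minimal sketch, assuming the `da-bert-score` CLI is installed as described above. The helper name `build_da_command` is hypothetical and not part of this package; it uses only the flags shown in this README.

```python
import shlex
from pathlib import Path


def build_da_command(ref, cand_list, save_path, lang="de", batch_size=256):
    """Assemble a da-bert-score invocation as an argument list.

    Hypothetical helper for illustration; it only arranges flags documented
    in this README and does not call the tool itself.
    """
    return [
        "da-bert-score",
        "--with_diff",
        "--batch_size", str(batch_size),
        "--lang", lang,
        "--ref", str(ref),
        "--cand_list", *[str(c) for c in cand_list],
        "--save_path", str(save_path),
    ]


data = Path("example_data/wmt19-ende")
# The six system outputs used in the shell example above.
system_names = [
    "Facebook_FAIR.6862",
    "Microsoft-WMT19-sentence_document.6974",
    "Microsoft-WMT19-document-level.6808",
    "MSRA.MADL.6926",
    "UCAM.6731",
    "NEU.6763",
]
systems = [data / "sys" / "en-de" / f"newstest2019.{name}.en-de" for name in system_names]

cmd = build_da_command(
    ref=data / "ref" / "newstest2019-ende-ref.de",
    cand_list=systems,
    save_path="wmt2019_res/en-de/topk",
)

# Print the command for inspection before running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

After checking the printed command, it can be executed with `subprocess.run(cmd, check=True)`.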
To reproduce other results, please download the WMT19 raw data from the official website.
This implementation follows BERTScore's default behavior (models, layers) for each language.
We will keep exploring the possible variants of DA-BERTScore.
The default parameter settings reproduce the results reported in the paper. To achieve better correlation results across multiple languages, you can try one variant by enabling the parameters --ref_diff --softmax_norm --range_one. The explanations are as follows:
Parameters | Descriptions |
---|---|
--ref_diff | The weight of a hypothesis word may be unknown if the reference contains no identical word. When this parameter is enabled, the hypothesis word uses the weight of the reference word with the highest similarity score. |
--softmax_norm | Use the softmax function to smooth the distribution of difficulty weights. |
--range_one | Scale the score to the range [0,1]. |
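The ideas behind `--ref_diff` and `--softmax_norm` can be illustrated with a small standalone sketch. This is not the code used in this repository: the function names, the toy similarity matrix, and the toy weights are all made up for illustration, and the package's actual normalization details may differ.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def hypothesis_weights(similarity, ref_weights):
    """Give each hypothesis token the weight of its most similar reference token.

    similarity[i][j] is the similarity between hypothesis token i and
    reference token j (illustrative input, not the package's data layout).
    """
    picked = []
    for row in similarity:
        best_j = max(range(len(row)), key=row.__getitem__)
        picked.append(ref_weights[best_j])
    return picked


# Toy values, invented for this example only.
ref_weights = [0.1, 0.7, 0.2]  # one difficulty weight per reference token
similarity = [
    [0.9, 0.2, 0.1],  # hypothesis token 0 is closest to reference token 0
    [0.3, 0.4, 0.8],  # hypothesis token 1 is closest to reference token 2
]

hyp_weights = hypothesis_weights(similarity, ref_weights)
smoothed = softmax(ref_weights)

print(hyp_weights)  # [0.1, 0.2]
print([round(w, 3) for w in smoothed])
```

In this toy case the softmax output sums to 1 and keeps the ordering of the weights while narrowing the gap between the largest and smallest values, which is the smoothing effect the table describes.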