PriMeSRL-Eval

This repository contains the code for the EACL 2023 Findings paper PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation

This repository includes our proposed evaluation of SRL quality together with the official evaluations from CoNLL2005 (span evaluation) and CoNLL2009 (head evaluation). We use an SRL format that extends the CoNLL-U format (see the CoNLL-U+SRL format section below).

We have the following pipelines:

  • Proposed evaluation: uses our extended CoNLL-U+SRL format.
  • CoNLL2005 evaluation: CoNLL-U conversion to CoNLL2005 format for the official scripts.
  • CoNLL2009 evaluation: CoNLL-U conversion to CoNLL2009 format for the official scripts.

Data

Usage

  • (Recommended) Create a virtual env, e.g.
    • conda create -n eval python=3.9
    • conda activate eval
  • Install requirements: pip install -r requirements.txt
  • (Recommended) Run the unit tests (takes about 2 minutes): pytest tests
    • This downloads the conll09 and conll05 evaluation scripts.
  • Run evaluation script:
    • python run_evaluations.py --gold-conllu <file> --pred-conllu <file> --format <str> --output-folder <folder>
      • --format can take the following values: [conll09, conll05, conllu]
    • See the data and tests/data folders for examples.
    • Example usage (a batch-run sketch follows this list):
      • python run_evaluations.py -g data/conll09/gold_file -p data/conll09/pred_file -f conll09 -o tmp
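
For running several gold/prediction pairs in one go, a small wrapper along the lines of the sketch below can help. It relies only on the flags documented above; the file paths and the tmp output folder are the illustrative values from the example, and the wrapper itself is not part of the repository.

import subprocess
from pathlib import Path

# (gold file, prediction file, input format) triples to evaluate.
# The paths below are the conll09 example from this README.
RUNS = [
    ("data/conll09/gold_file", "data/conll09/pred_file", "conll09"),
]

for gold, pred, fmt in RUNS:
    out_dir = Path("tmp") / fmt
    out_dir.mkdir(parents=True, exist_ok=True)
    # Uses the documented flags: -g/--gold-conllu, -p/--pred-conllu,
    # -f/--format, -o/--output-folder.
    subprocess.run(
        ["python", "run_evaluations.py",
         "-g", gold, "-p", pred, "-f", fmt, "-o", str(out_dir)],
        check=True,
    )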

The evaluation script reports the results from the official CoNLL scripts and from our proposed evaluation method. Please see the paper for guidance on interpreting and comparing these numbers.

Examples from our paper

We have encoded all examples in our paper as unit tests. See tests/README.md for how to match up numbers in the tests with those presented in the paper.

In short, the input data for the tables are in tests/data/<evaluation>/input, with a naming scheme similar to the examples in the tables. The evaluation results presented in the paper are stored in these folders with this structure: tests/data/<evaluation>/expected/compare-*/comparison-results-[official-conll|proposed].csv.

We provide a script to format these comparison results. Example usages:

  • python format_results.py -c tests/data/sense/expected/compare-sense_test-sense_pred_p1/comparison-results-official-conll.csv
  • python format_results.py -c tests/data/sense/expected/compare-sense_test-sense_pred_p1/comparison-results-proposed.csv
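
If you prefer to inspect a comparison CSV directly (e.g. from a notebook), a minimal sketch such as the one below works with only the standard library; it makes no assumptions about the column names.

import csv
from pprint import pprint

# One of the comparison-result CSVs shipped with the unit tests.
path = "tests/data/sense/expected/compare-sense_test-sense_pred_p1/comparison-results-proposed.csv"

with open(path, newline="", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))

# Each row is printed as a mapping of column name -> value.
for row in rows:
    pprint(dict(row))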

CoNLL-U+SRL format

We use an extended CoNLL-U format that replaces the MISC column with the additional columns below (an illustrative reader sketch follows the list):

  • ISPRED - Flag Y when the token is a predicate, _ otherwise.
  • PREDSENSE - Predicate sense from PropBank.
  • ARGS - One column per predicate, in order of appearance, i.e. the first argument column contains the arguments for the first predicate.
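
As an illustration of how these columns can be consumed, here is a minimal reader sketch. It assumes tab-separated lines with the nine standard CoNLL-U columns up to DEPS, followed by ISPRED, PREDSENSE, and one ARGS column per predicate, with '#' comment lines and blank-line sentence separators as in plain CoNLL-U. The function and class names are illustrative, not part of the repository's code; see the data folder for authoritative examples.

from dataclasses import dataclass, field

@dataclass
class Predicate:
    token_index: int                   # 0-based index of the predicate token
    sense: str                         # PropBank sense from PREDSENSE
    arguments: dict = field(default_factory=dict)  # token index -> role label

def read_conllu_srl(path):
    """Yield (tokens, predicates) pairs, one per sentence."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("#"):           # sentence-level comments
                continue
            if not line:                       # blank line ends a sentence
                if rows:
                    yield _build_sentence(rows)
                    rows = []
                continue
            rows.append(line.split("\t"))
    if rows:
        yield _build_sentence(rows)

def _build_sentence(rows):
    tokens = [r[1] for r in rows]              # FORM is the second column
    predicates = []
    for i, r in enumerate(rows):
        if r[9] == "Y":                        # ISPRED (in place of MISC)
            predicates.append(Predicate(i, r[10]))   # PREDSENSE
    for i, r in enumerate(rows):
        for p_idx, label in enumerate(r[11:]): # one ARGS column per predicate
            if label != "_" and p_idx < len(predicates):
                predicates[p_idx].arguments[i] = label
    return tokens, predicates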

Contribution

To contribute to this repository, particularly new unit tests covering other interesting error combinations, please open an issue for discussion before submitting a PR.

Citation

@misc{jindal2022primesrleval,
    title={PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation},
    author={Ishan Jindal and Alexandre Rademaker and Khoi-Nguyen Tran and Huaiyu Zhu and Hiroshi Kanayama and Marina Danilevsky and Yunyao Li},
    year={2022},
    eprint={2210.06408},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
