This repository hosts F1000RD, the accompanying dataset for the article Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review, Computational Linguistics (2022). It is the first openly licensed, multi-domain corpus of publications, their revisions and peer reviews from an open reviewing platform.
If you are interested in the intertextual graph data model introduced in the paper, please have a look at the intertext-graph repository: https://github.com/UKPLab/intertext-graph.git
Abstract: Peer review is a key component of the publishing process in most fields of science. The increasing submission rates put a strain on reviewing quality and efficiency, motivating the development of applications to support the reviewing and editorial work. While existing NLP studies focus on the analysis of individual texts, editorial assistance often requires modeling interactions between pairs of texts -- yet general frameworks and datasets to support this scenario are missing. Relationships between texts are the core object of the intertextuality theory -- a family of approaches in literary studies not yet operationalized in NLP. Inspired by prior theoretical work, we propose the first intertextual model of text-based collaboration, which encompasses three major phenomena that make up a full iteration of the review-revise-and-resubmit cycle: pragmatic tagging, linking and long-document version alignment. While peer review is used across the fields of science and publication formats, existing datasets solely focus on conference-style review in computer science. Addressing this, we instantiate our proposed model in the first annotated multi-domain corpus in journal-style post-publication open peer review, and provide detailed insights into the practical aspects of intertextual annotation. Our resource is a major step towards multi-domain, fine-grained applications of NLP in editorial support for peer review, and our intertextual framework paves the path for general-purpose modeling of text-based collaboration.
The corpus is based on data from the open reviewing platform F1000Research. The data used in the article consists of two parts: the study sample used in the annotation studies, and the full crawl of F1000Research used for reference. This repository contains the study sample and the accompanying analysis code; the full crawl used in this work is available on demand. Our data comes in two formats: JATS XML (full crawl only) is used to generate Intertextual Graphs (ITG) -- our novel graph-based data model well suited for intertextual analysis, backed by the separately released intertext_graph library (https://github.com/UKPLab/intertext-graph.git). Since working with ITGs requires this external library, we also provide the data in a simple CSV-based format (study sample only) to facilitate analysis and task-specific applications.
The repository also hosts the annotation guidelines used in the studies and the draft datasheet for F1000RD.
```
analysis/
    analysis_util.py   <- utility functions for analysing the data
    analytics.ipynb    <- code to reproduce the analysis from the article
    exp_linker.py      <- simple regex-based explicit linker used in the paper
    exp_patterns.tsv   <- auxiliary patterns for the explicit linker
data/
    simple/            <- one file per task / analysis type
    itg/               <- one folder per F1000Research submission
        X-XX/          <- submission folder
            {v1, v2, v3...}.json  <- ITGs for the submission versions
            diff_...   <- automatically produced alignments between v1 and v2, if available
            reviews/   <- ITGs for the reviews of the first submission version
            linking/   <- links between the reviews and the first submission version
guidelines/            <- annotation guidelines
requirements.txt
datasheet.pdf
```
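As a minimal sketch of working with this layout, the snippet below collects the version ITG files (`v1.json`, `v2.json`, ...) per submission folder under `data/itg/`. It demonstrates the structure on a throwaway directory; the submission id `1-1` and the empty graph contents are invented placeholders, not real data.

```python
import json
import re
import tempfile
from pathlib import Path

def list_submission_versions(itg_root: Path) -> dict:
    """Map each submission folder to its version ITG files, sorted by version number."""
    versions = {}
    for submission in sorted(p for p in itg_root.iterdir() if p.is_dir()):
        version_files = sorted(
            submission.glob("v*.json"),
            key=lambda p: int(re.match(r"v(\d+)", p.stem).group(1)),
        )
        versions[submission.name] = [p.name for p in version_files]
    return versions

# Demonstrate on a temporary directory that mimics the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data" / "itg"
    sub = root / "1-1"  # placeholder submission folder, not a real id
    (sub / "reviews").mkdir(parents=True)
    for name in ("v1.json", "v2.json"):
        (sub / name).write_text(json.dumps({"nodes": [], "edges": []}))
    demo_versions = list_submission_versions(root)

print(demo_versions)  # {'1-1': ['v1.json', 'v2.json']}
```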
If you want to use the ITG representation of the data in your experiments (e.g. the `v1.json` files in the submission directories), have a closer look at our intertext_graph library. It is a general-purpose library that implements a structured data model for representing documents, making it easy to work with document structure, relations and cross-document links. Based on this library, the function `get_mega_itg()` in `analysis/analysis_util.py` builds an intertextual graph object from a submission directory with the complete pragmatics, linking and versioning data.
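A usage sketch, under the assumption that `get_mega_itg()` takes the path to a single submission folder and that the script is run from the repository root; see `analysis/analysis_util.py` for the authoritative signature.

```python
import sys
from pathlib import Path

# Assumption: get_mega_itg() accepts a submission directory such as
# data/itg/X-XX/ ("X-XX" stands in for a real submission id).
sys.path.append("analysis")

submission_dir = Path("data/itg") / "X-XX"  # placeholder submission id
if submission_dir.exists():
    from analysis_util import get_mega_itg
    mega_itg = get_mega_itg(submission_dir)
```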
The table `data/simple/imp_links.csv` contains all implicit linking annotations. Each row holds the information for one pair of nodes; in the implicit linking data, these are always sentence pairs. The columns `imp_a` and `imp_b` show the annotations from the main annotators in the main annotation study. The columns `imp_a_re` and `imp_b_re` show the annotations from the main annotators in the re-annotation study. The columns `imp_c_e` and `imp_d_e` show the annotations from the expert annotators. For all annotation columns, `1` indicates that the annotators marked a sentence pair as linked, and `0` as non-linked.
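Given these column semantics, raw pairwise agreement between two annotation columns can be computed as sketched below. The rows are invented for illustration and do not come from the actual dataset.

```python
# Toy rows mimicking the label columns of data/simple/imp_links.csv
# (values invented for illustration; 1 = linked, 0 = non-linked).
rows = [
    {"imp_a": 1, "imp_b": 1, "imp_a_re": 1, "imp_b_re": 0},
    {"imp_a": 0, "imp_b": 0, "imp_a_re": 0, "imp_b_re": 0},
    {"imp_a": 1, "imp_b": 0, "imp_a_re": 1, "imp_b_re": 0},
    {"imp_a": 0, "imp_b": 0, "imp_a_re": 1, "imp_b_re": 1},
]

def raw_agreement(rows, col_x, col_y):
    """Fraction of sentence pairs on which two annotation columns agree."""
    matches = sum(r[col_x] == r[col_y] for r in rows)
    return matches / len(rows)

main_agreement = raw_agreement(rows, "imp_a", "imp_b")      # 0.75
re_agreement = raw_agreement(rows, "imp_a_re", "imp_b_re")  # 0.5
```

For reporting inter-annotator agreement in a study, a chance-corrected measure (e.g. Cohen's kappa) would be preferable to raw agreement; this sketch only illustrates reading the columns.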
The data was split by submission, ensuring that there is no overlap between the train, dev and test sets. Please find the split information in `data/simple/splits.csv`.
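A minimal sketch of applying the split. The column names (`submission`, `split`) and the id values here are hypothetical assumptions about the CSV layout, not confirmed by this README; check `data/simple/splits.csv` for the actual schema.

```python
import csv
import io

# Hypothetical splits.csv content -- real column names and ids may differ.
splits_csv = "submission,split\n10-1,train\n10-2,dev\n10-3,test\n"
split_of = {r["submission"]: r["split"] for r in csv.DictReader(io.StringIO(splits_csv))}

# Toy implicit-link rows keyed by submission id (invented for illustration).
links = [
    {"submission": "10-1", "imp_a": 1},
    {"submission": "10-3", "imp_a": 0},
]

# Keep only rows whose submission falls into the train split.
train_links = [r for r in links if split_of[r["submission"]] == "train"]
```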
To reproduce the analysis from the paper:
- Clone this repo
- Create a fresh virtual environment (e.g. via conda)
- Install the requirements: `pip install -r requirements.txt`
- Run the `analytics.ipynb` notebook in the `analysis` folder

Note that some of the analyses require the full crawl of F1000Research.
The dataset is licensed CC-BY-SA 4.0.
If you use this data in your research, please cite:
```
@article{10.1162/coli_a_00455,
    author  = {Kuznetsov, Ilia and Buchmann, Jan and Eichler, Max and Gurevych, Iryna},
    title   = "{Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review}",
    journal = {Computational Linguistics},
    pages   = {1-38},
    year    = {2022},
    month   = {08},
    issn    = {0891-2017},
    doi     = {10.1162/coli_a_00455},
    url     = {https://doi.org/10.1162/coli\_a\_00455},
    eprint  = {https://direct.mit.edu/coli/article-pdf/doi/10.1162/coli\_a\_00455/2038043/coli\_a\_00455.pdf},
}
```
Don't hesitate to send us an e-mail or report an issue if something is broken or if you have further questions!
Contacts: Ilia Kuznetsov [email protected], Jan Buchmann [email protected]
https://www.ukp.tu-darmstadt.de/
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.