Data:
We release all the datasets that we used to construct our benchmark. The zipped data file contains the following:
- Blogs Authorship Corpus: The raw XML files are in the `blogs` directory. We preprocess the blogs and organize them by industrial category; the categories we use are in `processed-blogs.zip`.
- IMDb62: The raw data is in `imdb62`.
- Amazon Reviews: Given the large size of the Amazon 5-core reviews dataset, we only release the subset of reviews selected for our benchmark, in `amazon-reviews`.
We combine, annotate, and then split the data as described in our paper. The resulting files are:
- `data.json`: the raw combined data examples.
- `data.json.annotated`: the annotated data examples with linguistic features.
- `data.json.annotated.rst`: the annotated data examples after adding the RST relations.
- `train.json`, `dev.json`, `test.json`: the train, dev, and test splits after discretizing the data. These are the files we use to train and evaluate our models.
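The split files can be read with the standard `json` module. Below is a minimal sketch, assuming each split file is a JSON array of example objects; the `"text"` and `"label"` field names are illustrative placeholders, not taken from the release (here a tiny synthetic `train.json` is written first so the snippet is self-contained):

```python
import json
from pathlib import Path

# Illustrative only: we assume each split file (train.json, dev.json,
# test.json) is a JSON array of example objects. The "text" and "label"
# field names below are hypothetical.
synthetic_split = [{"text": "a sample document", "label": 0}]
Path("train.json").write_text(json.dumps(synthetic_split))

def load_split(path):
    """Load one split file into a list of example dicts."""
    with open(path) as f:
        return json.load(f)

train = load_split("train.json")
print(len(train))  # number of examples in the split
```

Inspect a few entries of the actual files to confirm the field names before building a data loader around them.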
We also make the outputs of GPT-3.5, the baseline Pythia 1B, and the prefix Pythia 1B models publicly available; they can be found in `model_outputs`.