data

Latest

balhafni released this 07 Feb 13:00

Data:

We release all the datasets that we used to construct our benchmark. The zipped data file contains the following:

Blogs Authorship Corpus: The raw XML files are in the blogs directory. We preprocess the blogs and organize them by industry category. The categories we draw data from are in processed-blogs.zip.
IMDb62: The raw data is in imdb62.
Amazon Reviews: Given the large size of the Amazon 5-core reviews dataset, we release only the subset of reviews we selected for our benchmark, in amazon-reviews.

We combine, annotate, and then split the data as described in our paper. The results are in the following files:

data.json: the raw combined data examples.
data.json.annotated: the annotated data examples with linguistic features.
data.json.annotated.rst: the annotated data examples after adding the RST relations.
train.json, dev.json, test.json: the train, dev, and test splits after discretizing the data. These are the files we use to train and evaluate our models.
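
As a minimal sketch of reading the three splits, the snippet below assumes they sit together in one directory after unzipping. The release notes do not say whether each file holds a single JSON document or one JSON object per line, so the loader tries the first and falls back to the second. The names `load_examples`, `load_splits`, and `data_dir` are hypothetical and are not part of the released code.

```python
import json
from pathlib import Path


def load_examples(path):
    """Load examples from a file that is either one JSON document or JSON Lines."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: if the file is not a single JSON document,
        # treat it as JSON Lines (one JSON object per non-empty line).
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # Wrap a single top-level object so callers always get a list.
    return data if isinstance(data, list) else [data]


def load_splits(data_dir):
    """Return a dict mapping each split name to its list of examples."""
    return {
        name: load_examples(Path(data_dir) / f"{name}.json")
        for name in ("train", "dev", "test")
    }
```

Calling `load_splits("path/to/unzipped/data")` would return a dictionary keyed by `train`, `dev`, and `test`. The fields inside each example are not described here, so inspect one example before relying on any particular key.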

We also make the outputs of GPT-3.5, the baseline Pythia 1B, and the prefix Pythia 1B models publicly available. They can be found in model_outputs.
