- m2 files
- API access to sentence-split and tokenized version of source and target texts
- Many punctuation whitespace fixes -- contributed by @danmysak
- Mismatch in length of sentence-split source and targets in some cases.
- 861 new documents (13,020 new sentences)!
- Detailed annotations (22 error categories vs. 4 categories in v1)
- GEC-only annotations
- Multiple annotators per document (as indicated by `doc.meta.annotator_id`)
- Annotations may indicate newline insertion/deletion by using the "\n" token
- Fixed annotations in ~10 documents (lists, tables, newlines)
- Sentence-split source and target files are now guaranteed to have the same number of lines
- Fixed bug with `is_sensitive` metadata
- Sentence-level aligned data
- Tokenized doc-level and sentence-level data
- `Corpus.get_doc()` method to find a document by id.
- `is_sensitive` metadata flag to mark documents that contain profanity.
- `stats.txt` contains detailed dataset statistics
- 1,011 annotated documents (20,715 sentences)
- A Python package, `ua-gec`, to work with annotations
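
Several entries above concern sentence-level alignment: sentence-split source and target files are expected to have the same number of lines. The following is a minimal sketch of how one might check that guarantee on a pair of files using only the standard library. The helper names `read_sentences` and `align_sentences` are hypothetical and are not part of the `ua-gec` package; the file layout (one sentence per line, UTF-8) is an assumption.

```python
def read_sentences(path):
    """Read a sentence-split file, assuming one sentence per line (UTF-8)."""
    with open(path, encoding="utf-8") as f:
        # Strip only the line terminator so in-sentence whitespace is kept.
        return [line.rstrip("\n") for line in f]


def align_sentences(source_lines, target_lines):
    """Pair source and target sentences, failing loudly on a length mismatch."""
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line count mismatch: {len(source_lines)} source vs "
            f"{len(target_lines)} target"
        )
    # Equal lengths, so zip() drops nothing.
    return list(zip(source_lines, target_lines))
```

Raising on a mismatch, rather than silently truncating with `zip()`, makes a misaligned pair of files visible immediately instead of producing shifted sentence pairs.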