Skip to content

Latest commit

 

History

History
51 lines (39 loc) · 1.36 KB

CHANGELOG.md

File metadata and controls

51 lines (39 loc) · 1.36 KB

Changelog

[2.1.0] - 2022-12-12

Added

  • m2 files
  • API access to sentence-split and tokenized version of source and target texts

Fixed

  • Many punctuation whitespace fixes -- contributed by @danmysak
  • Mismatch in length of sentence-split source and targets in some cases.

[2.0.0] - 2022-11-09

Added

  • 861 new documents (13,020 new sentences)!
  • Detailed annotations (22 error categories vs. 4 categories in v1)
  • GEC-only annotations
  • Multiple annotators per document (as indicated by doc.meta.annotator_id)

[1.3.0.dev0] - 2022-01-17

Added

  • Annotations may indicate newline insertion/deletion by using the "\n" token

Changed

  • Fix annotations in ~10 docs (lists, tables, newlines)
  • Sentence-split source and target files are now guaranteed to have the same number of lines

[1.2.1] - 2021-05-25

Changed

  • Fixed bug with is_sensitive metadata

[1.2.0] - 2021-05-25

Added

  • Sentence-level aligned data
  • Tokenized doc-level and sentence-level data

[1.1.1] - 2021-04-09

Added

  • Corpus.get_doc() method to find a document by id.

[1.1.0] - 2021-02-05

Added

  • is_sensitive metadata flag to mark documents that contain profanity.
  • stats.txt contains detailed dataset statistics

[1.0.0] - 2021-01-20

Added

  • 1,011 annotated documents (20,715 sentences)
  • A Python package, ua-gec to work with annotations