English Gigaword, Fifth Edition
Downloaded December 14, 2016
Complete information: https://catalog.ldc.upenn.edu/LDC2011T07
The original format of Gigaword is a collection of XML documents.
For this version, a handcrafted text extractor was run on the entire
corpus, and the outputs were concatenated together into a single
file. Furthermore, sentence splitting, word tokenization, and
lowercasing were applied, with each sentence written on its own line and
ended with </s>.
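
As a rough illustration of the layout described above, the following
sketch reads such a file one sentence at a time and drops the trailing
</s> marker. It assumes that tokens are separated by whitespace and that
</s> is the last token on each line; the function name iter_sentences
and the sample lines are invented for this example and are not taken
from the corpus.

```python
import io
from typing import Iterable, Iterator, List

EOS = "</s>"  # end-of-sentence marker described in this README


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of tokens, without the trailing </s>.

    Assumes whitespace-separated tokens, one sentence per line.
    """
    for line in lines:
        tokens = line.split()
        # Drop the end-of-sentence marker if it is present.
        if tokens and tokens[-1] == EOS:
            tokens.pop()
        # Skip lines that are empty after removing the marker.
        if tokens:
            yield tokens


# Invented sample lines in the assumed format (not real corpus text).
sample = io.StringIO("the cat sat on the mat . </s>\nit was raining . </s>\n")
sentences = list(iter_sentences(sample))
```

For the actual corpus file, an open file object can be passed in place
of the sample, so the single large file is streamed line by line rather
than loaded into memory at once.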