gigaword-en-5

gigaword english v5

English Gigaword, Fifth Edition

Downloaded December 14, 2016

Complete information: https://catalog.ldc.upenn.edu/LDC2011T07

processed

txt

The original format of Gigaword is a collection of XML documents.

For this version, a handcrafted text extractor was run on the entire corpus, and the outputs were concatenated into a single file. In addition, sentence splitting, word tokenization, and lowercasing were applied, with one sentence written per line and each sentence ended with </s>.
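As a rough illustration of how the one-sentence-per-line layout described above might be consumed, here is a minimal Python sketch. It assumes tokens are separated by whitespace and that `</s>` appears as the last token on each line; neither detail is confirmed beyond the description above, and the function name `iter_sentences` is hypothetical.

```python
from typing import Iterable, Iterator, List

# Sentence terminator described in this README.
END_MARKER = "</s>"


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of tokens, without the end marker.

    Assumption: tokens are whitespace-separated and the end marker,
    when present, is the final token on the line.
    """
    for line in lines:
        tokens = line.split()
        # Drop the trailing end-of-sentence marker if it is there.
        if tokens and tokens[-1] == END_MARKER:
            tokens = tokens[:-1]
        # Skip lines that are empty after removing the marker.
        if tokens:
            yield tokens


if __name__ == "__main__":
    # Made-up sample lines standing in for the real file contents.
    sample = ["the cat sat . </s>\n", "\n", "hello world </s>"]
    for sentence in iter_sentences(sample):
        print(sentence)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so the corpus is streamed line by line rather than loaded into memory at once.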