Some important time-tested datasets used to build natural language processing models:
Public domain datasets with text data for use in Natural Language Processing (NLP).
Datasets (English, multilang)
Apache Software Foundation Public Mail Archives: all publicly available Apache Software Foundation mail archives as of July 11, 2011 (200 GB)
Blog Authorship Corpus: consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004, totaling 681,288 posts and over 140 million words. (298 MB)
Amazon Fine Food Reviews [Kaggle]: consists of 568,454 food reviews Amazon users left up to October 2012. Paper. (240 MB)
Amazon Reviews: Stanford collection of 35 million Amazon reviews. (11 GB)
ArXiv: all the papers on arXiv as full text (270 GB) plus source files (190 GB).
ASAP Automated Essay Scoring [Kaggle]: For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. (100 MB)
ASAP Short Answer Scoring [Kaggle]: Each of the data sets was generated from a single prompt. Selected responses average 50 words in length. Some of the responses are dependent upon source information and others are not. All responses were written by students primarily in Grade 10. All responses were hand graded and were double-scored. (35 MB)
Classification of political social media: Social media messages from politicians classified by content. (4 MB)
CLiPS Stylometry Investigation (CSI) Corpus: a corpus of student texts in two genres, essays and reviews, that is expanded yearly. It is intended primarily for stylometric research, but other applications are possible. (on request)
ClueWeb09 FACC: ClueWeb09 with Freebase annotations (72 GB)
ClueWeb12 FACC: ClueWeb12 with Freebase annotations (92 GB)