A list of corpora and corpus-related tools

Let's collaborate on building this document.
For "Access", indicate if the corpus is searchable online, needs purchasing, or freely downloadable.
Don't worry about putting resources in alphabetical order! Just add whatever you like, and make sure you are not adding entries that someone else has already listed.

Corpora mentioned in Gries & Newman

Name/link	Access	Summary
The British National Corpus	online or purchase	100-million-word collection of samples of written and spoken language from multiple sources, designed to represent a wide cross-section of British English from the late 20th century.
BNC Baby	purchase	"Baby" version of the BNC. Contains BNC sampler and Brown corpus.
Southern Oral History Program	online	6,000+ interviews with American Southerners available as audio, transcript, or both.
The Michigan Corpus of Academic Spoken English	online	The corpus consists of transcripts of close to 200 hours of spoken recordings, totaling around 1.8 million words. Recordings range in length from 17 to 178 minutes, with word counts ranging from ~3,000 to ~30,000. The web interface allows searching and browsing the 152 transcripts.
International Corpus of English	online	Multiple corpora representing different varieties of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989.
The Corpus of Contemporary American English	online; purchase for full access	Samples of contemporary American English collected from 1990 onward (Wikipedia lists the most recent entry as 2017). It currently contains more than 560 million words, equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.
CALLHOME American English Speech	online	Consists of 120 unscripted 30-minute phone conversations between native English speakers. All calls originated in North America, although 90 of them were placed to locations outside North America. The corpus contains the speech data files with documentation describing their contents and format. The transcripts and an associated lexicon are available separately.
Buckeye Speech Corpus	online	The Buckeye Corpus of conversational speech contains high-quality recordings of 40 speakers in Columbus, OH, conversing freely with an interviewer. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software (Xwaves and Wavesurfer).

Additional corpora

Name/link	Access	Summary
The Corpus of Electronic Texts	online	Contains texts from as early as the 400s AD, translated into English from the original Irish, French, Middle English, Latin, Italian, Spanish, and German. The originals are also available. A total of 19 million words in 1,638 documents covering many topics.
GitHub Issues Corpus	online	Titles, descriptions, and metadata for more than 8 million issues (like bug reports, but for GitHub repositories) created on GitHub in 2017.
European Parliament Proceedings Parallel Corpus 1996-2011	downloadable	This corpus is extracted from the proceedings of the European Parliament. It includes 21 EU languages and was created for machine translation research. The documents include information about each speaker's language.
German Reference Corpus	online	DeReKo consists of roughly 50 billion words of contemporary German-language text as of 2020 and is one of the largest German-language corpora.
The Wikipedia Corpus	online	Contains the full text of Wikipedia, over 1.9 billion words.
