- Let's collaborate on building this document.
- For "Access", indicate if the corpus is searchable online, needs purchasing, or freely downloadable.
- Don't worry about putting resources in alphabetical order! Just add whatever you like, and make sure you are not adding entries that someone else has already listed.
Name/link | Access | Summary |
---|---|---|
The British National Corpus | online or purchase | 100 million word collection of samples of written and spoken language from multiple sources, designed to represent a wide cross-section of British English from the late 20th century |
BNC Baby | purchase | "Baby" version of the BNC. Contains BNC sampler and Brown corpus. |
Southern Oral History Program | online | 6,000+ interviews with American Southerners available as audio, transcript, or both. |
The Michigan Corpus of Academic Spoken English | online | Transcripts of close to 200 hours of recorded speech, totaling around 1.8 million words. Recordings range from 17 to 178 minutes in length, with word counts from ~3,000 to ~30,000. The web interface allows searching and browsing the 152 transcripts. |
International Corpus of English | online | Multiple corpora representing different varieties of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. |
The Corpus of Contemporary American English | online; purchase for full access | Samples of contemporary American English collected from 1990 onward (Wikipedia lists the last entry as 2017). It currently contains more than 560 million words, divided equally among spoken language, fiction, popular magazines, newspapers, and academic texts. |
CALLHOME American English Speech | online | 120 unscripted 30-minute phone conversations between native English speakers. All calls originated in North America, although 90 of them were placed to locations outside North America. The corpus contains the speech data files with documentation describing their contents and format. The transcripts and an associated lexicon are available separately. |
Buckeye Speech Corpus | online | The Buckeye Corpus of conversational speech contains high-quality recordings of 40 speakers in Columbus, OH, conversing freely with an interviewer. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software (Xwaves and Wavesurfer). |
Name/link | Access | Summary |
---|---|---|
The Corpus of Electronic Texts | online | Contains texts from as early as the 400s AD, translated into English from the original Irish, French, Middle English, Latin, Italian, Spanish, and German. The originals are also available. A total of 19 million words in 1,638 documents on many topics. |
GitHub Issues Corpus | online | Titles, descriptions, and metadata for more than 8 million issues (like bug reports, but for GitHub repositories) created on GitHub in 2017. |
European Parliament Proceedings Parallel Corpus 1996-2011 | downloadable | This corpus is extracted from proceedings of the European Parliament. It includes 21 EU languages and was created for the purpose of machine translation. The documents include information about the speaker language. |
German Reference Corpus | online | DeReKo consists of roughly 50 billion words of contemporary German text as of 2020 and is one of the largest German-language corpora. |
The Wikipedia Corpus | online | Contains the full text of Wikipedia, over 1.9 billion words. |