A list of corpora and corpus-related tools

  • Let's collaborate on building this document.
  • For "Access", indicate whether the corpus is searchable online, must be purchased, or is freely downloadable.
  • Don't worry about putting resources in alphabetical order! Just add whatever you like, and make sure you are not adding entries that someone else has already listed.

Corpora mentioned in Gries & Newman

| Name/link | Access | Summary |
| --- | --- | --- |
| The British National Corpus | online or purchase | A 100-million-word collection of samples of written and spoken language from multiple sources, designed to represent a wide cross-section of British English from the late 20th century. |
| BNC Baby | purchase | "Baby" version of the BNC. Contains the BNC sampler and the Brown corpus. |
| Southern Oral History Program | online | 6,000+ interviews with American Southerners, available as audio, transcript, or both. |
| The Michigan Corpus of Academic Spoken English | online | Transcripts of close to 200 hours of spoken recordings, totaling around 1.8 million words. Recordings range from 17 to 178 minutes in length, with word counts from ~3,000 to ~30,000. The web interface allows searching and browsing the 152 transcripts. |
| International Corpus of English | online | Multiple corpora representing different varieties of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. |
| The Corpus of Contemporary American English | online; purchase for full access | Samples of contemporary American English from 1990 onward (Wikipedia lists the last entry as 2017). It currently contains more than 560 million words, equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. |
| CALLHOME American English Speech | online | 120 unscripted 30-minute phone conversations between native English speakers. All calls originated in North America, although 90 of them were placed to locations outside North America. The corpus contains the speech data files, with documentation describing their contents and format. The transcripts and an associated lexicon are available separately. |
| Buckeye Speech Corpus | online | High-quality recordings of conversational speech from 40 speakers in Columbus, OH, conversing freely with an interviewer. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software (Xwaves and Wavesurfer). |

Additional corpora

| Name/link | Access | Summary |
| --- | --- | --- |
| The Corpus of Electronic Texts | online | Texts from as early as the 400s AD, in English translation from the original Irish, French, Middle English, Latin, Italian, Spanish, and German; the originals are also available. A total of 19 million words in 1,638 documents covering many topics. |
| GitHub Issues Corpus | online | Titles, descriptions, and metadata for more than 8 million issues (like bug reports, but for GitHub repositories) created on GitHub in 2017. |
| European Parliament Proceedings Parallel Corpus 1996-2011 | downloadable | Extracted from the proceedings of the European Parliament. It includes 21 EU languages and was created for machine translation. The documents include information about the speaker's language. |
| German Reference Corpus | online | DeReKo consists of over 50 billion words of contemporary German text as of 2020 and is one of the largest German-language corpora. |
| The Wikipedia Corpus | online | Contains the full text of Wikipedia, over 1.9 billion words. |