title: Data
layout: default
nav_order: 7

Workshop Preparation

This page contains the datasets for the text and data mining workshop. Each collection of documents comes with contextual information, including links to the original source it was derived from. Choose one dataset to work with in class. Click on the dataset title (e.g., Adult British Fiction) to download the dataset.

Choose a dataset to work with

  1. Adult British Fiction

Fiction from the 1880s. Sample corpora assembled from Project Gutenberg by students in [Alan Liu's English 197 course](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets), Fall 2014 at UC Santa Barbara.
  1. [Watergate Scandal](data/Dataset 2 - Watergate Scandal News Coverage-20210122T202246Z-001.zip)

Dataset compiled for the Fundamentals of Text Mining workshop using the Gale Digital Scholar Lab. OCR text sourced from: The International Herald Tribune Digital Archive, The Daily Mail, The Telegraph, The Sunday Times, and the Times Digital Archive. October 2019.
  1. [Inaugural Presidential Speeches](data/Dataset 3 - Inaugural Presidential Speeches-20210122T202249Z-001.zip)

Dataset of the inaugural speeches of every US president from Washington in 1789 to Trump in 2017, compiled by [Alan Liu on DH Resources for Project Building](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora).
  1. Feeding America

The Feeding America: The Historic American Cookbook dataset contains transcribed and encoded text from 76 influential American cookbooks held by MSU Libraries Stephen O. Murray and Keelung Hong Special Collections. Features encoded within the text include, but are not limited to, recipes, types of recipes, cooking implements, and ingredients. The 76 texts were chosen from among the more than 7,000 cookbooks that MSU Libraries holds as representative of periods and themes in American cookbook history spanning the late 18th to early 20th century. Source: [Feeding America: The Historic American Cookbook Dataset. East Lansing: Michigan State University Libraries Special Collections](https://www.lib.msu.edu/feedingamericadata/)
  1. [Billboard Hits](data/Dataset 5 - #1 Billboard Hits-20210122T202254Z-001.zip)

A collection of songs from popular 20th-century artists, including The Beatles, Michael Jackson, Mariah Carey and Madonna.
  1. [19th Century Sunday School Texts](data/Dataset 6 - 19th C. Sunday School Texts-20210122T202256Z-001.zip)

The Sunday School Books in Nineteenth Century America dataset consists of 166 texts, including Sunday school books published between 1809 and 1887. The material reflects the emerging diversity of Protestant Christian denominations in the United States during that period. The texts also mark the appearance of a theologically inflected genre of juvenile literature, published by a variety of sectarian presses. More contextual information is available [here](https://digital.lib.msu.edu/projects/ssb/?action=introessay). Source: East Lansing: [Michigan State University Libraries Special Collections](https://www.lib.msu.edu/ssbdata/)

You should also download the stopword list to your local machine.

Once you've chosen your dataset and downloaded it to your local machine along with the stopword list, you're ready to begin exploring the background to the field of text and data mining. Go to Module 1 to learn the basics of text mining.
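If you would like to confirm that your downloads work before class, the following is a minimal Python sketch, not part of the workshop materials. It assumes the dataset zip contains plain `.txt` files and that the stopword list has one word per line; neither is confirmed on this page. The function names `read_texts`, `load_stopwords`, and `top_words`, and the file names in the usage comment, are hypothetical.

```python
import re
import zipfile
from collections import Counter


def read_texts(zip_path):
    """Return {member name: text} for every .txt file inside the zip archive."""
    texts = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Assumption: the documents are stored as .txt files.
            if name.lower().endswith(".txt"):
                # errors="replace" keeps going if a file is not valid UTF-8.
                texts[name] = archive.read(name).decode("utf-8", errors="replace")
    return texts


def load_stopwords(lines):
    """Build a lowercase stopword set from lines (assumed: one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def top_words(texts, stopwords, n=10):
    """Count the n most frequent words across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts.values():
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts.most_common(n)


# Example usage (file names are placeholders for your own downloads):
#   texts = read_texts("my-dataset.zip")
#   with open("stopwords.txt", encoding="utf-8") as handle:
#       stopwords = load_stopwords(handle)
#   print(top_words(texts, stopwords))
```

If the word list that comes back is dominated by meaningful terms rather than words like "the" and "and", both files were probably read correctly.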