wilkens-teaching/textmining

Text Mining the Novel

Contact

Matthew Wilkens, University of Notre Dame
CDT 30380 / ENGL 30010, Spring 2019
MW 11:00-12:15, 246 Hesburgh Library
Office hours: Th 9:00-5:00 (reserve slots), 320 Decio Hall.

Note: I'm generally in my office all day on Thursdays, but I do sometimes have a conflict or need to step out. Reservations are strongly recommended.

Summary

A technical, undergraduate-level course in quantitative and computational approaches to analyzing large bodies of text.

Description

Broadly speaking, the course covers text mining, content analysis, and basic machine learning, emphasizing (but not limited to) approaches with demonstrated value in literary studies. Students will learn how to clean and process textual corpora, extract information from unstructured texts, identify relevant textual and extra-textual features, assess document similarity, cluster and classify authors and texts using a variety of machine-learning methods, visualize the outputs of statistical models, and incorporate quantitative evidence into literary and humanistic analysis.

Most of the methods treated in the class are relevant in other fields. Students from all majors are welcome. There are no prerequisites, but some programming experience is strongly recommended. The course is taught in Python. It counts toward the Digital Humanities track of the Idzik Computing and Digital Technologies (CDT) minor and as a free elective in the Data Science minor.

Texts

Required

Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. Applied Text Analysis with Python. O'Reilly, 2018.
See the book's associated GitHub repo for code samples and related data sets.

Optional

Guttag, John V. Introduction to Computation and Programming Using Python. 2nd ed. MIT, 2016.
Useful for students without a strong background in Python. See linked MIT Press site for code samples and information about the associated EdX course.

Raschka, Sebastian and Vahid Mirjalili. Python Machine Learning. 2nd ed. Packt, 2017.
A more general-purpose textbook on machine learning. Greater emphasis on neural networks, less on working with textual data.

Objectives

The primary objective of the course is to build proficiency in applied text analysis and data mining. Students who complete the course will have knowledge of standard approaches to text analysis and will be familiar with the humanistic ends to which those approaches might be put. Secondary objectives include acquiring a basic understanding of relevant literary history, of how to integrate quantitative with qualitative evidence, and of best practices in data science project management.

Work and grading

In addition to weekly problem sets, you will be required to complete one project proposal of about 1,000 words, a brief in-class presentation, and a final project that employs computational techniques covered in the course. Overall grades will be based on the problem sets (35% in sum), proposal (7%), presentation (3%), final project (40%), and class participation (15%). You must satisfactorily complete all assignments to pass the course.

Policy statements

Attendance

Two absences (one week of meetings), no questions asked. Additional absences will lower your grade.

Late work

Late work is generally not accepted. If you find yourself in exceptional circumstances, talk to me well in advance of the deadline and we may be able to find an accommodation.

Collaboration and plagiarism

Talking to other students -- especially those in the course -- about your ideas is a good thing. Taking other people's words, code, or ideas without attribution is plagiarism and will result in honor-code-related unpleasantness. When in doubt, cite. And feel free to ask me about specific cases or problems and about the mechanics of research documentation. For references and guidelines, see the library's plagiarism and documentation sites and the university's academic code of honor.

Disabilities

Students with documented disabilities who need accommodations or have questions should speak with me directly and contact Sara Bea Disability Services.

Schedule

Detailed assignments will be provided separately.

Note: All dates and assignments are subject to change. The schedule after spring break is tentative due to weather-related changes. Specific day-to-day assignments will be updated as the time draws nearer.

Week 1

  • Weds, 1/16. Introduction, background, mechanics.

Week 2

  • Mon, 1/21. No class (MLK Day).
  • Weds, 1/23.
    • Read Bengfort et al., chapter 1 (language and computation).
    • Due: Implement parse_gender function as described (pp. 10-12). Submit output for three literary texts from the class corpus (located on GitHub). [Answer].
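As a starting point for this kind of exercise, here is a minimal sketch of a sentence-level gender tagger. It is not the book's implementation: the word lists, the regex-based sentence splitter, the helper name `genderize`, and the return format are all assumptions made for illustration. The splitter is deliberately naive (it will break on abbreviations such as "Mr."), so a proper tokenizer would be a sensible replacement on real novels.

```python
import re
from collections import Counter

# Illustrative word lists (an assumption, not taken from the textbook).
MALE_WORDS = {"he", "him", "his", "himself", "man", "men", "boy", "boys",
              "father", "brother", "husband", "son", "sir", "mr"}
FEMALE_WORDS = {"she", "her", "hers", "herself", "woman", "women", "girl",
                "girls", "mother", "sister", "wife", "daughter", "mrs", "miss"}


def genderize(words):
    """Label a tokenized sentence by which gendered word lists it touches."""
    has_male = any(w in MALE_WORDS for w in words)
    has_female = any(w in FEMALE_WORDS for w in words)
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "unknown"


def parse_gender(text):
    """Return (sentence_counts, word_counts), each keyed by gender label."""
    sentence_counts = Counter()
    word_counts = Counter()
    # Naive split: break after ., !, or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        label = genderize(words)
        sentence_counts[label] += 1
        word_counts[label] += len(words)
    return sentence_counts, word_counts


# Tiny made-up sample, only to show the call pattern.
sample = "She walked home. He stayed with his father. It rained."
sentences, words = parse_gender(sample)
for label in sentences:
    print(label, sentences[label], words[label])
```

To run this on a text from the class corpus, read the file into a string and pass it to `parse_gender`; the per-label sentence and word counts can then be turned into percentages for comparison across texts.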

Week 3

  • Mon, 1/28.
    • Read chapter 2 (corpora).
  • Weds, 1/30. No class (university closed due to severe weather).

Week 4

Week 5

Weeks 6-7. No formal classes (instructor travel).

Week 8

Week 9. Spring break. No class meetings.

Week 10

Week 11

Week 12

Independent work week (project proposal).

Use both class meetings this week to make progress on your final project. Attendance is optional, but strongly encouraged. Talk to your peers about your ideas and the state of your work. Listen to what they suggest and offer suggestions to them. The aim is to end the week with a solid idea of what you'd like to do, how you can do it, and what you'll need to get it done.

Week 13

  • Mon, 4/8.
    • Read chapter 9 (networks).
    • Due: project proposal. 300-500 words covering your research question, corpus, methods, and hypotheses.
  • Weds, 4/10.
    • Due: Reimplement the gender/nationality classification system using keyphrase, n-gram, entity, or other context-aware features as described in chapter 7. Evaluate the performance of the new model relative to the unigram original.
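One way to frame the unigram versus n-gram comparison is sketched below using only the standard library. Everything here is an assumption for illustration: the helper names (`ngram_features`, `NaiveBayes`, `accuracy`), the placeholder labels "A" and "B", and the toy sentences are hypothetical, and a hand-rolled multinomial naive Bayes stands in for whatever classifier the original system used. It is not the chapter 7 method. In practice a library vectorizer and classifier, plus cross-validation on the real corpus, would be the more likely choice.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text, max_n=1):
    """Count all word n-grams of length 1..max_n in a whitespace-split text."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, feature_counts, labels):
        self.class_counts = Counter(labels)
        self.feature_totals = defaultdict(Counter)
        for feats, label in zip(feature_counts, labels):
            self.feature_totals[label].update(feats)
        self.vocab = {f for c in self.feature_totals.values() for f in c}
        return self

    def predict(self, feats):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.feature_totals[label].values())
            score = math.log(n / n_docs)  # log prior
            for f, count in feats.items():
                if f in self.vocab:  # ignore features never seen in training
                    p = (self.feature_totals[label][f] + 1) / (total + len(self.vocab))
                    score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(train, test, max_n):
    """Train on (text, label) pairs and return hold-out accuracy."""
    model = NaiveBayes().fit(
        [ngram_features(text, max_n) for text, _ in train],
        [label for _, label in train],
    )
    hits = sum(model.predict(ngram_features(text, max_n)) == label
               for text, label in test)
    return hits / len(test)


# Toy placeholder data; substitute labeled texts from the class corpus.
TRAIN = [("the ship sailed at dawn", "A"), ("the ship left the harbor", "A"),
         ("the horse ran home", "B"), ("the horse crossed the field", "B")]
TEST = [("a ship sailed", "A"), ("a horse ran", "B")]

print("unigram accuracy:", accuracy(TRAIN, TEST, max_n=1))
print("uni+bigram accuracy:", accuracy(TRAIN, TEST, max_n=2))
```

The point of the design is that the feature extractor is the only thing that changes between the two runs, so any difference in accuracy can be attributed to the added context-aware features rather than to the classifier.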

Week 14

Week 15

  • Mon, 4/22. No class meeting (Easter break).
  • Weds, 4/24.
    • Due: Read chapter 12 (deep learning).

Week 16

  • Mon, 4/29. Loose ends, takeaways.
  • Weds, 5/1. Presentations and conclusions.

Week 17

Final project due in lieu of exam, Weds, 5/8, 6:15pm.
