Information Science 3350/6350

Text mining for history and literature

Staff and sections

Instructor: Matthew Wilkens
Graduate TA: Maria Antoniak
Undergrad TAs: Jannie Li, Haley Mathews, and LeAnn McDowall

Term: Fall 2020
Credits: 3
Mode: Online synchronous

Lecture: MW 11:30-12:20
Sections: F 10:20-11:10am and 11:15am-12:05pm
Additional grad section: F 12:20-1:10pm
Office hours: See Canvas

Online sessions and resources: See the Mechanics section below.

Waitlist

If the course is full at registration time, you may add yourself to the waitlist. If and when you have been admitted, you will receive a PIN that will allow you to complete registration.

Summary

A course on the uses of text mining and other data-intensive techniques to assist analysis of textual humanities materials. Special emphasis on literary and historical documents. Intended for students with programming experience equivalent to CS 1110 (Intro to computing using Python).

Description

Broadly speaking, the course covers text mining, content analysis, and basic machine learning, emphasizing approaches with demonstrated value in literary studies and other humanistic fields. Students will learn how to clean and process textual corpora, extract information from unstructured texts, identify relevant textual and extra-textual features, assess document similarity, cluster and classify authors and texts using a variety of machine-learning methods, visualize the outputs of statistical models, and incorporate quantitative evidence into literary and humanistic analysis. The course will also introduce some of the more interesting recently published results in computational and quantitative humanities.

Most of the methods treated in the class are relevant in multiple fields. Students from all majors are welcome. Students with backgrounds in the humanities are especially encouraged to join.

Objectives and learning goals

The primary objective of the course is to build proficiency in text analysis and data mining for the humanities. Students who complete the course will have knowledge of standard approaches to text analysis and will be familiar with the humanistic ends to which those approaches might be put. Secondary objectives include acquiring a basic understanding of relevant literary history, of how to integrate quantitative with qualitative evidence, and of best practices in small-scale project management.

Mechanics

Almost all of the work for the course will be conducted online. We will use:

  • Zoom for online meetings and office hours
  • GitHub to distribute lecture materials, code, and problem sets. The current version of the syllabus is always on GitHub, too.
  • Canvas to distribute restricted readings, to collect reading response posts, and to distribute video recordings of the lectures
  • CMS to collect problem sets and other code work
  • Campuswire for Q&A.

Links and detailed info about each of these are available via the course Canvas site.

Work and grading

Grades will be based on weekly problem sets (50% in sum), a midterm mini-project (15%), reading responses (10% in sum, see Canvas for details), a take-home final exam or optional final project (20%), and class participation (5%). The problem sets and the mini-project together account for 65% of the grade. You must achieve a passing grade in each of these components to pass the course.
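As a rough illustration of how these weights combine, here is a minimal Python sketch. It reads the breakdown as problem sets 50%, mini-project 15% (65% together), reading responses 10%, final exam or project 20%, and participation 5%. The function name, the 0-100 score scale, and the example scores are assumptions made for illustration; the sketch does not model the requirement to pass each component, and it is not the official grade calculation.

```python
# Component weights (percent of the final grade), as listed above.
WEIGHTS = {
    "problem_sets": 50,
    "mini_project": 15,
    "reading_responses": 10,
    "final": 20,
    "participation": 5,
}


def weighted_total(scores):
    """Combine per-component scores (assumed 0-100 each) into a 0-100 total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "problem_sets": 90,
    "mini_project": 80,
    "reading_responses": 100,
    "final": 85,
    "participation": 100,
}
print(weighted_total(example))
```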

Graduate students (enrolled in 6350) must complete a final project in place of the final exam and are strongly encouraged to attend an additional weekly section meeting for 6350 (Fridays, 12:20-1:10pm).

Texts and readings

There is no required textbook for the course. All assigned readings will be available online, either through the open web or via Canvas. See the Schedule below for details.

There are four textbooks that may be useful for students who wish to consult them. They are not required and most students will not need them.

Schedule

In general, Monday lectures will introduce new technical material. Wednesday sessions will combine technical instruction with discussion of assigned readings from the scholarly literature. Friday sections are smaller and devoted to focused work on problem sets and to follow-up questions about topics previously introduced.

All assignments and dates are subject to change. Additional readings are likely to be added after week 7.

Course materials and problem sets will be linked here as they become available. Problem sets will be distributed (via GitHub) no later than the Friday indicated on the syllabus and are due the following Tuesday evening (via CMS) unless otherwise indicated.

Week 1 (8/31)
  • Monday: No class
  • Wednesday: Introduction
  • Friday: Setup and dummy problem set 1

Week 2 (9/7)
  • Monday: Tokenization and counting. Reading: Sentiment-aware tokenization. Optional: Jurafsky and Martin, "Regular Expressions, Text Normalization, and Edit Distance"
  • Wednesday: Readings:
  • Friday: Problem set 2: Word clouds

Week 3 (9/14)
  • Monday: Sentiment scoring. Reading: Syuzhet package. Optional: Jurafsky and Martin, "Lexicons for Sentiment, Affect, and Connotation"
  • Wednesday: Readings:
  • Friday: Problem set 3: Sentiment and gender

Week 4 (9/21)
  • Monday: Vectorization, distance metrics, and regression. Readings:
  • Wednesday: Readings: Moretti, "Slaughterhouse of Literature" (Canvas). Optional: Evert et al., "Understanding and Explaining Delta Measures" (Canvas). Response 1 due no later than this session.
  • Friday: Problem set 4: Document similarity

Week 5 (9/28)
  • Monday: Clustering. Reading: Grimmer and Stewart, "Text as Data"
  • Wednesday: Reading: Allison et al., "Quantitative Formalism"
  • Friday: Problem set 5: Clustering with scikit-learn

Week 6 (10/5)
  • Monday: Classification I. Reading: Underwood, "Understanding Genre in a Collection of a Million Volumes"
  • Wednesday: Readings: Response 2 due no later than this session.
  • Friday: Problem set 6: Classifying novels (mini-project part 1)

Week 7 (10/12)
  • Monday: Classification II, including regularization and dimension reduction. Reading: "The curse(s) of dimensionality"
  • Wednesday: No class (fall break)
  • Friday: Problem set 7: Corpus building (mini-project part 2)

Week 8 (10/19)
  • Monday: Feature importance and explainability. Reading: "The Importance of Human Interpretable Machine Learning"
  • Wednesday: Reading: Piper, "Characterization" (Canvas)
  • Friday: Problem set 8: Mini-project part 3

Week 9 (10/26)
  • Monday: Hypothesis testing and confidence intervals. Readings:
  • Wednesday: Reading:
  • Friday: Problem set 9: Statistical testing

Week 10 (11/2)
  • Monday: NLP and feature expansion. Reading: Spacy 101. Optional: Jurafsky and Martin, ch. 8, "Part of Speech Tagging"
  • Wednesday: Reading: Evans and Wilkens, "Nation, Ethnicity, and the Geography of British Fiction"
  • Friday: Problem set 10: Extended features

Week 11 (11/9)
  • Monday: Topic models. Reading: Boyd-Graber, Hu, and Mimno, Applications of Topic Models, chapters 1, 4, and 6.
  • Wednesday: Reading: Barron et al., "Individuals, Institutions, and Innovation in the Debates of the French Revolution". Response 3 due no later than this session.
  • Friday: Open discussion, no new assignments

Week 12 (11/16)
  • No classes this week (semifinals).

Week 13 (11/23)
  • No classes this week (Thanksgiving).

Week 14 (11/30)
  • Monday: Word embeddings. Reading: Ruder, "On Word Embeddings"
  • Wednesday: Reading: Nelson, "Leveraging the Alignment between Machine Learning and Intersectionality" (Canvas)
  • Friday: Problem set 11: Word embeddings

Week 15 (12/7)
  • Monday: Deep learning and social media data. Readings:
  • Wednesday: Readings:
  • Friday: Review and project work.

Week 16 (12/14)
  • Monday: Wrap-up and flex time
  • Wednesday: Summary discussion and conclusions. Response 4 due no later than this discussion.
  • Friday: No class

Final exam

A final exam in the form of a take-home project is due on Saturday, 12/19, at 5:00pm EST via CMS. The exam is now available in the final_exam directory. You may work on it -- alone, not in a group -- as much or as little as you like until the due date.

Undergraduates (enrolled in 3350) may elect to complete a project in lieu of the exam. If you elect to take this route, you may work in a group of up to three students. The expected amount of work on the project will be scaled by the number of group members. Except in unusual circumstances, all group members will receive the same grade.

Graduate students (enrolled in 6350) must complete an independent project in place of the final exam.

Policies

COVID information

This is an unusual semester. Our goal is to keep one another safe, to cover as much material as possible, and to adapt to the circumstances as we find them.

Students and staff will adhere to the behavioral compact at all times. For in-person sections, you must remain in your assigned seat. If you do not have an assigned seat, do not come to your in-person section; instead, contact course staff for instructions on joining a remote section until you are cleared to return.

If you feel unwell in any way, or if you are not cleared through the daily check process, do not come to your in-person section.

Attendance and late work

This is a synchronous class of moderate size that will make frequent use of class time to discuss readings and to debate different approaches to academic inquiry. For this reason, attendance (virtual and physical, depending on the mode of your section) is required.

Students in highly displaced time zones who have received individual permission are excused from attending the Monday and Wednesday lectures synchronously. These students should watch the recorded lectures as soon as they are available and post any questions to Campuswire.

If you need to miss a class meeting, please complete the absence form before the meeting in question and watch the recorded video of the session you missed once it is available on Canvas. If you miss section on Friday, a recording may not be available. Consult with your section leader for appropriate steps. In every case, assigned work remains due at the appointed time.

Note: Participation is much more important than attendance. Your grade will not suffer if you make the wise decision to stay home when you might infect others.

Late work is accepted subject to a limit of five total slip days for the semester. You may submit one assignment up to five days late, or five assignments each one day late, or any other combination that does not exceed five late days in total. The slip day policy does not apply to the reading responses, which may not be submitted late, since they are tied to in-class activities.
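The slip-day budget is simple arithmetic, and a small Python sketch may make the combinations concrete. It assumes lateness is counted in whole days per assignment; how partial days are rounded is not stated here, so treat that as an open question for course staff. The function names are hypothetical and chosen only for this example.

```python
MAX_SLIP_DAYS = 5  # total late days allowed across the semester


def slip_days_remaining(days_late_per_assignment):
    """Return the slip days left; a negative result means the budget is exceeded."""
    if any(days < 0 for days in days_late_per_assignment):
        raise ValueError("days late cannot be negative")
    return MAX_SLIP_DAYS - sum(days_late_per_assignment)


def within_slip_budget(days_late_per_assignment):
    """True if the listed late submissions fit within the five-day limit."""
    return slip_days_remaining(days_late_per_assignment) >= 0


print(within_slip_budget([5]))              # one assignment, five days late
print(within_slip_budget([1, 1, 1, 1, 1]))  # five assignments, one day late each
print(within_slip_budget([3, 3]))           # six days in total
```

Reading responses fall outside this budget entirely, since they may not be submitted late.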

If you expect to miss a deadline or to be absent for an extended period due to truly exceptional circumstances, contact me as far in advance as possible so that we can discuss potential accommodations.

Harassment and respect

All students are entitled to respect from course staff and from their fellow students. All staff are entitled to respect from students and from fellow staff members. Violations of this principle, whether large or small, will not be tolerated.

Respect means that your ideas are taken seriously, that you feel welcome in class settings (including in study groups and online fora), and that you are treated as a full, co-equal member of the class. Harassment describes any action, intentional or otherwise, that abridges the respect owed to every member of the class.

If you experience harassment in any form, or if you would like to discuss your experience in the class, please see me in office hours or contact me by email. The university also has reporting and counseling resources available, including those for sexual harassment and for other bias incidents.

Academic integrity

Each student in this course is expected to abide by the Cornell University Code of Academic Integrity. Any work submitted by a student in this course for academic credit will be the student's own work unless specifically and explicitly permitted otherwise.

Using other people's code is an important part of programming but, for group projects, the code should be substantially the work of the group members (except for standard libraries). Any code used in projects that was not written by the group members should be placed in separate files and clearly labeled with their source URLs. If you have benefitted from online resources such as StackOverflow, list the URLs in comments in your own code, even if you did not directly copy anything.

Project work that relates to your other classes or research is encouraged, but you may not recycle assignments. There must be no doubt that the work you turn in for this class was done for this class. When in doubt, consult with me during office hours.

Disabilities

Every student's access is important to us. If you have, or think you may have, a disability, please contact Student Disability Services for a confidential discussion: 607-254-4545 or sds.cornell.edu.

  • Please request any accommodation letter early in the semester, or as soon as you become registered with Student Disability Services (SDS), so that we have adequate time to arrange your approved academic accommodations.
  • Once SDS approves your accommodation letter, it will be emailed to you and to me. Please follow up with me to discuss the necessary logistics of your accommodations.
  • If you are approved for exam accommodations, please consult with me at least two weeks before the scheduled exam date to confirm the testing arrangements.
  • If you experience any access barriers in this course, such as with printed content, graphics, online materials, or any communication barriers, reach out to me and/or your SDS counselor right away.
  • If you need an immediate accommodation, please speak with me after class or send an email message to me and to SDS.