Using the ProQuest Times of India corpus spanning 18XX--2008, we shed light on a slew of interesting questions.
-
Parse ToI: parses the XML to CSV. Here's the data dictionary.
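A minimal sketch of the XML-to-CSV step, for illustration only. The tag names (`record_id`, `title`, `pub_date`, `startpage`, `contributor`) and the toy record are assumptions, not the actual ProQuest schema; the real field names are whatever the data dictionary lists.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical field names; substitute the ones from the data dictionary.
FIELDS = ["record_id", "title", "pub_date", "startpage", "contributor"]


def record_to_row(xml_text, fields=FIELDS):
    """Flatten one XML record into a dict with one key per field."""
    root = ET.fromstring(xml_text)
    row = {}
    for field in fields:
        # Collect the text of every element with this tag, at any depth.
        values = [(el.text or "").strip() for el in root.iter(field)]
        # Repeated tags (e.g. several contributors) are joined with "; ".
        row[field] = "; ".join(v for v in values if v)
    return row


def write_csv(xml_records, out, fields=FIELDS):
    """Write one CSV row per XML record to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for xml_text in xml_records:
        writer.writerow(record_to_row(xml_text, fields))


# Made-up record, only to exercise the functions.
sample = """<record>
  <record_id>1</record_id>
  <title>Monsoon arrives</title>
  <pub_date>20000101</pub_date>
  <startpage>1</startpage>
  <contributor>TNN</contributor>
</record>"""

buffer = io.StringIO()
write_csv([sample], buffer)
csv_text = buffer.getvalue()
```

Missing tags come out as empty strings rather than raising, which keeps the CSV rectangular when some records lack a contributor or start page.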
-
- number of articles per issue (by pub_date/year/month/weekday vs. weekend)
- number of words per article/title over time
- number (proportion) of articles credited to TNN as contributor (the rest are presumably sourced via AP etc., but good to group by contributor)
- gender, religion etc. of contributors -- histogram of the top 50 first names and surnames, using naampy and pranaam
- number of contributors per article
- number of editorials vs. news articles
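A few of the counts above can be sketched with pandas once the CSV is loaded. This is a sketch under assumptions: the column names (`pub_date`, `text`, `contributor`) are guesses at the parsed CSV's layout, the rows are made up, and treating any contributor string containing "TNN" as a TNN article is only one possible rule.

```python
import pandas as pd

# Made-up rows; the real data would come from the parsed CSV.
articles = pd.DataFrame({
    "pub_date": ["2000-01-01", "2000-01-01", "2000-01-03",
                 "2000-01-03", "2000-01-03"],
    "text": ["one two three", "one two", "one",
             "one two three four", "one two"],
    "contributor": ["TNN", "AP", "TNN", None, "Reuters"],
})
articles["pub_date"] = pd.to_datetime(articles["pub_date"])

# Articles per issue, treating each publication date as one issue.
per_issue = articles.groupby("pub_date").size()

# Mean articles per issue, weekday vs. weekend (Saturday=5, Sunday=6).
issues = per_issue.rename("n_articles").reset_index()
issues["weekend"] = issues["pub_date"].dt.dayofweek >= 5
mean_by_daytype = issues.groupby("weekend")["n_articles"].mean()

# Words per article (whitespace tokens), averaged by year.
articles["n_words"] = articles["text"].str.split().str.len()
words_by_year = articles.groupby(articles["pub_date"].dt.year)["n_words"].mean()

# Share of articles whose contributor field mentions TNN.
is_tnn = articles["contributor"].fillna("").str.contains("TNN", regex=False)
tnn_share = is_tnn.mean()
```

The same `groupby` pattern extends to month or weekday by swapping in `dt.month` or `dt.dayofweek` as the grouping key.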
-
Other Ideas
- number of classified ads (on startpage == 1)
- number of ads (on startpage == 1)
- US vs. USSR/Russia etc.
- matrimonial ads: "caste no bar", "fair complexion", etc.
- Need Annotation
  - news/not news
  - gov vs. not in ads
  - 'episodic' (x happened) vs. 'thematic' (a more detailed/contextual piece)
  - local/national/foreign news
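For the matrimonial-ads idea, one simple starting point is to count how many ads mention each phrase. The sketch below assumes the ad texts are already isolated; the example ads are made up, and the function name `count_phrases` is hypothetical.

```python
import re
from collections import Counter

PHRASES = ["caste no bar", "fair complexion"]


def count_phrases(texts, phrases=PHRASES):
    """Count how many texts mention each phrase at least once.

    Matching is case-insensitive and tolerates extra whitespace
    between the words of a phrase.
    """
    patterns = {
        p: re.compile(
            r"\b" + r"\s+".join(re.escape(w) for w in p.split()) + r"\b",
            re.IGNORECASE,
        )
        for p in phrases
    }
    counts = Counter({p: 0 for p in phrases})
    for text in texts:
        for phrase, pattern in patterns.items():
            if pattern.search(text):
                counts[phrase] += 1
    return counts


# Made-up ads, only to exercise the function.
ads = [
    "Wanted suitable match for graduate girl, fair complexion. Caste no bar.",
    "Match for engineer, 28. CASTE  NO BAR.",
    "Required bride for doctor, early marriage.",
]
counts = count_phrases(ads)
```

Counting ads rather than total occurrences keeps one repetitive ad from inflating a phrase's share; dividing by the number of ads per year would give a proportion over time.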