Due 1/9 (Th), 12:30pm
The Internet is full of published linguistic data sets. Let's data-surf! Instructions:
- Go out and find two linguistic data sets you like. One should be a corpus, the other should be some other format. They must be free and downloadable in full. Make sure they are linguistic data sets, meaning designed for linguistic inquiries.
- You might want to start with various bookmark sites listed in the following Learning Resources sections: Linguistic Data, Open Access, Data Publishing, and Corpus Linguistics. But don't be constrained by them.
- Download the data sets and poke around. Open up a file or two to take a peek. (No need to do this in Python.)
- In a text file (should have the `.txt` extension), make note of:
  - The name of the data resource
  - The author(s)
  - The URL of the download page
  - Its makeup: size, type of language, format, etc.
  - License: whether it comes with one, and if so what kind?
  - Anything else noteworthy about the data. A sentence or two will do.
- If you are comfortable with markdown, make an `.md` file instead of a text file.
SUBMISSION: Upload your text file to the To-do1 submission link on CourseWeb. If you do not have CourseWeb access, email your submission to Jevon, cc Cassie and John.
Due 1/16 (Th), 12:30pm
Learn about the `numpy` library: study the Python Data Science Handbook and/or the NumPy documentation here.
While doing so, create your own study notes, as a Jupyter Notebook file entitled `numpy_notes_yourname.ipynb`.
Include examples, explanations, etc. Replicating DataCamp's examples is also something you could do.
You are essentially creating your own reference material.
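As a rough idea of what one entry in such notes might look like, here is a minimal sketch covering array creation, vectorized arithmetic, boolean masking, aggregation, and broadcasting. The variable names and values are invented for illustration; they are not taken from the Handbook or from DataCamp.

```python
import numpy as np

# Array creation: five made-up word lengths
lengths = np.array([3, 7, 5, 10, 2])

# Vectorized arithmetic: the operation applies to every element, no loop needed
doubled = lengths * 2

# Boolean masking: keep only the values greater than 4
long_ones = lengths[lengths > 4]

# Aggregation over the whole array
mean_length = lengths.mean()

# Broadcasting: a (2, 1) column combines with a (3,) row into a (2, 3) grid
grid = np.array([[0], [10]]) + np.array([1, 2, 3])

print(doubled, long_ones, mean_length, grid.shape)
```

In a notebook, each of these steps would typically sit in its own cell with a short markdown explanation above it.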
SUBMISSION: Your file should be in the `todo2/` directory of the `Class-Exercise-Repo`.
Make sure it's configured for the "upstream" remote and your fork is up-to-date. Push to your GitHub fork, and create a pull request for me.
Due 1/21 (Tue)
Study the `pandas` library (through the Python Data Science Handbook and/or the documentation). `pandas` is a big topic with lots to learn: aim for about half of it. While doing so, try it out on TWO spreadsheet (.csv, .tsv, etc.) files:
- The first file should be your choice. You can get one from this CSV Files archive, or make up your own. Keep it super simple! It's supposed to be a toy dataset.
- The second one should be `billboard_lyrics_1964-2015.csv` by Kaylin Pavlik, from her project '50 Years of Pop Music'. (Note: you might need to specify ISO8859 encoding.)
Name your Jupyter Notebook file `pandas_notes_yourname.ipynb`. Don't change the filename of any downloaded CSV files or edit them in any way.
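The encoding note above can be sketched as follows. Since the real file is not bundled here, the snippet first writes a tiny stand-in CSV in ISO-8859-1 (Latin-1); the column names and rows are invented for the demo and are not the Billboard file's actual contents. For the real data you would pass the downloaded filename to `pd.read_csv` instead.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the real download: a tiny CSV saved in ISO-8859-1 (Latin-1).
# The column names and rows are invented for this demo.
sample = "Song,Artist\nDéjà Vu,Beyoncé\nHello,Adele\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="ISO-8859-1", newline="") as tmp:
    tmp.write(sample)

# read_csv assumes UTF-8 by default, which fails on these accented bytes;
# naming the encoding explicitly lets pandas decode the file.
songs = pd.read_csv(tmp.name, encoding="ISO-8859-1")
os.remove(tmp.name)

print(songs.shape)
print(songs.head())
```

If a file opens without an error but accented characters look garbled, a wrong encoding guess is a likely cause.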
SUBMISSION: Your files should be in the `todo3/` directory of `Class-Exercise-Repo`.
Commit and push all three files to your GitHub fork, and create a pull request for me.
Due 1/23 (Thu)
This one is a continuation of To-do #3: work further on your `pandas` study notes. You may create a new JNB file, or you can expand the existing one. Also: try out a spreadsheet submitted by a classmate. You are welcome to view the classmate's notebook to see what they did with it. (How to find out who submitted what? Git/GitHub history of course.) Give them a shout-out.
SUBMISSION: We'll stick to the `todo3/` directory in `Class-Exercise-Repo`. Push to your GitHub fork, and create a pull request for me.
Due 1/30 (Thu)
For this To-do, refer back to the edited version of `english.csv` from class Activity 3. Add a markdown cell block to your Jupyter Notebook file for activity3 clearly labeling the beginning of To-do #5.
This time we'll look at the response times for the naming task (`RTnaming`). The equipment that Balota et al. used to gather this naming data was voice-activated. As such, the acoustic properties of a word's initial segment may have affected the time it took to register a response. Let's figure out whether it did.
- Inspect the distribution of naming latencies.
- Plot two histograms for the naming latencies, with different bin sizes.
- Plot the density of the naming latencies. Is this a normal distribution?
- The column `Voice` specifies whether a word's initial phoneme was voiced or voiceless. Make a boxplot for the distribution of reaction times across voiced and voiceless phonemes, grouped by subject age.
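The plotting steps above could be sketched roughly as follows, using pandas' matplotlib-based plotting. This is not a solution for the real data: the DataFrame is synthetic, the values are arbitrary, and the subject-age column name `AgeSubject` and its labels are assumptions (only `RTnaming` and `Voice` are named in this To-do). Check the column names in your own edited `english.csv` before adapting it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for english.csv; 'AgeSubject' is an assumed column name
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "RTnaming": rng.normal(500, 50, n),
    "Voice": rng.choice(["voiced", "voiceless"], n),
    "AgeSubject": rng.choice(["young", "old"], n),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Two histograms of the same column with different bin counts
df["RTnaming"].plot.hist(bins=10, ax=axes[0], title="10 bins")
df["RTnaming"].plot.hist(bins=40, ax=axes[1], title="40 bins")

# A density curve can be drawn with df["RTnaming"].plot.density(),
# which relies on SciPy, so it is left as a comment in this sketch.

# One box per combination of subject age and voicing
df.boxplot(column="RTnaming", by=["AgeSubject", "Voice"], ax=axes[2])
fig.tight_layout()
```

Comparing the two histograms shows how the bin count changes the apparent shape of the distribution, which is the point of plotting it twice.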
SUBMISSION: Submit a pull request including your updated JNB file.
Due 2/11 (Tue)
The Gries & Newman article cites many famous corpora and corpus resources. Let's round them all up in a single spot, complete with web links. We will collaborate on a shared document called `corpora_tools_list.md`.
- The `Class-Plaza` repo belongs to all of us: we are all listed as collaborators.
- That means all of us have full read and write access: in GitHub's lingo, we have 'push access'.
- Which means no need to fork; you should directly clone. After that, push and pull directly.
Your job is to fill out the three tables: add at least one entry to each table. Make sure you are not duplicating someone else's entry. Because everyone is editing the same document, you may run into a conflict while trying to push. Make sure you have read and understood this tutorial on Git conflicts and resolve accordingly.
SUBMISSION: There is no formal submission process, because this one does not involve you issuing a pull request or anything like that. I will check on the repo later to see you have indeed made your contribution.
Due 2/13 (Thu)
Let's pool our questions together for Dr. Lauren Collister and Dominic Bordelon, who will be our guest speakers on Thursday.
Review the topics of linguistic data, open access, and data publishing, focusing in particular on these three resources: Data Management Plans for Linguistic Research, Kitzes (2018), and the Copyright and Intellectual Property Toolkit.
Think of a question for Lauren, and add yours along with your name to the `questions_coll_bord.md` file in our `Class-Plaza` repo.
SUBMISSION: Push your commit directly to the `Class-Plaza` repo. Make sure you don't trample on someone else's contribution. If there is a conflict, it is your job to resolve it.
Due 2/18 (Tue)
Let's try Twitter mining! On a tiny scale that is. This blog post Data Analysis using Twitter presents an easy-to-follow, step-by-step tutorial, so you should follow it along.
First, you will need to install the `tweepy` library:
- Option 1: Install through Anaconda Navigator. See this screenshot.
  - See if Tweepy is installed. It likely isn't -- download and install it.
- Option 2: Manual installation through pip. See these screenshots.
  - If `which pip` does not show your Anaconda version of pip, it means you cannot simply go `pip install tweepy`. You will instead have to specify the complete path for Anaconda's pip. So, find your Anaconda installation path, and install Tweepy like so: `/c/ProgramData/Anaconda3/Scripts/pip install tweepy`
  - Your pip path might be something like `/c/Users/your-user-name/Anaconda3/...`. You should use TAB completion while typing out the path.
  - If you are having trouble finding your Anaconda's path, try `which -a python`. The `-a` flag shows all python executables found in your path.
Notes on using `tweepy`:
- If you don't have a Twitter account, you will have to create one first. And then, you should create an API account.
- This is exciting stuff, but don't go overboard! 100 Tweets are enough for this exercise. Overloading an API without taking proper precautions is a sure-fire way to get yourself banned from tech sites.
- You will be using your 'Consumer Key' and 'Consumer Secret' in your code. You should not be sharing them! Right before committing, redact them in your JNB file by changing the string values to 'XXXXXXXXXXXXXX'.
SUBMISSION: We are switching back to `Class-Exercise-Repo`; use the `todo8/` folder. Your Jupyter Notebook file should have your name in the file name. Push to your fork and create a pull request. Make sure you have redacted your personal API keys!
Due 2/20 (Thu)
What have the previous students of LING 1340/2340 accomplished? What do finished projects look like? Let's have you explore their past projects. Details:
- We'll collaborate on a single file: `todo9_past_project_critiques.md` in `Class-Plaza`.
- Pick two projects, but you can't pick one that already has three critiques. (If someone gets the last spot while you're working, you'll have to pick a new one. And: declaring "dibs" is allowed, as long as you intend to finish the work within the next hour.)
- Create a section for yourself. Provide a link to the project repo.
- Your critique should consist of: one thing you thought was done well, one avenue for improvement, and one thing you learned.
SUBMISSION: Push your commit directly to the `Class-Plaza` repo. Make sure you don't trample on someone else's contribution. If there is a conflict, it is your job to resolve it.
Due 2/27 (Thu)
Let's try sentiment analysis on movie reviews. Follow this tutorial from PyLing in your own Jupyter Notebook file. Feel free to explore and make changes as you see fit. If you haven't already, review the Python Data Science Handbook chapters to give yourself a good grounding. Also: watch these (or equivalent) YouTube tutorials Scikit-Learn Tutorial | Machine Learning With Scikit-Learn, and NLP Tutorial with Python & NLTK.
Students who took LING 1330: compare sklearn's Naive Bayes with NLTK's treatment and include a blurb on your impression. (You don't have to run NLTK's code, unless you want to!)
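For orientation before starting the tutorial, here is a minimal sketch of a bag-of-words Naive Bayes classifier in scikit-learn. It does not reproduce the tutorial's code: the four training "reviews", their labels, and the test sentences are invented, and a real run would use the movie review data the tutorial provides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy reviews; a real experiment needs far more data
train_docs = [
    "a wonderful and moving film",
    "great acting and a wonderful story",
    "a boring and terrible film",
    "terrible acting and a boring story",
]
train_labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer turns each text into word counts;
# MultinomialNB learns per-class word probabilities from those counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

predictions = model.predict(["what a wonderful story", "boring and terrible"])
print(list(predictions))
```

Wrapping the vectorizer and classifier in one pipeline means new text gets the same preprocessing as the training data, which is easy to get wrong when the two steps are handled separately.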
SUBMISSION: Your Jupyter Notebook file should be in the `todo10` folder of `Class-Exercise-Repo`. As usual, push to your fork and create a pull request.
Due 3/3 (Tue)
What has everyone been up to? Let's take a look -- it's a "visit your classmates" day!
- First off, prepare your own "Guestbook" file. It's already been created in the `Class-Plaza` repo, but you should edit it so that:
  - It has your project title and a link to your repo, and your name
  - And a bit of personalization if you want, like a greeting.
- Now visit your classmates' projects! We will go in alphabetical order by first name, as seen in the directory. You should visit two people after you (Anthony: Joey and Jordan; Joey: Jordan and Juan; Natasha: Sean and Anthony; etc.)
- Take a look around, and write on their guestbook. (You don't have to wait until it's prepped.) Like the previous To-do, your entry should consist of: one thing you thought was done well, one avenue for improvement or suggestion, and one thing you learned.
SUBMISSION: Since `Class-Plaza` is a fully collaborative repo, there is no formal submission process.
Due 3/31 (Tue)
Let's poke at big data. Well, big-ish. The Yelp Open Dataset is a subset of their review data, available specifically for use in educational and academic contexts. Let's check it out! Before we begin:
- You will need a fairly stable internet connection and at least 14GB of free hard drive space.
- Provision enough time. Downloading the dataset alone may take 25 minutes or longer.
- If your laptop is fairly old or running out of space, let me know!
- After downloading the data set, you should operate exclusively in a command-line environment, utilizing unix tools.
- I am supplying general instructions below, but you will have to fill in the blanks between steps, such as cd-ing into the right directory, invoking your Anaconda Python and finding the right file argument.
- You will be submitting a short (paragraph-length -- this is just a To-do!) write-up as a Markdown file named `yelp_tryout_yourname.md` in the `todo12/` directory of `Class-Exercise-Repo`.
Let's download this beast and poke around.
- Download the JSON portion of the data. (We don't need the photos.)
- Move the downloaded archive file into your `Documents/Data_Science` directory. You might want to create a new folder there for the data files.
- From this point on, operate exclusively in command line.
- The file is in the `.tar` format. Look it up if you are not familiar. Untar it using `tar -xvf`. It will extract 6 json files along with some PDF documents.
- Using various unix commands (`ls -lhAF`, `head`, `tail`, `wc -l`, etc.), find out: how big are the json files? What do the contents look like? How many reviews are there?
- How many reviews use the word 'horrible'? Find out through `grep` and `wc -l`. Take a look at the first few through `head | less`. Do they seem to have high or low stars?
. Do they seem to have high or low stars? - How many reviews use the word 'scrumptious'? Do they seem to have high stars this time?
How much processing can our own puny personal computer handle? Let's find out.
- First, take stock of your computer hardware: disk space, memory, processor, and how old it is.
- Create a Python script file: `process_reviews.py`. Content below. You can use nano, or you could use your favorite editor (atom, notepad++) provided that you launch the application through command line.
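```python
import sys
from collections import Counter

import pandas as pd

# The JSON file to process is given as the first command-line argument
filename = sys.argv[1]

# Each line of the file is a separate JSON object, hence lines=True
df = pd.read_json(filename, lines=True, encoding='utf-8')
print(df.head(5))

# Join all review texts, split on whitespace, and count the tokens
wtoks = ' '.join(df['text']).split()
wfreq = Counter(wtoks)
print(wfreq.most_common(20))
```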
- We are NOT going to run this on the whole `review.json` file! Start small by creating a tiny version consisting of the first 10 lines, named `FOO.json`, using `head` and `>`.
- Then, run `process_reviews.py` on `FOO.json`. Note that the json file should be supplied as command-line argument to the Python script. Confirm it runs successfully.
- Next, re-create `FOO.json` with incrementally larger total # of lines and re-run the Python script. The point is to find out how much data your system can reasonably handle. Could that be 1,000 lines? 100,000?
- While running this experiment, closely monitor the process on your machine. Windows users should use Task Manager, and Mac users should use Activity Monitor.
- Finally, write up a short reflection summary as `yelp_tryout_yourname.md`. A paragraph will do. How was your laptop's handling of this data set? What sorts of resources would it take to successfully process it in its entirety and through more computationally demanding processes? Any other observations?
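When reflecting on what it would take to process the whole file, one option to consider is reading it in pieces rather than all at once. The sketch below is an illustration under assumptions, not part of the assigned script: the function name `count_tokens_in_chunks`, the default chunk size, and the three sample reviews are all hypothetical. It uses the `chunksize` argument of `pd.read_json`, which requires `lines=True` and yields one DataFrame per chunk.

```python
import json
import os
import tempfile
from collections import Counter

import pandas as pd


def count_tokens_in_chunks(filename, chunksize=10000):
    """Count whitespace-separated tokens without loading the whole file."""
    wfreq = Counter()
    # With chunksize set, read_json returns an iterator of DataFrames
    reader = pd.read_json(filename, lines=True, chunksize=chunksize,
                          encoding="utf-8")
    for chunk in reader:
        for text in chunk["text"]:
            wfreq.update(text.split())
    return wfreq


# Tiny invented file standing in for FOO.json
reviews = [{"text": "good food"}, {"text": "good service"}, {"text": "bad food"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    for r in reviews:
        tmp.write(json.dumps(r) + "\n")

demo_counts = count_tokens_in_chunks(tmp.name, chunksize=2)
os.remove(tmp.name)
print(demo_counts.most_common(2))
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size rather than the file size; the trade-off is that the full DataFrame is never available for whole-table operations.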
SUBMISSION: Your markdown file should be in the `todo12` directory in `Class-Exercise-Repo`. Make sure you use the naming convention described here! As usual, push to your fork and create a pull request.
Due 4/2 (Thu)
It's "visit your classmates" day, round 2!
- First, maintain your own guestbook. Respond to your classmates' visit logs.
- No need to post a lengthy response -- you are not starting a debate here! You can think of it as something akin to Facebook's "like", just a small acknowledgment and a thank you.
- Now visit your classmates' projects. Like before, we go by the order of first names, which can be found right in the directory. You should visit the two people after your previous visits.
- Take a look around, and write on their guestbook. Like the previous visit, your entry should consist of: one thing you thought was done well, one avenue for improvement or suggestion, and one thing you learned.
SUBMISSION: Since `Class-Plaza` is a fully collaborative repo, there is no formal submission process.
Due 4/16 (Thu)
Visit your classmates, round 3.
- You know what to do! This time, visit all three projects you haven't visited yet.