This week covers:
- An intro to Git and GitHub for sharing code
- Command line tools
- R and the Tidyverse
Install tools: Ubuntu on Windows, GitHub for Windows, R, and RStudio
- Open http://aka.ms/wslstore and select Ubuntu on Windows
- If this seems like it's hanging, hit enter
- Create a username and password
- Update all packages with `sudo apt-get update` and `sudo apt-get upgrade`
- Check that you have git under bash by typing `git --version` in the terminal
- Install GitHub for Windows
- Download and install R from a CRAN mirror
- Download and install RStudio
- Open RStudio and install the `tidyverse` package, which includes `dplyr`, `ggplot2`, and more: `install.packages('tidyverse', dependencies = TRUE)`
- You'll need a plain text editing program
- If you are familiar with emacs or vim, you can install them in Ubuntu with `sudo apt-get install emacs` or `sudo apt-get install vim`
- Otherwise consider Visual Studio Code, Atom, or Sublime
- Check your editor's settings for Unix-friendly line endings (LF rather than the Windows default, CRLF)
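One way to check a file for Windows-style line endings from the bash shell is sketched below. The helper name `has_crlf` and the two sample files are made up for illustration; the idea is simply to count carriage-return characters, which a file with Unix-style line endings should not contain.

```shell
# has_crlf FILE: succeed if FILE contains at least one carriage return (\r),
# the extra character that Windows-style (CRLF) line endings add.
has_crlf() {
  count=$(tr -cd '\r' < "$1" | wc -c)
  [ "$count" -gt 0 ]
}

# Demo on two throwaway files (hypothetical names).
demo=$(mktemp -d)
printf 'hello\r\nworld\r\n' > "$demo/windows.txt"
printf 'hello\nworld\n' > "$demo/unix.txt"

for f in "$demo/windows.txt" "$demo/unix.txt"; do
  if has_crlf "$f"; then
    echo "$f: has Windows (CRLF) line endings"
  else
    echo "$f: Unix (LF) line endings only"
  fi
done
```

If a file does turn out to have carriage returns, `tr -d '\r' < in.txt > out.txt` writes a copy without them.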
- Files that you create in Ubuntu on Windows get stored in a somewhat hidden location within the Windows filesystem
- To make it easier to find files you work on in Ubuntu, do the following:
- Open a bash shell
- Go to your home directory: `cd ~`
- Create a symbolic link to your Documents folder: `ln -s /mnt/c/Users/<your name>/Documents ~/Documents` (if there's a space in your name, you'll need to backslash-escape it; a good tip is to type just the first couple of letters of your name and press Tab to autocomplete it)
- Change to this directory: `cd ~/Documents`
- Do all of your work, including the following section, from within this folder, which you'll be able to see under "Documents" in the Windows Explorer
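The linking steps above can be tried safely on a throwaway directory first. The sketch below uses a made-up user name containing a space (`Jane Doe`) and a temporary directory standing in for both `/mnt/c/Users/<your name>/Documents` and `~`, so nothing in your real home directory is touched.

```shell
# Stand-ins for the real locations (all hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/Users/Jane Doe/Documents"   # plays the role of /mnt/c/Users/<your name>/Documents
mkdir -p "$demo/home"                       # plays the role of ~

# Backslash-escape the space in the name, as described in the step above.
ln -s "$demo"/Users/Jane\ Doe/Documents "$demo/home/Documents"

# A file created through the link lands in the real folder.
touch "$demo/home/Documents/notes.txt"
ls "$demo/Users/Jane Doe/Documents"

# Show where the link points.
ls -l "$demo/home/Documents"
```

Quoting the whole path (`"$demo/Users/Jane Doe/Documents"`) is an equivalent alternative to the backslash.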
- Sign up for a free GitHub account
- Then follow this guide to fork your own copy of the course repository
- Clone a copy of your forked repository, which should be located at `https://github.com/<yourusername>/coursework.git`, to your local machine
- Once that's done, create a new file in the `week1/students` directory, `<yourfirstname>.txt` (e.g., `jake.txt`)
- Use `git add` to add the file to your local repository
- Use `git commit` and `git push` to commit and push your changes to your copy of the repository
- Then issue a pull request to send the changes back to the original course repository
- Finally, sync changes from the main repo to your fork with `git pull upstream master` (if your machine doesn't recognize `upstream`, do the following to create the `upstream` shortcut: `git remote add upstream https://github.com/msr-ds3/coursework.git`)
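The add, commit, and push steps above can be rehearsed offline before touching GitHub. The sketch below is not the course setup itself: it assumes a local bare repository as a stand-in for your fork, and the name and email passed with `git -c` are placeholders.

```shell
demo=$(mktemp -d)

# A local bare repository stands in for https://github.com/<yourusername>/coursework.git.
git init -q --bare "$demo/fork.git"
# Cloning an empty repository prints a harmless warning, silenced here.
git clone -q "$demo/fork.git" "$demo/coursework" 2>/dev/null

# Create the new file in week1/students.
mkdir -p "$demo/coursework/week1/students"
echo "Hello from Jake" > "$demo/coursework/week1/students/jake.txt"

# Stage, commit, and push (the identity values are placeholders).
git -C "$demo/coursework" add week1/students/jake.txt
git -C "$demo/coursework" -c user.name="Jake" -c user.email="jake@example.com" \
    commit -q -m "Add jake.txt"
git -C "$demo/coursework" push -q origin HEAD

# Register the original course repository as "upstream" and list the remotes.
git -C "$demo/coursework" remote add upstream https://github.com/msr-ds3/coursework.git
git -C "$demo/coursework" remote -v
```

With a real fork you would run the same `git add`, `git commit`, and `git push` commands from inside your cloned `coursework` directory, without the `-C` and `-c` options once your identity is configured.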
- Codecademy's interactive introduction to git
- A full hour-long introductory video
- More resources from GitHub available here and here
- And here's a handy cheatsheet
Think about how to write a `musical_pairs.sh` script to determine your programming partner each day. We want the script to do the following:
- Produce a (pseudo)random pairing of 6 groups of 2 people who get to work together each day on pair programming assignments
- Any one of us should be able to run the script and get the same pairing on a given day (i.e., as long as our computers agree on the year/month/day)
- It's interesting to think about how we might avoid repeated pairs from one day to the next, but for a first cut (and maybe final cut) version of the script you can ignore that issue
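As a starting point for thinking about it, here is one possible sketch, not the intended solution. It assumes a made-up roster of 12 names and uses `cksum` on the date plus each name as a sort key, so that every machine agreeing on the date produces the same order. It makes no attempt to avoid repeated pairs from one day to the next.

```shell
# Hypothetical roster: replace with the 12 real names (one word each).
roster() {
  printf '%s\n' alice bob carol dave erin frank grace heidi ivan judy mallory oscar
}

# pairs_for_date DATE: print 6 lines, each holding two names.
pairs_for_date() {
  day="$1"
  roster | while read -r name; do
    # The checksum depends only on the date and the name, so it acts as a
    # repeatable pseudorandom sort key.
    key=$(printf '%s %s' "$day" "$name" | cksum | cut -d ' ' -f 1)
    printf '%s %s\n' "$key" "$name"
  done | LC_ALL=C sort -n | cut -d ' ' -f 2 | paste -d ' ' - -
}

# Today's pairing, based on the local year-month-day.
pairs_for_date "$(date +%Y-%m-%d)"
```

Sorting by a checksum is only one way to get a repeatable shuffle; seeding a random number generator with the date would be another.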
- See this intro to the command line notebook
- Read through Lifehacker's command line primer
- See Linux Journey's shell lesson
- See this crash course for more details on commonly used commands
- Check out Software Carpentry's guide to the Unix shell
- Review this wikibook on data analysis on the command line, covering `cut`, `grep`, `wc`, `uniq`, `sort`, etc.
- Learn awk in 20 minutes
- Check out some more advanced tools for Data Science at the Command Line
- Do Codecademy's interactive command line tutorial (the free portion)
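To give a feel for how tools like `cut`, `grep`, `wc`, `uniq`, and `sort` chain together, here is a small sketch. The data is invented for illustration (a two-column `station,usertype` layout, which is not the real Citibike schema), and the `trips` function just prints it.

```shell
# Hypothetical sample data: one trip per line as "station,usertype".
trips() {
  cat <<'EOF'
station,usertype
Broadway,Subscriber
Lafayette,Customer
Broadway,Subscriber
Pershing,Subscriber
Broadway,Customer
EOF
}

# Count trips per station, most frequent first:
# drop the header, keep column 1, group identical lines, count them, sort by count.
trips | tail -n +2 | cut -d , -f 1 | sort | uniq -c | sort -rn

# Count the lines that mention "Subscriber".
trips | grep -c 'Subscriber'

# Count the data rows (every line except the header).
trips | tail -n +2 | wc -l
```

In this sample, Broadway comes out on top with 3 trips, and 3 of the 5 rows are subscriber trips. Note that `uniq -c` only merges adjacent duplicates, which is why the `sort` before it matters.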
- See these Introduction to Counting slides
- Pull changes from the msr-ds3/coursework repo:
git pull upstream master
- Use the download_trips.sh file to download Citibike trip data by running `bash download_trips.sh` or `./download_trips.sh`
- Fill in solutions under each comment in citibike.sh using the `201402-citibike-tripdata.csv` file
- Make sure to save your work and push it to GitHub. Do this in three steps:
  - `git add` and `git commit` any new files to your local repository. (Omit large data files.)
  - `git pull upstream master` to grab changes from this repository, and resolve any merge conflicts, committing the final results.
  - `git push origin master` to push things back up to your GitHub fork of the course repository.
- Finish by submitting a pull request with your solutions so we can review them! (We won't merge the request, but it's a good way for the TA to provide feedback.)
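One common way to keep large data files out of a commit is a `.gitignore` file. The sketch below is an illustration under assumptions, not necessarily how the course repository is set up: the `*.csv` pattern is a guess at what you would want to exclude, and the repository is a throwaway one created in a temporary directory.

```shell
demo=$(mktemp -d)
git init -q "$demo"

# Stand-ins for a large data file and a solution script.
printf 'tripduration,starttime\n' > "$demo/201402-citibike-tripdata.csv"
printf '#!/bin/bash\n' > "$demo/citibike.sh"

# Tell git to ignore every CSV file in the repository.
printf '*.csv\n' > "$demo/.gitignore"

# "git add ." now stages everything except the ignored CSV.
git -C "$demo" add .
git -C "$demo" status --short

# check-ignore succeeds when a path is ignored.
git -C "$demo" check-ignore -q 201402-citibike-tripdata.csv && echo "CSV is ignored"
```

Without a `.gitignore`, the alternative is to name files explicitly (for example `git add citibike.sh`) rather than adding the whole directory.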
- See the Data Wrangling in R slides
- Review intro_to_r.ipynb for an introduction to R
- See the intro and Chapters 2 and 4 of the 2nd edition of R for Data Science for background on using R and RStudio (chapter numbers correspond to the online edition)
- See also Chapter 3 of the 2nd edition for the basics of dplyr
- Use the musical pairs script to determine your programming partner each day
- Fill in solutions to the counting exercises under each comment in citibike.R
- Do the following exercises from Chapter 5 of the 1st edition of R for Data Science:
- Do the free portion of Codecademy's introduction to R, chapters 1, 2, and 3
- References:
- Basic types: (numeric, character, logical, factor)
- Vectors, lists, dataframes: a one page reference and more details
- Cyclismo's more extensive tutorial
- RStudio's data wrangling cheatsheet
- The tidyverse style guide
- Hadley Wickham's style guide
- The dplyr vignette
- Sean Anderson's dplyr and pipes examples (code on GitHub)
- Tutorials:
- DataCamp's introduction to R tutorials (or Hadley's Advanced R if you're a pro)
- DataCamp's Data Manipulation in R tutorial
- DataCamp's Introduction to the Tidyverse tutorial
- See the Data visualization slides
- Review visualization_with_ggplot2.ipynb for an introduction to data visualization with ggplot2
- Do the following exercises from Chapter 3 of the 1st edition of R for Data Science:
- Citibike plots
- Run the load_trips.R file to generate `trips.RData`
- Write code in plot_trips.R to create visualizations using `trips.RData`
- Read Chapters 1, 9, and 10 of the 2nd edition of R for Data Science on visualization
- Tutorials:
- DataCamp's Data Visualization with ggplot2 (part 1) tutorial
- References:
- RStudio's ggplot2 cheatsheet
- Sean Anderson's ggplot2 slides (code) for more examples
- The R Graphics Cookbook
- The official ggplot2 docs
- Videos on Visualizing Data with ggplot2
- Review combine_and_reshape_in_r.ipynb on joins with dplyr and reshaping with tidyr
- Finish up the Citibike plotting exercises in plot_trips.R, including the plots that involve reshaping data
- Read Chapters 5 and 19 of the 2nd edition of R for Data Science on reshaping and joins
- Do the following exercises from the 1st edition of R for Data Science:
- Read Chapter 27 of the 1st edition of R for Data Science on R Markdown
- Do the following exercises from the 1st edition of R for Data Science:
- Do part 1 of DataCamp's Cleaning Data in R tutorial
- Additional references:
- The tidyr vignette on tidy data
- The dplyr vignette on two-table verbs for joins
- A visual guide to joins