QSTEP Masterclass: Social Media

James will be delivering this all-day workshop on Tuesday the 19th of March. This repository contains preparatory details and the files used when delivering the workshop.

The workshop covers some methods for downloading, analysing and visualising social media data using the R programming language. We use the 'tidyverse' in R and (optionally) the spaCy Python module for natural language processing.

Outline

The structure of the workshop is as follows:

Stage	Title	Detail	R package(s)
	Introduction	Overview of the day
	R intro	An introduction to R	ggplot2, tidyverse
Collection	Scraping	Downloading and filtering HTML pages	rvest, tidyverse, magrittr, ggplot2, tibble
	API and data dumps	Accessing data directly using APIs	httr, jsonlite, dplyr, textclean, stringr, ggplot2, tidyverse, magrittr, tibble, twitteR, RedditExtractoR
Analysis	Summarising	Tidyverse enabled summaries of our collected data	tidyverse, tidytext, dplyr, tidyr
	Text analysis	Applying numerical analysis to our text	tidytext, tidyverse, dplyr, stringr, RedditExtractoR, tidyr, igraph, ggraph, wordcloud, reshape2, tm, topicmodels
	Natural Language	Optional section using the cleanNLP package	cleanNLP, tibble, tidyverse, RedditExtractoR, reticulate

Preparation

The preparation for the workshop is detailed here. Please follow the instructions and install the required software before or during the workshop.

In R, please run the following code to install all of the above R packages:

our_packages <- c('tidyverse', 
                  'ggplot2', 
                  'rvest', 
                  'jsonlite', 
                  'httr', 
                  'dplyr', 
                  'textclean', 
                  'stringr', 
                  'magrittr', 
                  'tibble', 
                  'twitteR', 
                  'RedditExtractoR', 
                  'tidytext', 
                  'tidyr', 
                  'igraph', 
                  'ggraph', 
                  'wordcloud', 
                  'reshape2', 
                  'tm', 
                  'topicmodels',
                  'cleanNLP',
                  'reticulate')

install.packages(our_packages)

If you receive a message asking whether packages should be compiled from source, type no and press Enter.
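If some of these packages are already on your machine, reinstalling everything can be slow. The following is a minimal sketch, assuming you only want to install what is missing. The helper name missing_packages and the shortened vector example_packages are hypothetical and are not part of the workshop materials.

```r
# Hypothetical helper: return the packages in `wanted` that do not appear
# in `installed` (by default, the packages installed on this machine).
missing_packages <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

# A shortened, illustrative list; in practice you would pass our_packages.
example_packages <- c('tidyverse', 'rvest', 'tidytext')

to_install <- missing_packages(example_packages)

if (length(to_install) > 0) {
  message('Missing packages: ', paste(to_install, collapse = ', '))
  # install.packages(to_install)  # uncomment to actually install them
} else {
  message('All listed packages are already installed.')
}
```

The install call is left commented out so that the snippet can be run safely to see what would be installed.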

Post-workshop

For further information, please see the below links:

R for Data Science. An excellent freely available book showing how one can use R for data science. The code uses the tidyverse.
Text Mining with R. Another text which uses the tidyverse set of R packages.
Tidy Data. Journal article from 2014 detailing the idea behind 'tidy data'. Useful for understanding how the tidyverse is organised.
cleanNLP. A 2017 journal article introducing the cleanNLP package, which allows one to carry out natural language processing using pretrained machine learning models, with output consistent with the principles of the tidyverse. The article explains the rationale of the package and its approach well.

In the individual sections, additional links are provided. These are collected together below for your convenience.

R Intro

R for data science - An excellent book introducing the tidyverse approach to data science.
Tidyverse - Web page detailing the tidyverse collection of packages and how to use them.
RStudio Cheat sheets - A whole collection of cheat sheets. The ggplot and dplyr sheets are perhaps most useful given the above.
RStudio essentials - Videos available on the RStudio page.
RStudio Essentials of Data Science - Nice collection of videos on various data science topics in R.
DataCamp tidyverse for beginners - A somewhat simpler cheat sheet which is great for those coming to the tidyverse for the first time.

Scraping

We have looked at how one can save and filter an HTML page in R. In the web page we filtered the HTML based on the classes of divs. The following might be of interest:

rvest is a little limited for social media data, but is very useful for downloading structured data like tables from web pages. For examples of this, see:

Wikipedia table scraping - Old but clear tutorial on scraping a Wikipedia table
DataCamp community tutorial - a more advanced tutorial on using rvest to scrape data from the Trustpilot page and carry out an interesting analysis.
R view - another Wikipedia scraping example with rvest
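The two ideas above, filtering divs by class and pulling a table out of a page, can be sketched with rvest as follows. The HTML is a tiny made-up page, and the class name post and the table columns are assumptions chosen for illustration. The code uses html_nodes() and html_node(); newer rvest releases (1.0 and later) also offer html_elements() and html_element() as the preferred names.

```r
library(rvest)

# A tiny, made-up page standing in for a downloaded HTML file.
html <- '<html><body>
  <div class="post">First post</div>
  <div class="advert">Buy now</div>
  <div class="post">Second post</div>
  <table>
    <tr><th>user</th><th>likes</th></tr>
    <tr><td>anna</td><td>3</td></tr>
    <tr><td>ben</td><td>5</td></tr>
  </table>
</body></html>'

page <- read_html(html)

# Keep only the divs with class "post" and extract their text.
posts <- page %>%
  html_nodes('div.post') %>%
  html_text(trim = TRUE)

# Convert the first table on the page into a data frame.
likes <- page %>%
  html_node('table') %>%
  html_table(header = TRUE)

print(posts)
print(likes)
```

For a real page, replace the html string with a URL or a path to a saved file and adjust the CSS selectors to match that page.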

If you want to scrape a page with infinite scrolling (such as Twitter) from within the browser, then the DataMiner extension for Google Chrome is useful, along with the autoscroll extension. Below are several useful resources for this extension.

Note: You will need to have a Google account to use this service. Be aware that this may result in DataMiner being aware of your activities.

API and Data Dumps

If you want to go further with these techniques then the following might be useful:

httr vignette about API access
RStudio Conference 2017 talk on accessing web APIs
Rstudio video on accessing web APIs
The rtweet package looks like a better way to access Twitter via R. However, I had not tried it before this workshop.
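As a rough illustration of the pattern these resources describe, the sketch below requests a URL with httr and parses the JSON reply with jsonlite. The helper fetch_json and the example response body are hypothetical; the helper is defined but not called, so the snippet runs without network access or API credentials.

```r
library(jsonlite)

# Hypothetical helper, not part of the workshop code: download a URL with
# httr and parse the JSON body into R objects.
fetch_json <- function(url) {
  response <- httr::GET(url)
  httr::stop_for_status(response)  # raise an R error on an HTTP 4xx/5xx reply
  body <- httr::content(response, as = 'text', encoding = 'UTF-8')
  fromJSON(body, flatten = TRUE)
}

# A made-up response body, shaped like a typical list of posts.
example_body <- '[
  {"id": 1, "author": "anna", "score": 12},
  {"id": 2, "author": "ben",  "score": 7}
]'

# fromJSON() turns an array of similar objects into a data frame.
posts <- fromJSON(example_body)
print(posts)
```

Real APIs usually also need authentication and rate limiting, which the httr vignette listed above covers.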

There are lots of different packages for collecting social media data. The vosonSML package is a front end to the rtweet and RedditExtractoR packages. It appears to have some excellent network analysis components and is most certainly worth looking into.

In addition:

DMI-TCAT - A software package (not in R) which allows one to download and archive Twitter data. The data can then be analysed using a web interface. We have several TCAT servers set up at CIM.
NetVizz - A Facebook app for downloading data.
YouTube Data Tools - A free app for downloading YouTube data. You can download the data and then load it into R using the command read.csv(filename, sep = '\t'), where filename is the name of your downloaded file.
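The read.csv call mentioned for YouTube Data Tools can be tried out without a real download. The sketch below writes a small tab-separated file to a temporary location and reads it back; the column names and values are made up and will differ from what the tool actually exports.

```r
# Write a small, made-up tab-separated file standing in for a real download.
filename <- tempfile(fileext = '.tab')
writeLines(c('videoId\ttitle\tviewCount',
             'abc123\tFirst video\t100',
             'def456\tSecond video\t250'),
           filename)

# sep = '\t' tells read.csv that columns are separated by tabs, not commas.
videos <- read.csv(filename, sep = '\t', stringsAsFactors = FALSE)

str(videos)
```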

Summarising

The tidytext section of the useful 'Text mining with R' book
A gentle guide to Tidy Statistics in R by Thomas Mock - A nice overview of the tidyverse
The tidy tools manifesto - Hadley Wickham taking a manifesto approach
Pipe section of the R for data science book - in case the pipe requires additional clarification
Data Carpentry lesson: R for social scientists - an R workshop specifically for social scientists which you may find quite approachable
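For readers who want a reminder of what a piped tidyverse summary looks like, here is a minimal sketch using dplyr. The posts data and its column names are invented for illustration and are not taken from the workshop files.

```r
library(dplyr)

# Made-up data: one row per post, with the community it came from and a score.
posts <- tibble(
  subreddit = c('rstats', 'rstats', 'datascience', 'datascience', 'datascience'),
  score     = c(10, 4, 7, 1, 1)
)

# Read the pipe as "then": take posts, then group, then summarise, then sort.
summary_table <- posts %>%
  group_by(subreddit) %>%
  summarise(n_posts = n(), mean_score = mean(score)) %>%
  arrange(desc(n_posts))

print(summary_table)
```

Each step receives the result of the previous one as its first argument, which is what lets the pipeline read from top to bottom.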

The best way to understand these tools is to use them, play with them, break them and produce something. A quick online search reveals some free R data sets, although loading some of them looks a little tricky.

You can always try the FiveThirtyEight data sets. FiveThirtyEight is a news organisation which has released its data in the fivethirtyeight R package.

Text Analysis

The authors of the Text Mining with R book chose the analysis we use in this session to demonstrate their package. These methods are still used. A nice recent review of the different packages and methods is Text Analysis in R. There are paid online courses (for example, from DataCamp), but I have not taken these and you can learn a lot using free resources.
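As a reminder of the basic tidytext workflow, the sketch below splits text into one word per row and counts the words. The two comments are made up; in the workshop the text would come from collected social media data.

```r
library(dplyr)
library(tidytext)

# Made-up comments standing in for collected social media text.
comments <- tibble(
  id   = 1:2,
  text = c('I love the tidyverse', 'We love tidy data and tidy text')
)

# unnest_tokens() lowercases the text and produces one row per word;
# count() then tallies how often each word appears.
word_counts <- comments %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

print(word_counts)
```

In a fuller analysis, common words are usually removed before counting, for example with the stop_words data set that ships with tidytext.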

Machine learning is becoming a popular technique. Models which have been pretrained on large collections of text are used to categorise text. In the next segment, we will dip our toes into this approach. An excellent text is Deep Learning with R. I heartily recommend reading through that book if you are interested in machine learning.

I have listed a few more resources below. These can be tricky and you are best going with the book and paper mentioned above.

Computer science paper reviewing sentiment analysis so far
Medium post on sentiment analysis and machine learning - Hard to read but a nice quick review
Machine learning analysis with R.

Natural Language Processing

Natural language processing is an interesting area. There are lots of packages (e.g., see the CRAN natural language processing view) and cleanNLP seems to be the best for easily integrating with the tidyverse. You may find the links below useful.

Stanford CoreNLP - An excellent and classic library for natural language processing.
spaCy homepage - The Python package we used above. If you are applying the above in your own projects, then please work through this page to understand some of the nuances of the program.
cleanNLP article - The paper contains some analysis which we did not do above. Furthermore, the references in the paper are useful for better understanding the approach.
Detecting politeness - Natural language processing has the potential to identify more nuanced components of analysis, such as politeness.

In general, you should find some data and play around. Try to get a sense of what your data is and understand what it is telling you.
