James will be delivering this all-day workshop on Tuesday the 19th of March. This repository contains preparatory details and the files used when delivering the workshop.
The workshop covers some methods for downloading, analysing and visualising social media data using the R programming language. We use the 'tidyverse' in R and (optionally) the spaCy Python module for natural language processing.
The structure of the workshop is as follows:
| Stage | Title | Detail | R package(s) |
|---|---|---|---|
| Introduction | Overview of the day | | |
| | R intro | An introduction to R | ggplot2, tidyverse |
| Collection | Scraping | Downloading and filtering HTML pages | rvest, tidyverse, magrittr, ggplot2, tibble |
| | API and data dumps | Accessing data directly using APIs | httr, jsonlite, dplyr, textclean, stringr, ggplot2, tidyverse, magrittr, tibble, twitteR, RedditExtractoR |
| Analysis | Summarising | Tidyverse-enabled summaries of our collected data | tidyverse, tidytext, dplyr, tidyr |
| | Text analysis | Applying numerical analysis to our text | tidytext, tidyverse, dplyr, stringr, RedditExtractoR, tidyr, igraph, ggraph, wordcloud, reshape2, tm, topicmodels |
| | Natural Language | Optional section using the cleanNLP package | cleanNLP, tibble, tidyverse, RedditExtractoR, reticulate |
The preparation for the workshop is detailed here. Please follow the instructions and install the required software before or during the workshop.
In R, please run the following code to install all of the above R packages.
our_packages <- c('tidyverse',
'ggplot2',
'rvest',
'jsonlite',
'httr',
'dplyr',
'textclean',
'stringr',
'magrittr',
'tibble',
'twitteR',
'RedditExtractoR',
'tidytext',
'tidyr',
'igraph',
'ggraph',
'wordcloud',
'reshape2',
'tm',
'topicmodels',
'cleanNLP',
'reticulate')
install.packages(our_packages)
If you receive a message asking whether packages should be compiled from source, type no and press Enter.
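If some of the packages are already installed, a check like the one below avoids reinstalling them. This is a minimal sketch using only base R; the helper name `find_missing` is hypothetical, and the short example vector stands in for the full `our_packages` list above.

```r
# Hypothetical helper: return the entries of `pkgs` that are not yet installed.
find_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# A short example vector; in practice you would pass `our_packages`.
example_packages <- c("stats", "tidyverse", "rvest")
to_install <- find_missing(example_packages)

# Only report (and optionally install) when something is actually missing.
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}
```

The `install.packages()` call is left commented out so that running the sketch has no side effects.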
For further information, please see the links below:
- R for Data Analysis. An excellent, freely available book showing how one can use R for data science. The code uses the tidyverse.
- Text Mining with R. Another text which uses the tidyverse set of R packages.
- Tidy Data. Journal article from 2014 detailing the idea behind 'tidy data'. Useful for understanding how the tidyverse is organised.
- cleanNLP. A 2017 journal article introducing the cleanNLP package, which allows one to carry out natural language processing using pretrained machine learning models, with output consistent with the principles of the tidyverse. This article explains the rationale of the package and its approach well.
In the individual sections, additional links are provided. These are collected together below for your convenience.
- R for data science - An excellent book introducing the tidyverse approach to data science.
- Tidyverse - Web page detailing the tidyverse collection of packages and how to use them.
- RStudio Cheat sheets - A whole collection of cheatsheets. The ggplot and dplyr sheets are perhaps most useful given the above.
- RStudio essentials - Videos available on the RStudio page.
- RStudio Essentials of Data Science - Nice collection of videos on various data science topics in R.
- DataCamp tidyverse for beginners - A somewhat simpler cheat sheet which is great for those coming to the tidyverse for the first time.
We have looked at how one can save and filter an HTML page in R. In the web page we filtered the HTML based on the classes of divs. The following might be of interest:
rvest is a little limited for social media data, but is very useful for downloading structured data like tables from web pages. For examples of this, see:
- Wikipedia table scraping - Old but clear tutorial on scraping a Wikipedia table
- DataCamp community tutorial - a more advanced tutorial on using rvest to scrape data from the Trustpilot page and carry out an interesting analysis.
- R view - another Wikipedia scraping example with rvest
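The two ideas in this section, filtering a page by the class of its divs and pulling a table into R, can be sketched as follows. This is an illustration rather than the workshop's own code: the inline HTML, class names and column names are invented so that no network access is needed, and `html_elements()` assumes rvest 1.0 or later (older versions use `html_nodes()`).

```r
library(rvest)

# A small inline page, invented for illustration.
page <- read_html('
  <html><body>
    <div class="post"><p>First post</p></div>
    <div class="advert"><p>Buy now</p></div>
    <div class="post"><p>Second post</p></div>
    <table>
      <tr><th>user</th><th>likes</th></tr>
      <tr><td>anna</td><td>3</td></tr>
      <tr><td>ben</td><td>5</td></tr>
    </table>
  </body></html>
')

# Keep only the divs with class "post" and extract their text.
posts <- page %>%
  html_elements("div.post") %>%
  html_text2()

# Convert the first table on the page into a data frame (tibble).
likes <- page %>%
  html_element("table") %>%
  html_table()

print(posts)
print(likes)
```

For a real page, `read_html()` would be given the URL or a saved file path instead of the literal HTML string.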
If you want to scrape a page with infinite scrolling (such as Twitter) from within the browser, then the DataMiner extension for Google Chrome is useful, along with the autoscroll extension. Below are several useful resources for this extension.
Note: You will need a Google account to use this service. Be aware that this may result in DataMiner being aware of your activities.
If you want to go further with these techniques then the following might be useful:
- httr vignette about API access
- RStudio Conference 2017 talk on accessing web APIs
- RStudio video on accessing web APIs
- The rtweet package looks like a better way to access Twitter via R. However, I have not tried it before this workshop.
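Most web APIs return JSON, which jsonlite can turn into a data frame. The sketch below parses a made-up JSON string rather than calling a real API, so the field names (`data`, `id`, `author`, `score`) are assumptions for illustration only. With httr, the string would typically come from something like `content(GET(url), as = "text")`.

```r
library(jsonlite)

# A made-up JSON payload standing in for the body of an API response.
response_body <- '{
  "data": [
    {"id": 1, "author": "anna", "score": 12},
    {"id": 2, "author": "ben",  "score": 7}
  ]
}'

# fromJSON() simplifies an array of objects into a data frame by default.
parsed <- fromJSON(response_body)
posts <- parsed$data

# Keep only the higher-scoring posts.
popular <- posts[posts$score > 10, ]

print(popular)
```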
There are lots of different packages for collecting social media data. The vosonSML package is a front end to the rtweet and RedditExtractoR packages. It appears to have some excellent network analysis components and is well worth looking into.
In addition:
- DMI-TCAT - A software package (not in R) which allows one to download and archive Twitter data. The data can then be analysed using a web interface. We have several TCAT servers set up at CIM.
- NetVizz - A Facebook app for downloading data.
- YouTube Data Tools - A free app for downloading YouTube data. You can download the data and then load it into R using the command read.csv(filename, sep = '\t'), where filename is the name of your downloaded file.
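As a quick check of the `read.csv(filename, sep = '\t')` command mentioned above, the sketch below writes a tiny tab-separated file to a temporary location and reads it back. The column names and values are invented and will not match a real YouTube Data Tools export.

```r
# Write a tiny tab-separated file to a temporary location.
# The column names and values are invented for illustration.
filename <- tempfile(fileext = ".tab")
writeLines(c("videoId\ttitle\tviewCount",
             "a1\tFirst video\t100",
             "b2\tSecond video\t250"),
           filename)

# sep = '\t' tells read.csv() that fields are separated by tabs, not commas.
videos <- read.csv(filename, sep = '\t', stringsAsFactors = FALSE)

# Inspect the column types that R inferred.
str(videos)
```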
- The tidytext section of the useful 'Text mining with R' book
- A gentle guide to Tidy Statistics in R by Thomas Mock - A nice overview of the tidyverse
- The tidy tools manifesto - Hadley Wickham taking a manifesto approach
- Pipe section of the R for data science book - in case the pipe requires additional clarification
- Data Carpentry lesson: R for social scientists - an R workshop specifically for social scientists which you may find quite approachable
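For readers new to the pipe, the sketch below computes the same summary twice: once with nested function calls and once with `%>%`. The data are invented for illustration and only dplyr is assumed to be installed.

```r
library(dplyr)

# Invented example data: one row per post.
posts <- tibble(
  subreddit = c("rstats", "rstats", "python", "python", "python"),
  score     = c(10, 4, 7, 1, 3)
)

# Without the pipe: nested calls are read from the inside out.
nested <- arrange(
  summarise(group_by(posts, subreddit), mean_score = mean(score)),
  desc(mean_score)
)

# With the pipe: the same steps read from top to bottom.
piped <- posts %>%
  group_by(subreddit) %>%
  summarise(mean_score = mean(score)) %>%
  arrange(desc(mean_score))

print(piped)
```

Both versions give the same result; the piped form simply lists the steps in the order they happen.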
The best way to understand these tools is to use them, play with them, break them and produce something. A quick online search reveals some free R data sets, although loading some of them looks a little tricky.
You can always try the FiveThirtyEight data sets. FiveThirtyEight is a news organisation which has released all their data in the fivethirtyeight R package.
The authors of the Text Mining with R book chose the analyses we use in this session to demonstrate their package. These methods are still used. A nice recent review of the different packages and methods is Text Analysis in R. There are paid online courses (for example, DataCamp), but I have not taken these, and you can learn a lot using free resources.
Machine learning is becoming a popular technique. Models which have been pretrained on large collections of text are used to categorise text. In the next segment, we will dip our toes into this approach. An excellent text is Deep learning with R. I heartily recommend reading through that book if you are interested in machine learning.
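The tidy text approach can be sketched in a few lines: split the text into one word per row, drop common stop words, and count what is left. The comments below are invented for illustration; `unnest_tokens()` and the `stop_words` data set come from the tidytext package.

```r
library(dplyr)
library(tidytext)

# Invented example comments.
comments <- tibble(
  id   = 1:3,
  text = c("I love this workshop",
           "This workshop is long",
           "I love R")
)

# One row per word; unnest_tokens() lower-cases the words by default.
words <- comments %>%
  unnest_tokens(word, text)

# Drop common stop words, then count the remaining words.
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

print(word_counts)
```

The resulting counts can be passed straight to ggplot2 or wordcloud, since they are an ordinary data frame.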
I have listed a few more resources below. These can be tricky and you are best going with the book and paper mentioned above.
- Computer science paper reviewing sentiment analysis so far
- Medium post on sentiment analysis and machine learning - Hard to read but a nice quick review
- Machine learning analysis with R.
Natural language processing is an interesting area. There are lots of packages (e.g., see the CRAN natural language processing view) and cleanNLP seems to be the best for easily integrating with the tidyverse. You may find the links below useful.
- Stanford CoreNLP - An excellent and classic library for natural language processing.
- spaCy homepage - The Python package we used above. If you are applying the above in your own projects then please work through this page to understand some of the nuances of the program.
- CleanNLP article - The paper contains some analysis which we did not do above. Furthermore, the references in the paper are useful for better understanding the approach.
- Detecting politeness - Natural language processing has the potential to identify more nuanced components of analysis, such as politeness.
In general, you should find some data and play around. Try to get a sense of what your data is and understand what it is telling you.