QSTEP Masterclass: Social Media


James will be delivering this all-day workshop on Tuesday the 19th of March. This repository contains preparatory details and the files used when delivering the workshop.

The workshop covers some methods for downloading, analysing and visualising social media data using the R programming language. We use the 'tidyverse' in R and (optionally) the spaCy Python module for natural language processing.

Outline

The structure of the workshop is as follows:

| Stage | Title | Detail | R package(s) |
|---|---|---|---|
| Introduction | | Overview of the day | |
| | R intro | An introduction to R | ggplot2, tidyverse |
| Collection | Scraping | Downloading and filtering HTML pages | rvest, tidyverse, magrittr, ggplot2, tibble |
| | API and data dumps | Accessing data directly using APIs | httr, jsonlite, dplyr, textclean, stringr, ggplot2, tidyverse, magrittr, tibble, twitteR, RedditExtractoR |
| Analysis | Summarising | Tidyverse-enabled summaries of our collected data | tidyverse, tidytext, dplyr, tidyr |
| | Text analysis | Applying numerical analysis to our text | tidytext, tidyverse, dplyr, stringr, RedditExtractoR, tidyr, igraph, ggraph, wordcloud, reshape2, tm, topicmodels |
| | Natural Language | Optional section using the cleanNLP package | cleanNLP, tibble, tidyverse, RedditExtractoR, reticulate |

Preparation

The preparation for the workshop is detailed here. Please follow the instructions and install the required software before or during the workshop.

In R, please run the following code to install all of the above R packages.

our_packages <- c('tidyverse', 
                  'ggplot2', 
                  'rvest', 
                  'jsonlite', 
                  'httr', 
                  'dplyr', 
                  'textclean', 
                  'stringr', 
                  'magrittr', 
                  'tibble', 
                  'twitteR', 
                  'RedditExtractoR', 
                  'tidytext', 
                  'tidyr', 
                  'igraph', 
                  'ggraph', 
                  'wordcloud', 
                  'reshape2', 
                  'tm', 
                  'topicmodels',
                  'cleanNLP',
                  'reticulate')

install.packages(our_packages)

If you receive a message asking whether to install packages that need to be compiled from source, type no and press Enter.
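If some of these packages are already installed, a minimal sketch like the one below finds only the missing ones. The helper name `missing_packages` and the short `some_packages` vector are hypothetical and not part of the workshop files; the `install.packages()` call is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the packages in `wanted` that are not yet installed.
missing_packages <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

# A short stand-in for the full vector of workshop packages.
# 'stats' ships with R, so it should never be reported as missing.
some_packages <- c('rvest', 'tidytext', 'stats')

to_install <- missing_packages(some_packages)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

In practice you would pass the full package vector to the helper instead of `some_packages`.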

Post-workshop

For further information, please see the below links:

  • R for Data Analysis. An excellent, freely available book showing how one can use R for data science. The code uses the tidyverse.
  • Text Mining with R. Another text which uses the tidyverse set of R packages.
  • Tidy Data. Journal article from 2014 detailing the idea behind 'tidy data'. Useful for understanding how the tidyverse is organised.
  • cleanNLP. A 2017 journal article introducing the cleanNLP package, which allows one to carry out natural language processing using pretrained machine learning models, with output consistent with the principles of the tidyverse. The article gives a clear account of the rationale behind the package and its approach.

In the individual sections, additional links are provided. These are collected together below for your convenience.

R Intro

Scraping

We have looked at how one can save and filter an HTML page in R. In the web page we filtered the HTML based on the classes of divs. The following might be of interest:

rvest is a little limited for social media data, but is very useful for downloading structured data like tables from web pages. For examples of this, see:

  • Wikipedia table scraping - Old but clear tutorial on scraping a Wikipedia table.
  • DataCamp community tutorial - A more advanced tutorial on using rvest to scrape data from the Trustpilot page and carry out an interesting analysis.
  • R view - Another Wikipedia scraping example with rvest.
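As a reminder of both ideas (filtering divs by class and pulling a table into R), here is a minimal sketch. It assumes rvest 1.0.0 or later, and the HTML, class names and column names are made up for illustration rather than taken from the workshop pages.

```r
library(rvest)

# A tiny, made-up page standing in for a downloaded HTML file.
page <- read_html('
  <html><body>
    <div class="post">First post</div>
    <div class="advert">Buy now</div>
    <div class="post">Second post</div>
    <table>
      <tr><th>user</th><th>likes</th></tr>
      <tr><td>ann</td><td>3</td></tr>
      <tr><td>bob</td><td>5</td></tr>
    </table>
  </body></html>
')

# Keep only the divs with class "post" and extract their text.
posts <- page %>%
  html_elements('div.post') %>%
  html_text2()

# Convert the first table on the page into a data frame (a tibble).
likes <- page %>%
  html_element('table') %>%
  html_table()

print(posts)
print(likes)
```

For a real page you would pass a URL or a saved file path to `read_html()` instead of the literal HTML string.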

If you want to scrape a page with infinite scrolling (such as Twitter) from within the browser, then the DataMiner extension for Google Chrome is useful, along with the autoscroll extension. Below are several useful resources for this extension.

Note: You will need a Google account to use this service. Be aware that this may result in DataMiner being aware of your activities.

API and Data Dumps

If you want to go further with these techniques then the following might be useful:

There are lots of different packages for collecting social media data. The vosonSML package is a front end to the rtweet and RedditExtractoR packages. It appears to have some excellent network analysis components and is well worth looking into.

In addition:

  • DMI-TCAT - A software package (not in R) which allows one to download and archive Twitter data. The data can then be analysed using a web interface. We have several TCAT servers set up at CIM.
  • NetVizz - A Facebook app for downloading data.
  • YouTube Data Tools - A free app for downloading YouTube data. You can download the data and then load it into R using the command read.csv(filename, sep = '\t'), where filename is the name of your downloaded file.
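The `read.csv(filename, sep = '\t')` call can be tried without downloading anything. The sketch below writes a small tab-separated file to a temporary location and reads it back; the column names and values are invented for illustration and do not reflect the actual YouTube Data Tools output.

```r
# Write a small made-up tab-separated file to a temporary location.
filename <- tempfile(fileext = '.tab')
writeLines(c('videoId\ttitle\tviewCount',
             'abc123\tFirst video\t100',
             'def456\tSecond video\t250'),
           filename)

# Read it back, telling read.csv that the columns are separated by tabs.
videos <- read.csv(filename, sep = '\t', stringsAsFactors = FALSE)
str(videos)
```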

Summarising

The best way to understand these tools is to use them, play with them, break them and produce something. A quick online search reveals some free R data sets, although loading some of them looks a little tricky.

You can always try the FiveThirtyEight data sets. FiveThirtyEight is a news organisation which has released all their data in the fivethirtyeight R package.
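As a starting point for experimenting, here is a minimal sketch of a grouped summary with dplyr. The data and column names (`subreddit`, `score`) are made up for illustration; the same pattern should apply to any data frame you load.

```r
library(dplyr)

# Made-up posts; the column names are illustrative only.
posts <- tibble(
  subreddit = c('rstats', 'rstats', 'datascience', 'datascience', 'datascience'),
  score     = c(10, 4, 7, 1, 12)
)

# Count the posts and average the score within each subreddit,
# then put the busiest subreddit first.
post_summary <- posts %>%
  group_by(subreddit) %>%
  summarise(n_posts = n(), mean_score = mean(score)) %>%
  arrange(desc(n_posts))

print(post_summary)
```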

Text Analysis

The authors of the Text Mining with R book chose the analyses we use in this session to demonstrate their package. These methods are still used. A nice recent review of the different packages and methods is Text Analysis in R. There are paid online courses (for example DataCamp), but I have not taken these and you can learn a lot using free resources.
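The core tidytext pattern of splitting text into one word per row, dropping stop words and counting can be sketched as follows. The two example comments are made up, and this is only an illustration of the general approach, not the exact code used in the session.

```r
library(dplyr)
library(tidytext)

# Two made-up comments standing in for collected social media text.
comments <- tibble(
  id   = 1:2,
  text = c('R makes text analysis fun', 'Text analysis in R is tidy')
)

word_counts <- comments %>%
  unnest_tokens(word, text) %>%            # one lower-cased word per row
  anti_join(stop_words, by = 'word') %>%   # drop common English words
  count(word, sort = TRUE)                 # most frequent words first

print(word_counts)
```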

Machine learning is becoming a popular technique. Models which have been pretrained on large collections of text are used to categorise text. In the next segment, we will dip our toes into this approach. An excellent text is Deep Learning with R. I heartily recommend reading through that book if you are interested in machine learning.

I have listed a few more resources below. These can be tricky, and you are best starting with the book and paper mentioned above.

Natural Language Processing

Natural language processing is an interesting area. There are lots of packages (e.g., see the CRAN natural language processing view) and cleanNLP seems to be the best for easily integrating with the tidyverse. You may find the links below useful.

  • Stanford CoreNLP - An excellent and classic library for natural language processing.
  • spaCy homepage - The Python package we used above. If you are applying the above in your own projects, then please work through this page to understand some of the nuances of the program.
  • CleanNLP article - The paper contains some analysis which we did not do above. Furthermore, the references in the paper are useful for better understanding the approach.
  • Detecting politeness - Natural language processing has the potential to identify more nuanced components of analysis, such as politeness.

In general, you should find some data and play around. Try to get a sense of what your data is and understand what it is telling you.