diff --git a/05-summary.qmd b/04-summary.qmd
similarity index 65%
rename from 05-summary.qmd
rename to 04-summary.qmd
index fb866676..3510b957 100644
--- a/05-summary.qmd
+++ b/04-summary.qmd
@@ -1,5 +1,7 @@
# Data Summaries {#sec-summary}
+THIS CHAPTER IS CURRENTLY UNDERGOING REVISION
+
## Intended Learning Outcomes {#sec-ilo-summary .unnumbered}
* Be able to summarise data by groups
@@ -8,51 +10,19 @@
## Walkthrough video {#sec-walkthrough-summary .unnumbered}
-There is a walkthrough video of this chapter available via [Echo360.](https://echo360.org.uk/media/6783dc21-2603-4579-bc19-5ae971638e85/public) Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.
+There is a walkthrough video of this chapter available via [Echo360](). Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.
## Set-up {#sec-setup-summary}
-First, create a new project for the work we'll do in this chapter named `r path("05-summary")`. Second, download the data for this chapter (ncod_tweets.rds) and save it in your project data folder. Finally, open and save and new R Markdown document named `summary.Rmd`, delete the welcome text and load the required packages for this chapter.
+First, create a new project for the work we'll do in this chapter named `r path("04-summary")`. Second, download the data for this chapter (ncod_tweets.rds) and save it in your project data folder. Finally, open and save a new R Markdown document named `summary.Rmd`, delete the welcome text, and load the required packages for this chapter.
```{r setup-summary, message=FALSE, verbatim="r setup, include=FALSE"}
library(tidyverse) # data wrangling functions
-library(rtweet) # for searching tweets
library(kableExtra) # for nice tables
```
Download the [Data transformation cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf).
-## Social media data
-
-In this chapter we're going to analyse social media data, specifically data from Twitter. There are two broad types of data you can obtain from Twitter; data scraped from Twitter using purpose-built packages such as rtweet, and data provided via [Twitter Analytics](https://analytics.twitter.com/) for any accounts for which you have access.
-
-**This chapter was written in late 2021 and a lot of things have changed at Twitter since then. Importantly, starting 9th February 2023, Twitter have announced that access to their API will no longer be free, which means that the search functions below will not work unless you have a paid account. We have provided the data you need for this chapter so it is not a problem for your learning and in version 3 of ADS, we will remove any reliance on Twitter data.**
-
-For this chapter, we'll use data scraped from Twitter using rtweet. In order to use these functions, you need to have a Twitter account. Don't worry if you don't have one; we'll provide the data in the examples below for you.
-
-rtweet has a lot of flexibility, for example, you can search for tweets that contain a certain hashtag or word, tweets by a specific user, or tweets that meet certain conditions like location or whether the user is verified.
-
-For the dataset for this chapter, we used the `search_tweets()` function to find the last 30K tweets with the hashtag [#NationalComingOutDay](https://en.wikipedia.org/wiki/National_Coming_Out_Day). This is mainly interesting around October 11th (the date of National Coming Out Day), so we've provided the relevant data for you that we scraped at that time.
-
-If you have a Twitter account, you can complete this chapter using your own data and any hashtag that interests you. When you run the `search_tweets()` function, you will be asked to sign in to your Twitter account.
-
-```{r, eval = FALSE}
-tweets <- search_tweets(q = "#NationalComingOutDay",
- n = 30000,
- include_rts = FALSE)
-```
-
-### R objects
-
-If you're working with live social media data, every time you run a query it's highly likely you will get a different set of data as new tweets are added. Additionally, the Twitter API places limits on how much data you can download and searches are limited to data from the last 6-9 days. Consequently, it can be useful to save the results of your initial search. `saveRDS` is a useful function that allows you to save any object in your environment to disk.
-
-```{r eval = FALSE}
-saveRDS(tweets, file = "data/ncod_tweets.rds")
-```
-
-After you run `search_tweets()` and save the results, set that code chunk to `eval = FALSE` or comment out that code so your script doesn't run the search and overwrite your saved data every time you knit it.
-
-To load an `.rds` file, you can use the `readRDS()` function. If you don't have access to a Twitter account, or to ensure that you get the same output as the rest of this chapter, you can download ncod_tweets.rds and load it using this function.
```{r}
tweets <- readRDS("data/ncod_tweets.rds")
@@ -180,28 +150,6 @@ tweet_summary <- tweets %>% # start with the object tweets and then
Notice that `summarise()` no longer needs the first argument to be the data table, it is pulled in from the pipe. The power of the pipe may not be obvious now, but it will soon prove its worth.
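
To make the equivalence concrete, here is a minimal sketch that computes the same summary with and without the pipe. The `toy_tweets` table and its values are hypothetical stand-ins for `tweets`, used only so the example runs without the chapter data.

```r
library(dplyr) # provides summarise(), tibble() and the %>% pipe

# Hypothetical toy data standing in for the real `tweets` table
toy_tweets <- tibble(favorite_count = c(0, 3, 5, 120))

# Without the pipe: the data table is the first argument
no_pipe <- summarise(toy_tweets,
                     mean_favs = mean(favorite_count),
                     median_favs = median(favorite_count))

# With the pipe: start with toy_tweets and then summarise it
with_pipe <- toy_tweets %>%
  summarise(mean_favs = mean(favorite_count),
            median_favs = median(favorite_count))

# Compare the two results
identical(no_pipe, with_pipe)
```

Both calls build the same one-row table, so the comparison should confirm that the pipe only changes how the data table is passed in, not the result.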
-### Inline coding
-
-To insert those values into the text of a report you can use inline coding. First. we'll create another set of objects that contain the first and last date of the tweets in our dataset. `format()` formats the dates to day/month/year.
-
-```{r}
-date_from <- tweet_summary$min_date %>%
- format("%d %B, %Y")
-date_to <- tweet_summary$max_date %>%
- format("%d %B, %Y")
-```
-
-Then you can insert values from these objects and the tables you created with `summarise()` using inline R (note the dollar sign notation to get the value of the `n` column from the table `tweet_summary`).
-
-```{verbatim, lang="md"}
-There were `r tweet_summary$n` tweets between `r date_from` and `r date_to`.
-```
-
-Knit your Markdown to see how the variables inside the inline code get replaced by their values.
-
-> There were `r tweet_summary$n` tweets between `r date_from` and `r date_to`.
-
-Ok, let's get back on track.
## Counting
@@ -236,32 +184,6 @@ tweets %>% count(is_quote, is_retweet)
`r mcq1`
:::
-### Inline coding 2
-
-Let's do another example of inline coding that writes up a summary of the most prolific tweeters to demonstrate a few additional functions. First, we need to create some additional objects to use with inline R:
-
-* `nrow()` simply counts the number of rows in a dataset so if you have one user/participant/customer per row, this is an easy way to do a head count.
-* `slice()` chooses a particular row of data, in this case the first row. Because we sorted our data, this will therefore be the user with the most tweets.
-* `pull()` pulls out a single variable.
-* The combination of `slice()` and `pull()` allows you to choose a single observation from a single variable.
-
-```{r}
-unique_users <- nrow(tweets_per_user)
-most_prolific <- slice(tweets_per_user, 1) %>%
- pull(screen_name)
-most_prolific_n <- slice(tweets_per_user, 1) %>%
- pull(n)
-```
-
-Then add the inline code to your report...
-
-```{verbatim, lang="md"}
-There were `r unique_users` unique accounts tweeting about #NationalComingOutDay. `r most_prolific` was the most prolific tweeter, with `r most_prolific_n` tweets.
-```
-
-...and knit your Markdown to see the output:
-
-There were `r unique_users` unique accounts tweeting about #NationalComingOutDay. `r most_prolific` was the most prolific tweeter, with `r most_prolific_n` tweets.
## Grouping {#sec-grouping}
@@ -376,54 +298,6 @@ mcq1 <- c(answer = "`tweets %>% group_by(source) %>% filter(n() >= 10)`",
`r mcq1`
:::
-### Inline coding 3
-
-There's a huge amount of data reported for each tweet, including things like the URLs of the tweets and any media attached to them. This means we can produce output like the below reproducibly and using inline coding.
-
-```{r echo = FALSE}
-orig <- filter(most_fav, !is_quote)
-quote <- filter(most_fav, is_quote)
-```
-
-The most favourited `r orig$favorite_count` original tweet was by [`r orig$screen_name`](`r orig$status_url`):
-
---------------------------------------------------
-
-> `r orig$text`
-
-![](`r orig$ext_media_url`)
-
-------------------------------------------------
-
-To produce this, first we split `most_fav`, so that we have one object that contains the data from the original tweet and one object that contains the data from the quote tweet.
-
-```{r, echo = TRUE}
-orig <- filter(most_fav,is_quote == FALSE)
-quote <- filter(most_fav,is_quote == TRUE)
-```
-
-The inline code is then as follows:
-
-```{verbatim, lang="md"}
-The most favourited `r orig$favorite_count` original tweet was by [`r orig$screen_name`](`r orig$status_url`):
-
---------------------------------------------------
-
-> `r orig$text`
-
-![](`r orig$ext_media_url`)
-
---------------------------------------------------
-```
-
-This is quite complicated so let's break it down.
-
-* The first bit of inline coding is fairly standard and is what you have used before.
-* The second bit of inline coding inserts a URL. The content of the `[]` is the text that will be displayed. The content of `()` is the underlying URL. In both cases, the content is being pulled from the dataset. In this case, the text is `screen_name` and `status_url` links to the tweet.
-* The line of dashes creates the solid line in the knitted output.
-* The `>` symbol changes the format to a block quote.
-* The image is then included using the format `![](url)`, which is an alternative method of including images in Markdown.
-
## Exercises
@@ -445,7 +319,5 @@ glossary_table(as_kable = FALSE) |>
## Further resources {#sec-resources-summary}
* [Data transformation cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf)
-* [Chapter 5: Data Transformation ](http://r4ds.had.co.nz/transform.html) in *R for Data Science*
-* [Intro to rtweet](https://docs.ropensci.org/rtweet/articles/rtweet.html)
-* [Tidy Text](https://www.tidytextmining.com/index.html)
+* [Chapter 5: Data Transformation ](https://r4ds.hadley.nz/data-transform.html) in *R for Data Science*
* [kableExtra vignettes](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)
diff --git a/06-formative.qmd b/05-formative.qmd
similarity index 100%
rename from 06-formative.qmd
rename to 05-formative.qmd
diff --git a/06-ai.qmd b/06-ai.qmd
new file mode 100644
index 00000000..646c7653
--- /dev/null
+++ b/06-ai.qmd
@@ -0,0 +1,14 @@
+# AI {#sec-ai}
+
+CONTENT COMING SOON!
+
+## Intended Learning Outcomes {#sec-ilo-ai .unnumbered}
+
+* A
+* B
+
+## Walkthrough video {#sec-walkthrough-ai .unnumbered}
+
+There is a walkthrough video of this chapter available via [Echo360](). Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.
+
+## Set-up {#sec-setup-ai}
diff --git a/12-license.qmd b/12-license.qmd
index 717edd4e..49eb10bd 100644
--- a/12-license.qmd
+++ b/12-license.qmd
@@ -8,5 +8,5 @@ Some material from this book was adapted from @reprores2 and @nordmann_2021.
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6365077.svg)](https://doi.org/10.5281/zenodo.6365077)
-Nordmann, E. & DeBruine, L. (2023) Applied Data Skills. v2.0. Retrieved from https://psyteachr.github.io/ads-v2/ doi: [10.5281/zenodo.6365077](https://doi.org/10.5281/zenodo.6365077)
+Nordmann, E. & DeBruine, L. (2024) Applied Data Skills. v3.0. Retrieved from https://psyteachr.github.io/ads-v3/ doi: [10.5281/zenodo.6365077](https://doi.org/10.5281/zenodo.6365077)
diff --git a/CITATION b/CITATION
index 838a8c5b..be71dea5 100644
--- a/CITATION
+++ b/CITATION
@@ -8,6 +8,6 @@ authors:
given-names: "Lisa"
orcid: "https://orcid.org/0000-0002-7523-5539"
title: "Applied Data Skills"
-version: 2.0
-date-released: 2023-01-01
-url: "https://github.com/psyteachr/ads-v2/"
+version: 3.0
+date-released: 2024-03-01
+url: "https://github.com/psyteachr/ads-v3/"
diff --git a/DESCRIPTION b/DESCRIPTION
index 48a56321..73817e75 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
-Package: ads-v2
+Package: ads-v3
Title: Applied Data Skills
-Version: 2.0
+Version: 3.0
Authors@R:
c(
person(
@@ -52,8 +52,7 @@ Imports:
ggbump,
waffle,
DT,
- showtext,
- spotifyr
+ showtext
Remotes:
psyteachr/glossary,
ropensci/rnaturalearthhires
diff --git a/README.md b/README.md
index 17782956..88de3865 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,16 @@
-# Applied Data Skills (v2)
+# Applied Data Skills (v3)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6365077.svg)](https://doi.org/10.5281/zenodo.6365077)
-The 2023 version of the course book for the Applied Data Skills microcredential course in the School of Psychology and Neuroscience at the University of Glasgow.
+The 2024 version of the course book for the Applied Data Skills microcredential course in the School of Psychology and Neuroscience at the University of Glasgow.
This repository contains the source files for the interactive textbook:
-Nordmann, E. & DeBruine, L. (2023). Applied Data Skills. Version 1.0. Retrieved from https://psyteachr.github.io/ads-v2/. doi: [10.5281/zenodo.6365077](https://doi.org/10.5281/zenodo.6365077)
+Nordmann, E. & DeBruine, L. (2024). Applied Data Skills. Version 3.0. Retrieved from https://psyteachr.github.io/ads-v3/. doi: [10.5281/zenodo.6365077](https://doi.org/10.5281/zenodo.6365077)
-See https://psyteachr.github.io/ads-v1/ for version 1.
+See https://psyteachr.github.io/ads-v1/ for version 1 (2022) and https://psyteachr.github.io/ads-v2/ for version 2 (2023).
diff --git a/_freeze/04-summary/execute-results/html.json b/_freeze/04-summary/execute-results/html.json
new file mode 100644
index 00000000..9f476a31
--- /dev/null
+++ b/_freeze/04-summary/execute-results/html.json
@@ -0,0 +1,16 @@
+{
+ "hash": "5a645088424fca2f2f92053aec881cfd",
+ "result": {
+ "markdown": "# Data Summaries {#sec-summary}\n\nTHIS CHAPTER IS CURRENTLY UNDERGOING REVISION\n\n## Intended Learning Outcomes {#sec-ilo-summary .unnumbered}\n\n* Be able to summarise data by groups\n* Be able to produce well-formatted tables\n* Use pipes to chain together functions\n\n## Walkthrough video {#sec-walkthrough-summary .unnumbered}\n\nThere is a walkthrough video of this chapter available via [Echo360](). Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.\n\n## Set-up {#sec-setup-summary}\n\nFirst, create a new project for the work we'll do in this chapter named 04-summary. Second, download the data for this chapter (ncod_tweets.rds) and save it in your project data folder. Finally, open and save and new R Markdown document named `summary.Rmd`, delete the welcome text and load the required packages for this chapter.\n\n\n::: {.cell layout-align=\"center\" verbatim='r setup, include=FALSE'}\n
```{r setup, include=FALSE}
\n\n```{.r .cell-code}\nlibrary(tidyverse) # data wrangling functions\nlibrary(kableExtra) # for nice tables\n```\n\n
```
\n:::\n\n\nDownload the [Data transformation cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf).\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntweets <- readRDS(\"data/ncod_tweets.rds\")\n```\n:::\n\n\n\nFirst, run `glimpse(tweets)` or click on the object in the environment to find out what information is in the downloaded data (it's a lot!). Now let's create a series of summary tables and plots with these data.\n\n## Summarise {#sec-summary-summarise}\n\nThe `summarise()` function from the dplyr package is loaded as part of the tidyverse and creates summary statistics. It creates a new table with columns that summarise the data from a larger table using summary functions. Check the [Data Transformation Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf) for various summary functions. Some common ones are: `n()`, `min()`, `max()`, `sum()`, `mean()`, and `quantile()`.\n\n::: {.callout-warning}\nIf you get the answer `NA` from a summary function, that usually means that there are missing values in the columns you were summarising. We'll discuss this more in @sec-missing-values, but you can ignore missing values for many functions by adding the argument `na.rm = TRUE`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nvalues <- c(1, 2, 4, 3, NA, 2)\nmean(values) # is NA\nmean(values, na.rm = TRUE) # removes NAs first\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n[1] 2.4\n```\n:::\n:::\n\n:::\n\nThis function can be used to answer questions like: How many tweets were there? What date range is represented in these data? What are the mean and median number of favourites per tweet? 
Let's start with a very simple example to calculate the mean, median, min, and max number of favourites (Twitter's version of a \"like\"):\n\n* The first argument that `summarise()` takes is the data table you wish to summarise, in this case the object `tweets`.\n* `summarise()` will create a new table. The column names of this new table will be the left hand-side arguments, i.e., `mean_favs`, `median_favs`, `min_favs` and `max_favs`. \n* The values of these columns are the result of the summary operation on the right hand-side.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfavourite_summary <- summarise(tweets,\n mean_favs = mean(favorite_count),\n median_favs = median(favorite_count),\n min_favs = min(favorite_count),\n max_favs = max(favorite_count))\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n`````{=html}\n
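<table>\n <thead>\n  <tr>\n   <th> mean_favs </th>\n   <th> median_favs </th>\n   <th> min_favs </th>\n   <th> max_favs </th>\n  </tr>\n </thead>\n<tbody>\n  <tr>\n   <td> 29.71732 </td>\n   <td> 3 </td>\n   <td> 0 </td>\n   <td> 22935 </td>\n  </tr>\n</tbody>\n</table>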
\n\n`````\n:::\n:::\n\n\nThe mean number of favourites is substantially higher than the median and the range is huge, suggesting there are ` r glossary(\"outlier\", \"outliers\")`. A quick histogram confirms this - most tweets have few favourites but there are a few with a lot of likes that skew the mean.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tweets, aes(x = favorite_count)) +\n geom_histogram(bins = 25) +\n scale_x_continuous(trans = \"pseudo_log\", \n breaks = c(0, 1, 10, 100, 1000, 10000))\n```\n\n::: {.cell-output-display}\n![](04-summary_files/figure-html/unnamed-chunk-5-1.png){fig-align='center' width=100%}\n:::\n:::\n\n\n::: {.callout-note}\nPlotting the logarithm of a very skewed value can often give you a better idea of what's going on. Use `scale_x_continuous(trans = \"pseudo_log\")` to include zeros on the plot (just \"log\" converts 0 to negative infinity and removes it from the plot).\n:::\n\nYou can add multiple operations to a single call to `summarise()` so let's try a few different operations. The `n()` function counts the number of rows in the data. The `created_at` column gives us the date each tweet were created, so we can use the `min()` and `max()` functions to get the range of dates. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntweet_summary <- tweets %>%\n summarise(mean_favs = mean(favorite_count),\n median_favs = quantile(favorite_count, .5),\n n = n(),\n min_date = min(created_at),\n max_date = max(created_at))\n\nglimpse(tweet_summary)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 1\nColumns: 5\n$ mean_favs 29.71732\n$ median_favs 3\n$ n 28626\n$ min_date 2021-10-10 00:10:02\n$ max_date 2021-10-12 20:12:27\n```\n:::\n:::\n\n\n::: {.callout-note}\nQuantiles are like percentiles. Use `quantile(x, .50)` to find the median (the number where 50% of values in `x` are above it and 50% are below it). 
This can be useful when you need a value like \"90% of tweets get *X* favourites or fewer\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nquantile(tweets$favorite_count, 0.90)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n90% \n 31 \n```\n:::\n:::\n\n\n:::\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n\n::: {.callout-note .try}\n* How would you find the largest number of retweets?\n \n\n* How would you calculate the mean `display_text_width`? \n \n\n:::\n\n### The $ operator\n\nWe need to take a couple of brief detours to introduce some additional coding conventions. First, let's clear up what that `$` notation is doing. The dollar sign allows you to select items from an object, such as columns from a table. The left-hand side is the object, and the right-hand side is the item. When you call a column like this, R will print all the observations in that column. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntweet_summary$mean_favs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 29.71732\n```\n:::\n:::\n\n\nIf your item has multiple observations, you can specify which ones to return using square brackets `[]` and the row number or a vector of row numbers.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntweets$source[1] # select one observation\ntweets$display_text_width[c(20,30,40)] # select multiple with c()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"Twitter for Android\"\n[1] 78 287 107\n```\n:::\n:::\n\n\n### Pipes {#sec-pipes-first}\n\nFor our second detour, let's formally introduce the pipe, that weird `%>%` symbol we've used occasionally. Pipes allow you to send the output from one function straight into another function. Specifically, they send the result of the function before `%>%` to be the first argument of the function after `%>%`. It can be useful to translate the pipe as \"**and then**\". 
It's easier to show than tell, so let's look at an example.\n\nWe could write the above code using a pipe as follows:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntweet_summary <- tweets %>% # start with the object tweets and then\n summarise(mean_favs = mean(favorite_count), #summarise it\n median_favs = median(favorite_count))\n```\n:::\n\n\nNotice that `summarise()` no longer needs the first argument to be the data table, it is pulled in from the pipe. The power of the pipe may not be obvious now, but it will soon prove its worth. \n\n\n## Counting\n\nHow many different accounts tweeted using the hashtag? Who tweeted most?\n\nYou can count categorical data with the `count()` function. Since each row is a tweet, you can count the number of rows per each different `screen_name` to get the number of tweets per user. This will give you a new table with each combination of the counted columns and a column called `n` containing the number of observations from that group. \n\nThe argument `sort = TRUE` will sort the table by `n` in descending order, whilst `head()` returns the first six lines of a data table and is a useful function to call when you have a very large dataset and just want to see the top values.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntweets_per_user <- tweets %>%\n count(screen_name, sort = TRUE)\n\nhead(tweets_per_user)\n```\n\n::: {.cell-output-display}\n
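<table>\n <thead>\n  <tr>\n   <th> screen_name </th>\n   <th> n </th>\n  </tr>\n </thead>\n<tbody>\n  <tr>\n   <td> interest_outfit </td>\n   <td> 35 </td>\n  </tr>\n  <tr>\n   <td> LeoShir2 </td>\n   <td> 33 </td>\n  </tr>\n  <tr>\n   <td> NRArchway </td>\n   <td> 32 </td>\n  </tr>\n  <tr>\n   <td> dr_stack </td>\n   <td> 32 </td>\n  </tr>\n  <tr>\n   <td> bhavna_95 </td>\n   <td> 25 </td>\n  </tr>\n  <tr>\n   <td> WipeHomophobia </td>\n   <td> 23 </td>\n  </tr>\n</tbody>\n</table>\n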
\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n::: {.callout-note .try}\nHow would you create the table of counts below? \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
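<table>\n <thead>\n  <tr>\n   <th> is_quote </th>\n   <th> is_retweet </th>\n   <th> n </th>\n  </tr>\n </thead>\n<tbody>\n  <tr>\n   <td> FALSE </td>\n   <td> FALSE </td>\n   <td> 26301 </td>\n  </tr>\n  <tr>\n   <td> TRUE </td>\n   <td> FALSE </td>\n   <td> 2325 </td>\n  </tr>\n</tbody>\n</table>\n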
\n:::\n:::\n\n\n\n\n:::\n\n\n## Grouping {#sec-grouping}\n\nYou can also create summary values by group. The combination of `group_by()` and `summarise()` is incredibly powerful, and it is also a good demonstration of why pipes are so useful.\n\nThe function `group_by()` takes an existing data table and converts it into a grouped table, where any operations that are performed on it are done \"by group\".\n\nThe first line of code creates an object named `tweets_grouped`, that groups the dataset according to whether the user is a verified user. On the surface, `tweets_grouped` doesn't look any different to the original `tweets`. However, the underlying structure has changed and so when we run `summarise()`, we now get our requested summaries for each group (in this case verified or not). \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntweets_grouped <- tweets %>%\n group_by(verified)\n\nverified <- tweets_grouped %>%\n summarise(count = n(),\n mean_favs = mean(favorite_count),\n mean_retweets = mean(retweet_count)) %>%\n ungroup()\n\nverified\n```\n\n::: {.cell-output-display}\n
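<table>\n <thead>\n  <tr>\n   <th> verified </th>\n   <th> count </th>\n   <th> mean_favs </th>\n   <th> mean_retweets </th>\n  </tr>\n </thead>\n<tbody>\n  <tr>\n   <td> FALSE </td>\n   <td> 26676 </td>\n   <td> 18.40576 </td>\n   <td> 1.825649 </td>\n  </tr>\n  <tr>\n   <td> TRUE </td>\n   <td> 1950 </td>\n   <td> 184.45949 </td>\n   <td> 21.511282 </td>\n  </tr>\n</tbody>\n</table>\n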
\n:::\n:::\n\n\n::: {.callout-warning}\nMake sure you call the `ungroup()` function when you are done with grouped functions. Failing to do this can cause all sorts of mysterious problems if you use that data table later assuming it isn't grouped.\n:::\n\nWhilst the above code is functional, it adds an unnecessary object to the environment - `tweets_grouped` is taking up space and increases the risk we'll use this grouped object by mistake. Enter... the pipe.\n\nRather than creating an intermediate object, we can use the pipe to string our code together.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nverified <- \n tweets %>% # Start with the original dataset; and then\n group_by(verified) %>% # group it; and then\n summarise(count = n(), # summarise it by those groups\n mean_favs = mean(favorite_count),\n mean_retweets = mean(retweet_count)) %>%\n ungroup()\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n::: {.callout-note .try}\n* What would you change to calculate the mean favourites and retweets by `screen_name` instead of by `verified`? \n \n\n:::\n\n### Multiple groupings\n\nYou can add multiple variables to `group_by()` to further break down your data. For example, the below gives us the number of likes and retweets broken down by verified status and the device the person was tweeting from (`source`). \n\n* Reverse the order of `verified` and `source` in `group_by()` to see how it changed the output.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nverified_source <- tweets %>%\n group_by(verified, source) %>%\n summarise(count = n(),\n mean_favs = mean(favorite_count),\n mean_retweets = mean(retweet_count)) %>%\n ungroup() %>%\n arrange(desc(count))\n\nhead(verified_source)\n```\n\n::: {.cell-output-display}\n
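<table>\n <thead>\n  <tr>\n   <th> verified </th>\n   <th> source </th>\n   <th> count </th>\n   <th> mean_favs </th>\n   <th> mean_retweets </th>\n  </tr>\n </thead>\n<tbody>\n  <tr>\n   <td> FALSE </td>\n   <td> Twitter for iPhone </td>\n   <td> 12943 </td>\n   <td> 25.40493 </td>\n   <td> 2.304643 </td>\n  </tr>\n  <tr>\n   <td> FALSE </td>\n   <td> Twitter for Android </td>\n   <td> 5807 </td>\n   <td> 11.90839 </td>\n   <td> 1.155846 </td>\n  </tr>\n  <tr>\n   <td> FALSE </td>\n   <td> Twitter Web App </td>\n   <td> 5795 </td>\n   <td> 13.54737 </td>\n   <td> 1.611217 </td>\n  </tr>\n  <tr>\n   <td> TRUE </td>\n   <td> Twitter for iPhone </td>\n   <td> 691 </td>\n   <td> 323.24457 </td>\n   <td> 29.010130 </td>\n  </tr>\n  <tr>\n   <td> TRUE </td>\n   <td> Twitter Web App </td>\n   <td> 560 </td>\n   <td> 131.44643 </td>\n   <td> 21.717857 </td>\n  </tr>\n  <tr>\n   <td> FALSE </td>\n   <td> Twitter for iPad </td>\n   <td> 374 </td>\n   <td> 13.85027 </td>\n   <td> 2.042781 </td>\n  </tr>\n</tbody>\n</table>\n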
\n:::\n:::\n\n\n::: {.callout-warning}\nYou may get the following message when using `summarise()` after `group_by()`.\n\n> `summarise()` has grouped output by 'verified'. You can override using the `.groups` argument.\n\nTidyverse recently added a message to remind you whether the `summarise()` function automatically ungroups grouped data or not (it may do different things depending on how it's used). You can set the argument `.groups` to \"drop\", \"drop_last\", \"keep\", or \"rowwise\" (see the help for `?summarise`), but it's good practice to explicitly use `ungroup()` when you're done working by groups, regardless. \n:::\n\n### Filter and mutate\n\nYou can also use additional functions like `filter()` or `mutate()` after `group_by`. You'll learn more about these in @sec-wrangle but briefly:\n\n* `filter()` keeps observations (rows) according to specified criteria, e.g., all values above 5, or all verified users.\n* `mutate()` creates new variables (columns), or overwrites existing ones.\n\nYou can combine functions like this to get detailed insights into your data. For example, what were the most favourited original and quoted tweets? \n\n* The variable `is_quote` tells us whether the tweet in question was an original tweet or a quote tweet. Because we want our output to treat these separately, we pass this variable to `group_by()`. \n* We want the most favourited tweets, i.e., the maximum value of `favourite_count`, so we can use `filter()` to only return rows where `favourite_count` is equal to the maximum value in the variable `favourite_count`. 
Note the use of `==` rather than a single `=`.\n* Just in case there was a tie, choose a random one with `sample_n(size = 1)`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmost_fav <- tweets %>%\n group_by(is_quote) %>%\n filter(favorite_count == max(favorite_count)) %>%\n sample_n(size = 1) %>%\n ungroup()\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n::: {.callout-note .try}\n* How would you limit the results to sources with 10 or more rows?\n \n\n:::\n\n\n## Exercises\n\nThat was an intensive chapter! Take a break and then try one (or more) of the following and post your knitted HTML files on Teams so that other learners on the course can see what you did.\n\n* If you have your own Twitter account, conduct a similar analysis of a different hashtag.\n* Look through the rest of the variables in `tweets`; what other insights can you generate about this data?\n* Read through the [kableExtra](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html) vignettes and apply your own preferred table style.\n* Work through the first few chapters of [Tidy Text](https://www.tidytextmining.com/index.html){target=\"_blank\"} to see how you can work with and analyse text. In particular, see if you can conduct a sentiment analysis on the tweet data.\n\n## Glossary {#sec-glossary-summary}\n\n\n::: {.cell layout-align=\"center\"}\n
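<table>\n <thead>\n  <tr>\n   <th> term </th>\n   <th> definition </th>\n  </tr>\n </thead>\n<tbody>\n  <tr>\n   <td> mean </td>\n   <td>  </td>\n  </tr>\n  <tr>\n   <td> median </td>\n   <td>  </td>\n  </tr>\n  <tr>\n   <td> pipe </td>\n   <td>  </td>\n  </tr>\n  <tr>\n   <td> quantile </td>\n   <td>  </td>\n  </tr>\n  <tr>\n   <td> vector </td>\n   <td>  </td>\n  </tr>\n</tbody>\n</table>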
\n:::\n\n\n## Further resources {#sec-resources-summary}\n\n* [Data transformation cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf)\n* [Chapter 5: Data Transformation ](https://r4ds.hadley.nz/data-transform.html) in *R for Data Science*\n* [kableExtra vignettes](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)\n",
+ "supporting": [
+ "04-summary_files"
+ ],
+ "filters": [
+ "rmarkdown/pagebreak.lua"
+ ],
+ "includes": {},
+ "engineDependencies": {},
+ "preserve": {},
+ "postProcess": true
+ }
+}
\ No newline at end of file
diff --git a/docs/05-summary_files/figure-html/unnamed-chunk-7-1.png b/_freeze/04-summary/figure-html/unnamed-chunk-5-1.png
similarity index 100%
rename from docs/05-summary_files/figure-html/unnamed-chunk-7-1.png
rename to _freeze/04-summary/figure-html/unnamed-chunk-5-1.png
diff --git a/_freeze/app-import/execute-results/html.json b/_freeze/app-import/execute-results/html.json
new file mode 100644
index 00000000..578b7d03
--- /dev/null
+++ b/_freeze/app-import/execute-results/html.json
@@ -0,0 +1,16 @@
+{
+ "hash": "0e9878bc47de60b639e5482038e8e976",
+ "result": {
+ "markdown": "# Data Import {#sec-data}\n\nTHIS APPENDIX IS BEING UPDATED...\n\n## Intended Learning Outcomes {#sec-ilo-data .unnumbered}\n\n* Be able to inspect data\n* Be able to import data from a range of sources\n* Be able to identify and handle common problems with data import\n\n## Walkthrough video {#sec-walkthrough-data .unnumbered}\n\nThere is a walkthrough video of this chapter available via [Echo360.](https://echo360.org.uk/media/52d2249e-a737-42b4-bf55-267e39fc05c5/public) Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.\n\n## Set-up {#sec-setup-data}\n\nCreate a new project for the work we'll do in this chapter named 04-data. Then, create and save a new R Markdown document named `data.Rmd`, get rid of the default template text, and load the packages in the set-up code chunk. You should have all of these packages installed already, but if you get the message `Error in library(x) : there is no package called ‘x’`, please refer to @sec-install-package.\n\n\n::: {.cell layout-align=\"center\" verbatim='r setup, include=FALSE'}\n
```{r setup, include=FALSE}
\n\n```{.r .cell-code}\nlibrary(tidyverse) # includes readr & tibble\nlibrary(rio) # for almost any data import/export\nlibrary(haven) # for SPSS, Stata,and SAS files\nlibrary(readxl) # for Excel files\nlibrary(googlesheets4) # for Google Sheets\n```\n\n
```
\n:::\n\n\nWe'd recommend making a new code chunk for each different activity, and using the white space to make notes on any errors you make, things you find interesting, or questions you'd like to ask the course team.\n\nDownload the [Data import cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf).\n\n## Built-in data {#sec-builtin}\n\nYou'll likely want to import your own data to work with; however, base R also comes with built-in datasets and these can be very useful for learning new functions and packages. Additionally, some packages, like tidyr, also contain data. The `data()` function lists the datasets available.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# list datasets built in to base R\ndata()\n\n# lists datasets in a specific package\ndata(package = \"tidyr\")\n```\n:::\n\n\nType the name of a dataset into the console to see the data, or type `?` followed by the name to see its help page. For example, type `?table1` into the console to see the dataset description for `table1`, which is a dataset included with tidyr.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n?table1\n```\n:::\n\n\nYou can also use the `data()` function to load a dataset into your global environment.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# loads table1 into the environment\ndata(\"table1\")\n```\n:::\n\n\n\n## Looking at data\n\nNow that you've loaded some data, look at the upper right-hand window of RStudio, under the Environment tab. You will see the object `table1` listed, along with the number of observations (rows) and variables (columns). This is your first check that everything went OK.\n\n**Always, always, always, look at your data once you've created or loaded a table**. Also look at it after each step that transforms your table. There are three main ways to look at your table: `View()`, `print()`, `tibble::glimpse()`. 
\n\n### View() \n\nA familiar way to look at the table is given by `View()` (uppercase 'V'), which opens up a data table in the console pane using a viewer that looks a bit like Excel. This command can be useful in the console, but don't ever put this one in a script because it will create an annoying pop-up window when the user runs it. You can also click on an object in the environment pane to open it in the same interface. You can close the tab when you're done looking at it; it won't remove the object.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nView(table1)\n```\n:::\n\n\n\n### print() \n\nThe `print()` method can be run explicitly, but is more commonly called by just typing the variable name on a blank line. The default is not to print the entire table, but just the first 10 rows. \n\nLet's look at the `table1` table that we loaded above. Depending on how wide your screen is, you might need to click on an arrow at the right of the table to see the last column. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# call print explicitly\nprint(table1)\n\n# more common method of just calling object name\ntable1\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n:::\n:::\n\n\n### glimpse() \n\nThe function `tibble::glimpse()` gives a sideways version of the table. This is useful if the table is very wide and you can't easily see all of the columns. It also tells you the data type of each column in angled brackets after each column name. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(table1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 6\nColumns: 4\n$ country \"Afghanistan\", \"Afghanistan\", \"Brazil\", \"Brazil\", \"China\", …\n$ year 1999, 2000, 1999, 2000, 1999, 2000\n$ cases 745, 2666, 37737, 80488, 212258, 213766\n$ population 19987071, 20595360, 172006362, 174504898, 1272915272, 12804…\n```\n:::\n:::\n\n\n### summary() {#sec-summary-function}\n\nYou can get a quick summary of a dataset with the `summary()` function, which can be useful for spotting things like if the minimum or maximum values are clearly wrong, or if R thinks that a nominal variable is numeric. For example, if you had labelled gender as 1, 2, and 3 rather than male, female, and non-binary, `summary()` would calculate a mean and median even though this isn't appropriate for the data. This can be a useful flag that you need to take further steps to correct your data. \n\nNote that because `population` is a very, very large number, R will use [scientific notation](https://courses.lumenlearning.com/waymakerintermediatealgebra/chapter/read-writing-scientific-notation-2/). \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary(table1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n country year cases population \n Length:6 Min. :1999 Min. : 745 Min. :1.999e+07 \n Class :character 1st Qu.:1999 1st Qu.: 11434 1st Qu.:5.845e+07 \n Mode :character Median :2000 Median : 59112 Median :1.733e+08 \n Mean :2000 Mean : 91277 Mean :4.901e+08 \n 3rd Qu.:2000 3rd Qu.:179316 3rd Qu.:9.983e+08 \n Max. :2000 Max. :213766 Max. 
:1.280e+09 \n```\n:::\n:::\n\n\n\n## Importing data {#sec-import_data}\n\nBuilt-in data are nice for examples, but you're probably more interested in your own data. There are many different types of files that you might work with when doing data analysis. These different file types are usually distinguished by the three-letter extension following a period at the end of the file name (e.g., `.xls`). \n\nDownload this [directory of data files](data/data.zip), unzip the folder, and save the `data` directory in the `04-data` project directory.\n\n\n\n\n\n\n### rio::import() \n\nThe type of data files you have to work with will likely depend on the software that you typically use in your workflow. The rio package has very straightforward functions for reading and saving data in most common formats: `rio::import()` and `rio::export()`. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_tsv <- import(\"data/demo.tsv\") # tab-separated values\ndemo_csv <- import(\"data/demo.csv\") # comma-separated values\ndemo_xls <- import(\"data/demo.xlsx\") # Excel format\ndemo_sav <- import(\"data/demo.sav\") # SPSS format\n```\n:::\n\n\n\n### File type specific import \n\nHowever, it is also useful to know the specific functions that are used to import different file types because it is easier to discover features to deal with complicated cases, such as when you need to skip rows, rename columns, or choose which Excel sheet to use.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_tsv <- readr::read_tsv(\"data/demo.tsv\")\ndemo_csv <- readr::read_csv(\"data/demo.csv\")\ndemo_xls <- readxl::read_excel(\"data/demo.xlsx\")\ndemo_sav <- haven::read_sav(\"data/demo.sav\")\n```\n:::\n\n\n::: {.callout-note .try}\nLook at the help for each function above and read through the Arguments section to see how you can customise import.\n:::\n\nIf you keep data in Google Sheets, you can access it directly from R using [googlesheets4](https://googlesheets4.tidyverse.org/). 
The code below imports data from a [public sheet](https://docs.google.com/spreadsheets/d/16dkq0YL0J7fyAwT1pdgj1bNNrheckAU_2-DKuuM6aGI){target=\"_blank\"}. You can set the `ss` argument to the entire URL for the target sheet, or just the section after \"https://docs.google.com/spreadsheets/d/\".\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngs4_deauth() # skip authorisation for public data\n\ndemo_gs4 <- googlesheets4::read_sheet(\n ss = \"16dkq0YL0J7fyAwT1pdgj1bNNrheckAU_2-DKuuM6aGI\"\n)\n```\n:::\n\n\n\n### Column data types {#sec-col_types}\n\nUse `glimpse()` to see how these different functions imported the data with slightly different data types. This is because the different file types store data slightly differently. For example, SPSS stores factors as numbers, so the `factor` column contains the values 1, 2, 3 rather than `low`, `med`, `high`. It also stores logical values as 0 and 1 instead of TRUE and FALSE.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(demo_csv)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 6\nColumns: 6\n$ character \"A\", \"B\", \"C\", \"D\", \"E\", \"F\"\n$ factor \"high\", \"low\", \"med\", \"high\", \"low\", \"med\"\n$ integer 1, 2, 3, 4, 5, 6\n$ double 1.5, 2.5, 3.5, 4.5, 5.5, 6.5\n$ logical TRUE, TRUE, FALSE, FALSE, NA, TRUE\n$ date 2024-02-15, 2024-02-14, 2024-02-13, 2024-02-12, 2024-02-11, …\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(demo_xls)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 6\nColumns: 6\n$ character \"A\", \"B\", \"C\", \"D\", \"E\", \"F\"\n$ factor \"high\", \"low\", \"med\", \"high\", \"low\", \"med\"\n$ integer 1, 2, 3, 4, 5, 6\n$ double 1.5, 2.5, 3.5, 4.5, 5.5, 6.5\n$ logical TRUE, TRUE, FALSE, FALSE, NA, TRUE\n$ date 2024-02-15, 2024-02-14, 2024-02-13, 2024-02-12, 2024-02-11, …\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r 
.cell-code}\nglimpse(demo_sav)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 6\nColumns: 6\n$ character \"A\", \"B\", \"C\", \"D\", \"E\", \"F\"\n$ factor 3, 1, 2, 3, 1, 2\n$ integer 1, 2, 3, 4, 5, 6\n$ double 1.5, 2.5, 3.5, 4.5, 5.5, 6.5\n$ logical 1, 1, 0, 0, NA, 1\n$ date 2024-02-15, 2024-02-14, 2024-02-13, 2024-02-12, 2024-02-11, …\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nglimpse(demo_gs4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 6\nColumns: 6\n$ character \"A\", \"B\", \"C\", \"D\", \"E\", \"F\"\n$ factor \"high\", \"low\", \"med\", \"high\", \"low\", \"med\"\n$ integer 1, 2, 3, 4, 5, 6\n$ double 1.5, 2.5, 3.5, 4.5, 5.5, 6.5\n$ logical TRUE, TRUE, FALSE, FALSE, NA, TRUE\n$ date 2021-11-22, 2021-11-21, 2021-11-20, 2021-11-19, 2021-11-18, …\n```\n:::\n:::\n\n\nThe readr functions display a message when you import data explaining what data type each column is.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo <- readr::read_csv(\"data/demo.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 6 Columns: 6\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (2): character, factor\ndbl (2): integer, double\nlgl (1): logical\ndate (1): date\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\nThe \"Column specification\" tells you which data type each column is. You can review data types in @sec-data-types. Options are:\n\n* `chr`: character\n* `dbl`: double\n* `lgl`: logical\n* `int`: integer\n* `date`: date\n* `dttm`: date/time\n\n`read_csv()` will guess what type of data each variable is and normally it is pretty good at this. However, if it makes a mistake, such as reading the \"date\" column as a character, you can manually set the column data types. 
\n\nFirst, run `spec()` on the dataset which will give you the full column specification that you can copy and paste:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nspec(demo)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ncols(\n character = col_character(),\n factor = col_character(),\n integer = col_double(),\n double = col_double(),\n logical = col_logical(),\n date = col_date(format = \"\")\n)\n```\n:::\n:::\n\n\nThen, we create an object using the code we just copied that lists the correct column types. Factor columns will always import as character data types, so you have to set their data type manually with `col_factor()` and set the order of levels with the `levels` argument. Otherwise, the order defaults to the order they appear in the dataset. For our `demo` dataset, we will tell R that the `factor` variable is a factor by using `col_factor()` and we can also specify the order of the levels so that they don't just appear alphabetically. Additionally, we can also specify exactly what format our `date` variable is in using `%Y-%m-%d`.\n\nWe then save this column specification to an object, and then add this to the `col_types` argument when we call `read_csv()`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncorrected_cols <- cols(\n character = col_character(),\n factor = col_factor(levels = c(\"low\", \"med\", \"high\")),\n integer = col_integer(),\n double = col_double(),\n logical = col_logical(),\n date = col_date(format = \"%Y-%m-%d\")\n)\n\ndemo <- readr::read_csv(\"data/demo.csv\", col_types = corrected_cols)\n```\n:::\n\n\n::: {.callout-note}\nFor dates, you might need to set the format your dates are in. See `?strptime` for a list of the codes used to represent different date formats. For example, `\"%d-%b-%y\"` means that the dates are formatted like `31-Jan-21`. \n:::\n\nThe functions from readxl for loading `.xlsx` sheets have a different, more limited way to specify the column types. 
You will have to convert factor columns and dates using `mutate()`, which you'll learn about in @sec-wrangle, so most people let `read_excel()` guess data types and don't set the `col_types` argument.\n\nFor SPSS data, whilst `rio::import()` will just read the numeric values of factors and not their labels, the function `read_sav()` from haven reads both. However, you have to convert factors from a haven-specific \"labelled double\" to a factor (we have no idea why haven doesn't do this for you).\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndemo_sav$factor <- haven::as_factor(demo_sav$factor)\n\nglimpse(demo_sav)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 6\nColumns: 6\n$ character \"A\", \"B\", \"C\", \"D\", \"E\", \"F\"\n$ factor high, low, med, high, low, med\n$ integer 1, 2, 3, 4, 5, 6\n$ double 1.5, 2.5, 3.5, 4.5, 5.5, 6.5\n$ logical 1, 1, 0, 0, NA, 1\n$ date 2024-02-15, 2024-02-14, 2024-02-13, 2024-02-12, 2024-02-11, …\n```\n:::\n:::\n\n\n\n::: {.callout-note}\nThe way you specify column types for googlesheets4 is a little different from readr, although you can also use the shortcodes described in the help for `read_sheet()` with readr functions. There is currently no column specification for factors.\n:::\n\n## Creating data \n\nIf you need to create a small data table from scratch in R, use the `tibble::tibble()` function, and type the data right in. The `tibble` package is part of the tidyverse package that we loaded at the start of this chapter. \n\nLet's create a small table with the names of three [Avatar](https://en.wikipedia.org/wiki/Avatar:_The_Last_Airbender) characters and their bending type. The `tibble()` function takes arguments with the names that you want your columns to have. The values are vectors that list the column values in order.\n\nIf you don't know the value for one of the cells, you can enter NA, which we have to do for Sokka because he doesn't have any bending ability. 
If all the values in the column are the same, you can just enter one value and it will be copied for each row.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\navatar <- tibble(\n name = c(\"Katara\", \"Toph\", \"Sokka\"),\n bends = c(\"water\", \"earth\", NA),\n friendly = TRUE\n)\n\n# print it\navatar\n```\n\n::: {.cell-output-display}\n
\n:::\n:::\n\n\nYou can also use the `tibble::tribble()` function to create a table by row, rather than by column. You start by listing the column names, each preceded by a tilde (`~`), then you list the values for each column, row by row, separated by commas (don't forget a comma at the end of each row except the last).\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\navatar_by_row <- tribble(\n ~name, ~bends, ~friendly,\n \"Katara\", \"water\", TRUE,\n \"Toph\", \"earth\", TRUE,\n \"Sokka\", NA, TRUE\n)\n```\n:::\n\n\n::: {.callout-note}\nYou don't have to line up the columns in a tribble, but it can make it easier to spot errors.\n:::\n\nYou may not need to do this very often if you are primarily working with data that you import from spreadsheets, but it is useful to know how to do it anyway.\n\n## Writing data\n\nIf you have data that you want to save, use `rio::export()`, as follows.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nexport(avatar, \"data/avatar.csv\")\n```\n:::\n\n\nThis will save the data in CSV format to the `data` folder in your working directory.\n\nWriting to Google Sheets is a little trickier (if you never use Google Sheets feel free to skip this section). Even if a Google Sheet is publicly editable, you can't add data to it without authorising your account. \n\nYou can authorise interactively using the following code (and your own email), which will prompt you to authorise \"Tidyverse API Packages\" the first time you do this. 
If you don't tick the checkbox authorising it to \"See, edit, create, and delete all your Google Sheets spreadsheets\", the next steps will fail.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# authorise your account \n# this only needs to be done once per script\ngs4_auth(email = \"myemail@gmail.com\")\n\n# create a new sheet\nsheet_id <- gs4_create(name = \"demo-file\", \n sheets = \"letters\")\n\n# define the data table to save\nletter_data <- tibble(\n character = LETTERS[1:5],\n integer = 1:5,\n double = c(1.1, 2.2, 3.3, 4.4, 5.5),\n logical = c(T, F, T, F, T),\n date = lubridate::today()\n)\n\nwrite_sheet(data = letter_data, \n ss = sheet_id, \n sheet = \"letters\")\n\n## append some data\nnew_data <- tibble(\n character = \"F\",\n integer = 6L,\n double = 6.6,\n logical = FALSE,\n date = lubridate::today()\n)\nsheet_append(data = new_data,\n ss = sheet_id,\n sheet = \"letters\")\n\n# read the data\ndemo <- read_sheet(ss = sheet_id, sheet = \"letters\")\n```\n:::\n\n\n\n::: {.callout-note .try}\n* Create a new table called `family` with the first name, last name, and age of your family members (biological, adopted, or chosen). \n* Save it to a CSV file called \"family.csv\". \n* Clear the object from your environment by restarting R or with the code `remove(family)`.\n* Load the data back in and view it.\n\n\n::: {.cell layout-align=\"center\" webex.hide='Solution'}\n::: {.callout-note collapse='true'}\n## Solution\n\n```{.r .cell-code}\n# create the table\nfamily <- tribble(\n ~first_name, ~last_name, ~age,\n \"Lisa\", \"DeBruine\", 45,\n \"Robbie\", \"Jones\", 14\n)\n\n# save the data to CSV\nexport(family, \"data/family.csv\")\n\n# remove the object from the environment\nremove(family)\n\n# load the data\nfamily <- import(\"data/family.csv\")\n```\n\n:::\n:::\n\n:::\n\nWe'll be working with tabular data a lot in this class, but tabular data is made up of vectors, which groups together data with the same basic data type. 
@sec-data-types explains some of this terminology to help you understand the functions we'll be learning to process and analyse data.\n\n\n## Troubleshooting\n\nWhat if you import some data and it guesses the wrong column type? The most common reason is that a numeric column has some non-numbers in it somewhere. Maybe someone wrote a note in an otherwise numeric column. Columns have to be all one data type, so if there are any characters, the whole column is converted to character strings, and numbers like `1.2` get represented as `\"1.2\"`, which will cause surprising results such as `\"100\" < \"9\"` evaluating to `TRUE`. You can catch this by using `glimpse()` to check your data.\n\nThe data directory you downloaded contains a file called \"mess.csv\". Let's try loading this dataset.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmess <- rio::import(\"data/mess.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in (function (input = \"\", file = NULL, text = NULL, cmd = NULL, :\nStopped early on line 5. Expected 7 fields but found 0. Consider fill=TRUE and\ncomment.char=. First discarded non-empty line: <>\n```\n:::\n:::\n\n\nWhen importing goes wrong, it's often easier to fix it using the specific importing function for that file type (e.g., use `read_csv()` rather than `rio::import()`). This is because the problems tend to be specific to the file format and you can look up the help for these functions more easily. 
For CSV files, the import function is `readr::read_csv`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# lazy = FALSE loads the data right away so you can see error messages\n# this default changed in late 2021 and might change back soon\nmess <- read_csv(\"data/mess.csv\", lazy = FALSE)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: One or more parsing issues, call `problems()` on your data frame for details,\ne.g.:\n dat <- vroom(...)\n problems(dat)\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 27 Columns: 1\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (1): This is my messy dataset\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\nYou'll get a warning about parsing issues and the data table is just a single column. View the file `data/mess.csv` by clicking on it in the File pane, and choosing \"View File\". Here are the first 10 lines. What went wrong?\n\n
\n\nFirst, the file starts with a note: \"This is my messy dataset\" and then a blank line. The first line of data should be the column headings, so we want to skip the first two lines. You can do this with the argument `skip` in `read_csv()`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmess <- read_csv(\"data/mess.csv\", \n skip = 2,\n lazy = FALSE)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 26 Columns: 7\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (6): junk, order, letter, good, min_max, date\ndbl (1): score\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n```{.r .cell-code}\nglimpse(mess)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 26\nColumns: 7\n$ junk \"junk\", \"junk\", \"junk\", \"junk\", \"junk\", \"junk\", \"junk\", \"junk\"…\n$ order \"1\", \"missing\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\", \"10\", \"11\",…\n$ score -1.00, 0.72, -0.62, 2.03, NA, 0.99, 0.03, 0.67, 0.57, 0.90, -1…\n$ letter \"a\", \"b\", \"c\", \"d\", \"e\", \"f\", \"g\", \"h\", \"i\", \"j\", \"k\", \"l\", \"m…\n$ good \"1\", \"1\", \"FALSE\", \"T\", \"1\", \"0\", \"T\", \"TRUE\", \"1\", \"T\", \"F\", …\n$ min_max \"1 - 2\", \"2 - 3\", \"3 - 4\", \"4 - 5\", \"5 - 6\", \"6 - 7\", \"7 - 8\",…\n$ date \"2020-01-1\", \"2020-01-2\", \"2020-01-3\", \"2020-01-4\", \"2020-01-5…\n```\n:::\n:::\n\n\nOK, that's a little better, but this table is still a serious mess in several ways:\n\n* `junk` is a column that we don't need\n* `order` should be an integer column\n* `good` should be a logical column\n* `good` uses all kinds of different ways to record TRUE and FALSE values\n* `min_max` contains two pieces of numeric information, but is a character column\n* `date` should be a date column\n\nWe'll learn how to deal with this mess in @sec-tidy and @sec-wrangle, 
but we can fix a few things by setting the `col_types` argument in `read_csv()` to specify the column types for the columns that were guessed wrong and skip the \"junk\" column. The argument `col_types` takes a list where the name of each item in the list is a column name and the value is from the table below. You can use the function, like `col_double()`, or the abbreviation, like `\"d\"`; for consistency with earlier in this chapter we will use the function names. Omitted column names are guessed.\n\n| function | abbreviation | type |\n|:---------|:--------------|:-----|\n| col_logical() | l | logical values |\n| col_integer() | i | integer values |\n| col_double() | d | numeric values |\n| col_character() | c | strings |\n| col_factor(levels, ordered) | f | a fixed set of values |\n| col_date(format = \"\") | D | with the locale's date_format |\n| col_time(format = \"\") | t | with the locale's time_format |\n| col_datetime(format = \"\") | T | ISO8601 date time |\n| col_number() | n | numbers containing the grouping_mark |\n| col_skip() | _, - | don't import this column |\n| col_guess() | ? | parse using the \"best\" type based on the input |\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# omitted values are guessed\n# ?col_date for format options\nct <- cols(\n junk = col_skip(), # skip this column\n order = col_integer(),\n good = col_logical(),\n date = col_date(format = \"%Y-%m-%d\")\n)\n\ntidier <- read_csv(\"data/mess.csv\", \n skip = 2,\n col_types = ct,\n lazy = FALSE)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: One or more parsing issues, call `problems()` on your data frame for details,\ne.g.:\n dat <- vroom(...)\n problems(dat)\n```\n:::\n:::\n\n\nYou will get a message about parsing issues when you run this that tells you to run the `problems()` function to see a table of the problems. 
Warnings look scary at first, but always start by reading the message.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nproblems()\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n:::\n:::\n\n\n\nThe output of `problems()` tells you what row (3) and column (2) the error was found in, what kind of data was expected (an integer), and what the actual value was (missing). If you specifically tell `read_csv()` to import a column as an integer, any characters (i.e., not numbers) in the column will produce a warning like this and then be recorded as `NA`. You can manually set what missing values are recorded as with the `na` argument.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntidiest <- read_csv(\"data/mess.csv\", \n skip = 2,\n na = \"missing\",\n col_types = ct,\n lazy = FALSE)\n```\n:::\n\n\n\nNow `order` is an integer variable where any empty cells contain `NA`. The variable `good` is a logical value, where `0` and `F` are converted to `FALSE`, while `1` and `T` are converted to `TRUE`. The variable `date` is a date type (adding leading zeros to the day). We'll learn in later chapters how to fix other problems, such as the `min_max` column containing two different types of data.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n
\n:::\n:::\n\n\n\n## Working with real data\n\nIt's worth highlighting at this point that working with real data can be difficult because each dataset can be messy in its own way. Throughout this course we will show you common errors and how to fix them, but be prepared that when you start working with your own data, you'll likely come across problems we don't cover in the course and that's just part of the joy of learning programming. You'll also get better at looking up solutions using sites like [Stack Overflow](https://stackoverflow.com/) and there's a fantastic [#rstats](https://twitter.com/hashtag/rstats) community on Twitter you can ask for help.\n\nYou may also be tempted to fix messy datasets by, for example, opening up Excel and editing them there. Whilst this might seem easier in the short term, there are two serious issues with doing this. First, you will likely work with datasets that have recurring messy problems. By taking the time to solve these problems with code, you can apply the same solutions to a large number of future datasets so it's more efficient in the long run. Second, if you edit the spreadsheet, there's no record of what you did. By solving these problems with code, you do so reproducibly and you don't edit the original data file. This means that if you make an error, you haven't lost the original data and can recover.\n\n## Exercises\n\nFor the final step in this chapter, we will create a report using one of the in-built datasets to practice the skills you have used so far. You may need to refer back to previous chapters to help you complete these exercises and you may also want to take a break before you work through this section. We'd also recommend you knit at every step so that you can see how your output changes.\n\n### New Markdown {#sec-exercises-new-rmd-4}\n\nCreate and save a new R Markdown document named `starwars_report.Rmd`. 
In the set-up code chunk, load the packages `tidyverse` and `rio`.\n\nWe're going to use the built-in `starwars` dataset that contains data about Star Wars characters. You can learn more about the dataset by running `?starwars` in the console.\n\n### Import and export the dataset {#sec-exercises-load}\n\n* First, load the in-built dataset into the environment. Type and run the code to do this in the console; do not save it in your Markdown. \n* Then, export the dataset to a .csv file and save it in your `data` directory. Again, do this in the console.\n* Finally, import this version of the dataset using `read_csv()` to an object named `starwars` - you can put this code in your Markdown.\n\n\n
\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(starwars)\nexport(starwars, \"data/starwars.csv\")\nstarwars <- read_csv(\"data/starwars.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 87 Columns: 14\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (11): name, hair_color, skin_color, eye_color, sex, gender, homeworld, s...\ndbl (3): height, mass, birth_year\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n:::\n\n\n\n
\n\n\n### Convert column types\n\n* Check the column specification of `starwars`.\n* Create a new column specification that lists the following columns as factors: `hair_color`, `skin_color`, `eye_color`, `sex`, `gender`, `homeworld`, and `species` and skips the following columns: `films`, `vehicles`, and `starships` (this is because these columns contain multiple values and are stored as lists, which we haven't covered how to work with). You do not have to set the factor orders (although you can if you wish).\n* Re-import the dataset, this time with the corrected column types.\n\n\n
\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nspec(starwars)\ncorrected_cols <- cols(\n name = col_character(),\n height = col_double(),\n mass = col_double(),\n hair_color = col_factor(),\n skin_color = col_factor(),\n eye_color = col_factor(),\n birth_year = col_double(),\n sex = col_factor(),\n gender = col_factor(),\n homeworld = col_factor(),\n species = col_factor(),\n films = col_skip(),\n vehicles = col_skip(),\n starships = col_skip()\n)\n\nstarwars <- read_csv(\"data/starwars.csv\", col_types = corrected_cols)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ncols(\n name = col_character(),\n height = col_double(),\n mass = col_double(),\n hair_color = col_character(),\n skin_color = col_character(),\n eye_color = col_character(),\n birth_year = col_double(),\n sex = col_character(),\n gender = col_character(),\n homeworld = col_character(),\n species = col_character(),\n films = col_character(),\n vehicles = col_character(),\n starships = col_character()\n)\n```\n:::\n:::\n\n\n\n
\n\n\n### Plots {#sec-exercises-plots}\n\nProduce the following plots and one plot of your own choosing. Write a brief summary of what each plot shows and any conclusions you might reach from the data. \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](app-import_files/figure-html/unnamed-chunk-25-1.png){fig-align='center' width=100%}\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](app-import_files/figure-html/unnamed-chunk-26-1.png){fig-align='center' width=100%}\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](app-import_files/figure-html/unnamed-chunk-27-1.png){fig-align='center' width=100%}\n:::\n:::\n\n\n\n
\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(starwars, aes(height)) +\n geom_histogram(binwidth = 25, colour = \"black\", alpha = .3) +\n scale_x_continuous(breaks = seq(from = 50, to = 300, by = 25)) +\n labs(title = \"Height (cm) distribution of Star Wars Characters\") +\n theme_classic()\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(starwars, aes(height, mass)) +\n geom_point() +\n labs(title = \"Mass (kg) by height (cm) distribution of Star Wars Characters\") +\n theme_classic() +\n scale_x_continuous(breaks = seq(from = 0, to = 300, by = 50)) +\n scale_y_continuous(breaks = seq(from = 0, to = 2000, by = 100)) +\n coord_cartesian(xlim = c(0, 300))\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(starwars, aes(x = gender, fill = gender)) +\n geom_bar(show.legend = FALSE, colour = \"black\") +\n scale_x_discrete(name = \"Gender of character\", labels = (c(\"Masculine\", \"Feminine\", \"Missing\"))) +\n scale_fill_brewer(palette = 2) +\n labs(title = \"Number of Star Wars characters of each gender\") +\n theme_bw()\n```\n:::\n\n\n\n
\n\n\n### Make it look nice\n\n* Add at least one Star Wars related image from an online source\n* Hide the code and any messages from the knitted output\n* Resize any images as you see fit\n\n\n
\n\n\n\n::: {.cell layout-align=\"center\" verbatim='r, echo = FALSE, out.width = \"50%\", fig.cap=\"Adaptation of Star Wars logo created by Weweje; original logo by Suzy Rice, 1976. CC-BY-3.0\"'}\n
```{r, echo = FALSE, out.width = \"50%\", fig.cap=\"Adaptation of Star Wars logo created by Weweje; original logo by Suzy Rice, 1976. CC-BY-3.0\"}
\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![Adaptation of Star Wars logo created by Weweje; original logo by Suzy Rice, 1976. CC-BY-3.0](https://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/Star_wars2.svg/2880px-Star_wars2.svg.png){fig-align='center' width=50%}\n:::\n:::\n\n\n\n\n
\n\n\n### Share your work\n\nOnce you're done, share your knitted html file on the Week 4 Teams channel so other learners on the course can see how you approached the task. \n\n\n\n\n\n## Glossary {#sec-glossary-data}\n\n\n::: {.cell layout-align=\"center\"}\n
\n| term | definition |\n|:-----|:-----------|\n| argument | |\n| character | |\n| console | |\n| data type | |\n| double | |\n| extension | |\n| global environment | |\n| integer | |\n| logical | |\n| NA | |\n| nominal | |\n| numeric | |\n| panes | |\n| tabular data | |\n| tidyverse | |\n| URL | |\n| vector | |\n
\n:::\n\n\n## Further resources {#sec-resources-data}\n\n* [Data import cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf)\n* [Chapter 11: Data Import](http://r4ds.had.co.nz/data-import.html) in *R for Data Science*\n* [Multi-row headers](https://psyteachr.github.io/tutorials/multi-row-headers.html)\n\n\n\n\n\n\n\n\n\n",
+ "supporting": [
+ "app-import_files"
+ ],
+ "filters": [
+ "rmarkdown/pagebreak.lua"
+ ],
+ "includes": {},
+ "engineDependencies": {},
+ "preserve": {},
+ "postProcess": true
+ }
+}
\ No newline at end of file
diff --git a/docs/04-data_files/figure-html/unnamed-chunk-25-1.png b/_freeze/app-import/figure-html/unnamed-chunk-25-1.png
similarity index 100%
rename from docs/04-data_files/figure-html/unnamed-chunk-25-1.png
rename to _freeze/app-import/figure-html/unnamed-chunk-25-1.png
diff --git a/docs/04-data_files/figure-html/unnamed-chunk-26-1.png b/_freeze/app-import/figure-html/unnamed-chunk-26-1.png
similarity index 100%
rename from docs/04-data_files/figure-html/unnamed-chunk-26-1.png
rename to _freeze/app-import/figure-html/unnamed-chunk-26-1.png
diff --git a/docs/04-data_files/figure-html/unnamed-chunk-27-1.png b/_freeze/app-import/figure-html/unnamed-chunk-27-1.png
similarity index 100%
rename from docs/04-data_files/figure-html/unnamed-chunk-27-1.png
rename to _freeze/app-import/figure-html/unnamed-chunk-27-1.png
diff --git a/_quarto.yml b/_quarto.yml
index dbce9672..7538e36e 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -23,7 +23,7 @@ book:
# hypothesis:
# theme: clean
# openSidebar: false
- repo-url: https://github.com/psyteachr/ads-v2/
+ repo-url: https://github.com/psyteachr/ads-v3/
repo-branch: master
repo-actions: [edit, issue, source]
# downloads: [pdf, epub]
@@ -33,10 +33,10 @@ book:
# background: light
margin-header: ""
page-footer:
- left: "CC-BY 2022, psyTeachR"
+ left: "CC-BY 2024, psyTeachR"
right:
- icon: github
- href: https://github.com/psyteachr/ads-v2
+ href: https://github.com/psyteachr/ads-v3
- icon: twitter
href: https://twitter.com/psyteachr
- icon: https://zenodo.org/badge/DOI/10.5281/zenodo.6365077.svg
@@ -46,9 +46,9 @@ book:
- 01-intro.qmd
- 02-reports.qmd
- 03-viz.qmd
- - 04-data.qmd
- - 05-summary.qmd
- - 06-formative.qmd
+ - 04-summary.qmd
+ - 05-formative.qmd
+ - 06-ai.qmd
- 07-joins.qmd
- 08-tidy.qmd
- 09-wrangle.qmd
@@ -62,12 +62,10 @@ book:
- app-conventions.qmd
- app-teams.qmd
- app-debugging.qmd
+ - app-import.qmd
- app-datatypes.qmd
- app-dates.qmd
- app-styling.qmd
-# - app-hashtags.qmd
-# - app-twitter.qmd
- - app-spotify.qmd
- app-webpage.qmd
diff --git a/_render.R b/_render.R
deleted file mode 100644
index 0ef94c16..00000000
--- a/_render.R
+++ /dev/null
@@ -1,25 +0,0 @@
-# render only completed chapters ----
-# edit _bookdown_v1.yml to add or remove chapters to rmd_files:
-xfun::in_dir("book", bookdown::render_book(config_file = "_bookdown_v1.yml"))
-browseURL("docs/index.html")
-
-# run if anything in book/data changes ----
-# zip the data files
-zipfile <- "book/data/data"
-if (file.exists(zipfile)) file.remove(zipfile)
-f.zip <- list.files("book/data", full.names = TRUE)
-zip(zipfile, c(f.zip), flags = "-j")
-
-# copy data directory to docs
-R.utils::copyDirectory(
- from = "book/data",
- to = "docs/data",
- overwrite = TRUE,
- recursive = TRUE)
-
-
-#-------------------------------------------------------------------------
-# render a draft book
-# comment out chapters to render a subset
-xfun::in_dir("book", bookdown::render_book(config_file = "_bookdown_draft.yml"))
-browseURL("docs/draft/index.html")
diff --git a/app-hashtags.qmd b/app-hashtags.qmd
deleted file mode 100644
index d2b0766c..00000000
--- a/app-hashtags.qmd
+++ /dev/null
@@ -1,223 +0,0 @@
-# Twitter Hashtags {#sec-twitter-hashtags}
-
-In this appendix, we will create a table of the top five hashtags used in conjunction with #NationalComingOutDay, the total number of tweets in each hashtag, the total number of likes, and the top tweet for each hashtag.
-
-```{r setup-app-g, message=FALSE}
-library(tidyverse) # data wrangling functions
-library(rtweet) # for searching tweets
-library(glue) # for pasting strings
-library(kableExtra) # for nice tables
-```
-
-
-The example below uses the data from @sec-summary (which you can download), but we encourage you to try a hashtag that interests you.
-
-```{r, eval = FALSE}
-# load tweets
-tweets <- search_tweets(q = "#NationalComingOutDay",
- n = 30000,
- include_rts = FALSE)
-
-# save them to a file so you can skip this step in the future
-saveRDS(tweets, file = "data/ncod_tweets.rds")
-```
-
-```{r}
-# load tweets from the file
-tweets <- readRDS("data/ncod_tweets.rds")
-```
-
-## Select relevant data
-
-The function `select()` is useful for just keeping the variables (columns) you need to work with, which can make working with very large datasets easier. The arguments to `select()` are simply the names of the variables and the resulting table will present them in the order you specify.
-
-```{r}
-tweets_with_hashtags <- tweets %>%
- select(hashtags, text, favorite_count, media_url)
-```
-
-
-## Unnest columns
-
-Look at the dataset using `View(tweets_with_hashtags)` or clicking on it in the Environment tab. You'll notice that the variable `hashtags` has multiple values in each cell (i.e., when users used more than one hashtag in a single tweet). In order to work with this information, we need to separate each hashtag so that each row of data represents a single hashtag. We can do this using the function `unnest()` and adding a pipeline of code.
-
-```{r}
-tweets_with_hashtags <- tweets %>%
- select(hashtags, text, favorite_count, media_url) %>%
- unnest(cols = hashtags)
-```
-
-::: {.callout-note .try}
-Look at `tweets_with_hashtags` to see how it is different from the table `tweets`. Why does it have more rows?
-:::
-
-
-## Top 5 hashtags
-
-To get the top 5 hashtags we need to know how many tweets used each one. This code uses pipes to build up the analysis. When you encounter multi-pipe code, it can be very useful to run each line of the pipeline to see how it builds up and to check the output at each step. This code:
-
-* Starts with the object `tweets_with_hashtags` and then;
-* Counts the number of tweets for each hashtag using `count()` and then;
-* Filters out any blank cells using `!is.na()` (you can read this as "keep any row value where it is not true (`!`) that `hashtags` is missing") and then;
-* Returns the top five values using `slice_max()` and orders them by the `n` column.
-
-```{r}
-top5_hashtags <- tweets_with_hashtags %>%
- count(hashtags) %>%
- filter(!is.na(hashtags)) %>% # get rid of the blank value
- slice_max(order_by = n, n = 5)
-
-top5_hashtags
-```
-
-Two of the hashtags are the same, but with different case. We can fix this by adding in an extra line of code that uses `mutate()` to overwrite the variable `hashtags` with the same data but transformed to lower case using `tolower()`. Since we're going to use the table `tweets_with_hashtags` a few more times, let's change that table first rather than having to fix this every time we use the table.
-
-```{r}
-tweets_with_hashtags <- tweets_with_hashtags %>%
- mutate(hashtags = tolower(hashtags))
-
-top5_hashtags <- tweets_with_hashtags %>%
- count(hashtags) %>%
- filter(!is.na(hashtags)) %>% # get rid of the blank value
- slice_max(order_by = n, n = 5)
-
-top5_hashtags
-```
-
-## Top tweet per hashtag
-
-Next, get the top tweet for each hashtag using `filter()`. Use `group_by()` before you filter to select the most-liked tweet in each hashtag, rather than the one with most likes overall. As you're getting used to writing and running this kind of multi-step code, it can be very useful to take out individual lines and see how it changes the output to strengthen your understanding of what each step is doing.
-
-```{r}
-top_tweet_per_hashtag <- tweets_with_hashtags %>%
- group_by(hashtags) %>%
- filter(favorite_count == max(favorite_count)) %>%
- sample_n(size = 1) %>%
- ungroup()
-```
-
-::: {.callout-note .try}
-The function `slice_max()` accomplishes the same thing as the `filter()` and `sample_n()` functions above. Look at the help for this function and see if you can figure out how to use it.
-
-```{r, webex.hide = TRUE, eval = FALSE}
-top_tweet_per_hashtag <- tweets_with_hashtags %>%
- group_by(hashtags) %>%
- slice_max(
- order_by = favorite_count,
- n = 1, # select the 1 top value
- with_ties = FALSE # don't include ties
- ) %>%
- ungroup()
-```
-
-:::
-
-## Total likes per hashtag
-
-Get the total number of likes per hashtag by grouping and summarising with `sum()`.
-
-```{r}
-likes_per_hashtag <- tweets_with_hashtags %>%
- group_by(hashtags) %>%
- summarise(total_likes = sum(favorite_count)) %>%
- ungroup()
-```
-
-## Put it together
-
-We can put everything together using `left_join()` (see @sec-left_join). This will keep everything from the first table specified and then add on the relevant data from the second table specified. In this case, we add on the data in `top_tweet_per_hashtag` and `likes_per_hashtag`, but only for the hashtags included in `top5_hashtags`.
-
-```{r}
-top5 <- top5_hashtags %>%
- left_join(top_tweet_per_hashtag, by = "hashtags") %>%
- left_join(likes_per_hashtag, by = "hashtags")
-```
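-
-To see what `left_join()` keeps and drops, here is a minimal sketch with two made-up tables (the table names and values are invented for illustration and are not from the tweet data). Rows in the second table that have no match in the first are left out of the result.
-
-```{r}
-library(dplyr) # already loaded if you loaded tidyverse
-
-# toy tables, invented for illustration
-toy_top <- tibble(hashtags = c("a", "b"), n = c(10, 5))
-toy_likes <- tibble(hashtags = c("a", "b", "c"), total_likes = c(100, 40, 7))
-
-# keeps both rows of toy_top and adds total_likes; hashtag "c" is not included
-toy_joined <- left_join(toy_top, toy_likes, by = "hashtags")
-toy_joined
-```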
-
-## Twitter data idiosyncrasies
-
-Before we can finish up though, there are a couple of extra steps we need to add in to account for some of the idiosyncrasies of Twitter data.
-
-First, the `@` symbol is used by R Markdown for referencing (see @sec-references). It's likely that some of the tweets will contain this symbol, so we can use mutate to find any instances of `@` and `r glossary("escape")` them using backslashes. Backslashes create a `r glossary("literal")` version of characters that have a special meaning in R, so adding them means it will print the `@` symbol without trying to create a reference. Of course `\` also has a special meaning in R, which means we also need to backslash the backslash. Isn't programming fun? We can use the same code to tidy up any ampersands (&), which sometimes display as "&amp;".
-
-Second, if there are multiple images associated with a single tweet, `media_url` will be a list, so we use `unlist()`. This might not be necessary for a different set of tweets; use `glimpse()` to check the data types.
-
-Finally, we use `select()` to tidy up the table and just keep the columns we need.
-
-```{r}
-top5 <- top5_hashtags %>%
- left_join(top_tweet_per_hashtag, by = "hashtags") %>%
- left_join(likes_per_hashtag, by = "hashtags") %>%
- # replace @ with \@ so @ doesn't trigger referencing
- mutate(text = gsub("@", "\\\\@", text),
- text = gsub("&amp;", "&", text)) %>%
- # media_url can be a list if there is more than one image
- mutate(image = unlist(media_url)) %>%
- # put the columns you want to display in order
- select(hashtags, n, total_likes, text, image)
-
-top5
-```
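-
-The double escaping of `@` can be hard to follow, so here is a small sketch on a single made-up string (the text is invented for illustration). It only shows the `@` replacement step from the code above.
-
-```{r}
-# toy tweet text, invented for illustration
-toy_text <- "Thanks @someone for sharing"
-
-# "\\\\@" in R code is the replacement string \\@, which gsub() writes out as \@
-toy_escaped <- gsub("@", "\\\\@", toy_text)
-
-# writeLines() prints the literal characters rather than the R string syntax
-writeLines(toy_escaped)
-```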
-
-## Make it prettier
-
-Whilst this table now has all the information we want, it isn't great aesthetically. The kableExtra package has functions that will improve the presentation of tables. We're going to show you two examples of how you could format this table.
-
-The first is (relatively) simple and stays within the R programming language using functionality from kableExtra. The main aesthetic feature of the table is the incorporation of the pride flag colours for each row. Each row is set to a different colour of the pride flag and the text is set to be black and bold to improve the contrast. We've also removed the `image` column, as it just contains a URL.
-
-```{r}
-# the hex codes of the pride flag colours, obtained from https://www.schemecolor.com/lgbt-flag-colors.php
-
-# the last two characters (80) make the colours semi-transparent.
-# omitting them or setting to FF make them 100% opaque
-
-pride_colours <- c("#FF001880",
- "#FFA52C80",
- "#FFFF4180",
- "#00801880",
- "#0000F980",
- "#86007D80")
-
-top5 %>%
- select(-image) %>%
- kable(col.names = c("Hashtags", "No. tweets", "Likes", "Tweet"),
- caption = "Stats and the top tweet for the top five hashtags.",
-
- ) %>%
- kable_paper() %>%
- row_spec(row = 0:5, bold = T, color = "black") %>%
- row_spec(row = 0, font_size = 18,
- background = pride_colours[1]) %>%
- row_spec(row = 1, background = pride_colours[2])%>%
- row_spec(row = 2, background = pride_colours[3])%>%
- row_spec(row = 3, background = pride_colours[4])%>%
- row_spec(row = 4, background = pride_colours[5])%>%
- row_spec(row = 5, background = pride_colours[6])
-```
-
-## Customise with HTML
-
-An alternative approach incorporates `r glossary("HTML")` and also uses the package glue to combine information from multiple columns.
-
-First, we use `mutate()` to create a new column `col1` that combines the first three columns into a single column and adds some formatting to make the hashtag bold (`<strong>`) and insert line breaks (`<br/>`). We'll also change the image column to display the image using HTML if there is an image.
-
-If you're not familiar with HTML, don't worry if you don't understand the code below; the point is to show you the full extent of the flexibility available.
-
-```{r eval = TRUE}
-top5 %>%
-  mutate(col1 = glue("<strong>#{hashtags}</strong>
-                     <br/>
-                     tweets: {n}
-                     <br/>
-                     likes: {total_likes}"),
-         img = ifelse(!is.na(image),
-                      glue("<img src='{image}' />"),
-                      "")) %>%
- select(col1, text, img) %>%
- kable(
- escape = FALSE, # allows HTML in the table
- col.names = c("Hashtag", "Top Tweet", ""),
- caption = "Stats and the top tweet for the top five hashtags.") %>%
- column_spec(1:2, extra_css = "vertical-align: top;") %>%
- row_spec(0, extra_css = "vertical-align: bottom;") %>%
- kable_paper()
-```
diff --git a/04-data.qmd b/app-import.qmd
similarity index 99%
rename from 04-data.qmd
rename to app-import.qmd
index 3bec2529..b29856d7 100644
--- a/04-data.qmd
+++ b/app-import.qmd
@@ -1,5 +1,7 @@
# Data Import {#sec-data}
+THIS APPENDIX IS BEING UPDATED...
+
## Intended Learning Outcomes {#sec-ilo-data .unnumbered}
* Be able to inspect data
diff --git a/app-spotify.qmd b/app-spotify.qmd
deleted file mode 100644
index bee83220..00000000
--- a/app-spotify.qmd
+++ /dev/null
@@ -1,383 +0,0 @@
-# Spotify Data {#sec-spotify}
-
-This appendix was inspired by [Michael Mullarkey's tutorial](https://mcmullarkey.github.io/mcm-blog/posts/2022-01-07-spotify-api-r/){target="_blank"}, which you can follow to make beautiful dot plots out of your own Spotify data. This tutorial doesn't require you to use Spotify; you just need to create a developer account so you can access their data API with [spotifyr](https://www.rcharlie.com/spotifyr/){target="_blank"}.
-
-```{r setup-app-spotify}
-library(usethis) # to set system environment variables
-library(spotifyr) # to access Spotify
-library(tidyverse) # for data wrangling
-library(DT) # for interactive tables
-```
-
-The package [spotifyr](https://www.rcharlie.com/spotifyr){target="_blank"} has instructions for setting up a developer account with Spotify and setting up an "app" so you can get authorisation codes.
-
-Once you've set up the app, you can copy the client ID and secret to your R environment file. The easiest way to do this is with `edit_r_environ()` from usethis. Setting scope to "user" makes this available to any R project on your computer, while setting it to "project" makes it only available to this project.
-
-```{r, eval = FALSE}
-usethis::edit_r_environ(scope = "user")
-```
-
-Add the following text to your environment file (don't delete anything already there), replacing the zeros with your personal ID and secret. Save and close the file and restart R.
-
-```
-SPOTIFY_CLIENT_ID="0000000000000000000000000000"
-SPOTIFY_CLIENT_SECRET="0000000000000000000000000000"
-```
-
-Double check that it worked by typing the following into the console. Don't put it in your script unless you mean to share this confidential info. You should see your values, not "", if it worked.
-
-```{r, eval = FALSE}
-# run in the console, don't save in a script
-Sys.getenv("SPOTIFY_CLIENT_ID")
-Sys.getenv("SPOTIFY_CLIENT_SECRET")
-```
-
-Now you're ready to get data from Spotify. There are several types of data that you can download.
-
-```{r, include = FALSE}
-# avoids calling spotify repeatedly and allows knitting even when you don't have a connection to Spotify
-
-#saveRDS(euro_genre2, "R/euro_genre2.Rds")
-
-gaga <- readRDS("R/gaga.Rds")
-eurovision2021 <- readRDS("R/eurovision2021.Rds")
-euro_genre <- readRDS("R/euro_genre.Rds")
-euro_genre2 <- readRDS("R/euro_genre2.Rds")
-euro_genre200 <- readRDS("R/euro_genre200.Rds")
-btw_analysis <- readRDS("R/btw_analysis.Rds")
-btw_features <- readRDS("R/btw_features.Rds")
-
-```
-
-
-## By Artist
-
-Choose your favourite artist and download their discography. Set `include_groups` to one or more of "album", "single", "appears_on", and "compilation".
-
-```{r, eval=FALSE}
-gaga <- get_artist_audio_features(
- artist = 'Lady Gaga',
- include_groups = "album"
-)
-```
-
-Let's explore the data you get back. Use `glimpse()` to see what columns are available and what type of data they have. It looks like there is a row for each of this artist's tracks.
-
-Let's answer a few simple questions first.
-
-### Tracks per Album
-
-How many tracks are on each album? Some tracks have more than one entry in the table, so first select just the `album_name` and `track_name` columns and use `distinct()` to get rid of duplicates. Then `count()` the tracks per album. We're using `DT::datatable()` to make the table interactive. Try sorting the table by number of tracks.
-
-```{r}
-gaga %>%
- select(album_name, track_name) %>%
- distinct() %>%
- count(album_name) %>%
- datatable(colnames = c("Album Name", "Number of Tracks"))
-```
-
-::: {.callout-note .try}
-Use `count()` to explore the columns `key_name`, `mode_name`, and any other non-numeric columns.
-:::
-
-### Tempo
-
-What sort of tempo is Lady Gaga's music? First, let's look at a very basic plot to get an overview.
-
-```{r}
-ggplot(gaga, aes(tempo)) +
- geom_histogram(binwidth = 1)
-```
-
-What's going on with the tracks with a tempo of 0?
-
-```{r}
-gaga %>%
- filter(tempo == 0) %>%
- select(album_name, track_name)
-```
-
-Looks like it's all dialogue, so we should omit these. Let's also check how variable the tempo is for multiple instances of the same track. A quick way to do this is to group by album and track, then check the `r glossary("standard deviation")` of the tempo. If it's 0, this means that all of the values are identical. The bigger it is, the more the values vary. If you have a lot of data with a `r glossary("normal distribution")` (like a bell curve), then about 68% of the data are within one SD of the mean, and about 95% are within 2 SDs.
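-
-As a quick illustration of how to read a standard deviation, here is a sketch with two made-up sets of tempo values (the numbers are invented, not taken from the Spotify data).
-
-```{r}
-# toy tempo values, invented for illustration
-same_tempo <- c(120, 120, 120)
-varied_tempo <- c(100, 120, 140)
-
-sd(same_tempo)   # 0: all values are identical
-sd(varied_tempo) # 20: the values vary around the mean of 120
-```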
-
-If we filter to tracks with SD greater than 0 (so any variation at all), we see that most tracks have a little variation. However, if we filter to tracks with an SD greater than 1, we see a few songs with slightly different tempo, and a few with wildly different tempo.
-
-```{r}
-gaga %>%
- # omit tracks with "Dialogue" in the name
- filter(!str_detect(track_name, "Dialogue")) %>%
- # check for varying tempos for same track
- group_by(album_name, track_name) %>%
- filter(sd(tempo) > 1) %>%
- ungroup() %>%
- select(album_name, track_name, tempo) %>%
- arrange(album_name, track_name)
-```
-
-You can deal with these in any way you choose. Filter out some versions of the songs or listen to them to see which value you agree with and change the others. Here, we'll deal with it by averaging the values for each track. This will also remove the tiny differences in the majority of duplicate tracks. Now we're ready to plot.
-
-```{r}
-gaga %>%
- filter(tempo > 0) %>%
- group_by(album_name, track_name) %>%
- summarise(tempo = round(mean(tempo)),
- .groups = "drop") %>%
- ungroup() %>%
- ggplot(aes(x = tempo, fill = ..x..)) +
- geom_histogram(binwidth = 4, show.legend = FALSE) +
- scale_fill_gradient(low = "#521F64", high = "#E8889C") +
- labs(x = "Beats per minute",
- y = "Number of tracks",
- title = "Tempo of Lady Gaga Tracks")
-```
-
-::: {.callout-note .try}
-Can you see how we made the gradient fill for the histograms? Since the x-value of each bar depends on the binwidth, you have to use the code `..x..` in the mapping (not `tempo`) to make the fill correspond to each bar's value.
-:::
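-
-In recent versions of ggplot2 (3.4.0 and later), the `..x..` notation is deprecated in favour of `after_stat(x)`. The sketch below shows the equivalent mapping on a small made-up data frame (the `toy_tempo` values are invented for illustration).
-
-```{r}
-library(ggplot2) # already loaded if you loaded tidyverse
-
-# toy data, invented for illustration
-toy_tempo <- data.frame(tempo = c(90, 95, 100, 120, 124, 128, 130, 150))
-
-# after_stat(x) refers to the x-value of each bar computed by the histogram stat
-toy_plot <- ggplot(toy_tempo, aes(x = tempo, fill = after_stat(x))) +
-  geom_histogram(binwidth = 4, show.legend = FALSE)
-
-toy_plot
-```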
-
-This looks OK, but maybe we want a more striking plot. Let's make a custom plot style and assign it to `gaga_style` in case we want to use it again. Then add it to the shortcut function `last_plot()` to avoid having to retype the code for the last plot we created.
-
-```{r}
-# define style
-gaga_style <- theme(
- plot.background = element_rect(fill = "black"),
- text = element_text(color = "white", size = 11),
- panel.background = element_rect(fill = "black"),
- panel.grid.major.x = element_blank(),
- panel.grid.minor.x = element_blank(),
- panel.grid.major.y = element_line(colour = "white", size = 0.2),
- panel.grid.minor.y = element_line(colour = "white", size = 0.2),
- axis.text = element_text(color = "white"),
- plot.title = element_text(hjust = 0.5)
-)
-
-## add it to the last plot created
-last_plot() + gaga_style
-
-```
-
-
-## By Playlist
-
-You need to know the "uri" of a public playlist to access data on it. You can get this by copying the link to the playlist and selecting the 22 characters between "https://open.spotify.com/playlist/" and "?si=...". Let's have a look at the Eurovision 2021 playlist.
-
-```{r, eval = FALSE}
-eurovision2021 <- get_playlist_audio_features(
- playlist_uris = "37i9dQZF1DWVCKO3xAlT1Q"
-)
-```
-
-Use `glimpse()` and `count()` to explore the structure of this table.
-
-### Track ratings
-
-Each track has several ratings: danceability, energy, speechiness, acousticness, instrumentalness, liveness, and valence. I'm not sure how these are determined (almost certainly by an algorithm). Let's select the track names and these columns to have a look.
-
-```{r}
-eurovision2021 %>%
- select(track.name, danceability, energy, speechiness:valence) %>%
- datatable()
-```
-
-What was the general mood of Eurovision songs in 2021? Let's use plots to assess. First, we need to get the data into long format to make it easier to plot multiple attributes.
-
-```{r}
-playlist_attributes <- eurovision2021 %>%
- select(track.name, danceability, energy, speechiness:valence) %>%
- pivot_longer(cols = danceability:valence,
- names_to = "attribute",
- values_to = "rating")
-```
-
-When we plot everything on the same plot, instrumentalness has such a consistently low value that all the other attributes disappear.
-
-```{r}
-ggplot(playlist_attributes, aes(x = rating, colour = attribute)) +
- geom_density()
-```
-
-You can solve this by putting each attribute into its own facet and letting the y-axis differ between plots by setting `scales = "free_y"`. Now it's easier to see that Eurovision songs tend to have pretty high danceability and energy.
-
-```{r}
-#| playlist-attributes-facet,
-#| fig.width = 10, fig.height = 5,
-#| fig.cap = "Seven track attributes for the playlist 'Eurovision 2021'"
-ggplot(playlist_attributes, aes(x = rating, colour = attribute)) +
- geom_density(show.legend = FALSE) +
- facet_wrap(~attribute, scales = "free_y", nrow = 2)
-```
-
-### Popularity
-
-Let's look at how these attributes relate to track popularity. We'll exclude instrumentalness, since it doesn't have much variation.
-
-```{r}
-popularity <- eurovision2021 %>%
- select(track.name, track.popularity,
- acousticness, danceability, energy,
- liveness, speechiness, valence) %>%
- pivot_longer(cols = acousticness:valence,
- names_to = "attribute",
- values_to = "rating")
-```
-
-
-```{r}
-#| playlist-popularity,
-#| fig.width = 7.5, fig.height = 5,
-#| fig.cap = "The relationship between track attributes and popularity."
-ggplot(popularity, aes(x = rating, y = track.popularity, colour = attribute)) +
- geom_point(alpha = 0.5, show.legend = FALSE) +
- geom_smooth(method = lm, formula = y~x, show.legend = FALSE) +
- facet_wrap(~attribute, scales = "free_x", nrow = 2) +
- labs(x = "Attribute Value",
- y = "Track Popularity")
-```
-
-
-### Nested data
-
-Some of the columns in this table contain more tables. For example, each entry in the `track.artists` column contains a table with columns `href`, `id`, `name`, `type`, `uri`, and `external_urls.spotify`. Use `unnest()` to extract these tables. If there is more than one artist for a track, this will expand the table. For example, the track "Adrenalina" has two rows now, one for Senhit and one for Flo Rida.
-
-```{r}
-eurovision2021 %>%
- unnest(track.artists) %>%
- select(track = track.name,
- artist = name,
- popularity = track.popularity) %>%
- datatable()
-
-```
-
-
-::: {.callout-note .try}
-If you're a Eurovision nerd (like Emily), try downloading playlists from several previous years and visualise trends. See if you can find lists of the [scores for each year](https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2021){target="_blank"} and join the data to see what attributes are most related to points.
-:::
-
-## By Genre
-
-Select the first 20 artists in the genre "eurovision". So that people don't spam the Spotify API, you are limited to up to 50 artists per request.
-
-```{r, eval=FALSE}
-euro_genre <- get_genre_artists(
- genre = "eurovision",
- limit = 20,
- offset = 0
-)
-```
-
-
-```{r}
-euro_genre %>%
- select(name, popularity, followers.total) %>%
- datatable()
-```
-
-Now you can select the next 20 artists, incrementing the offset by 20, join that to the first table, and process the data.
-
-```{r, eval=FALSE}
-euro_genre2 <- get_genre_artists(
- genre = "eurovision",
- limit = 20,
- offset = 20
-)
-```
-
-
-```{r}
-bind_rows(euro_genre, euro_genre2) %>%
- select(name, popularity, followers.total) %>%
- datatable()
-```
-
-### Repeated calls
-
-There is a programmatic way to make several calls to a function that limits you. You usually want to set this up so that you are waiting a few seconds or minutes between calls so that you don't get locked out (depending on how strict the API is). Use `map_df()` to automatically join the results into one big table.
-
-```{r, eval = FALSE}
-# create a slow version of get_genre_artists
-# delays 2 seconds after running
-slow_get_genre_artists <- slowly(get_genre_artists,
- rate = rate_delay(2))
-
-# set 4 offsets from 0 to 150 by 50
-offsets <- seq(0, 150, 50)
-
-# run the slow function once for each offset
-euro_genre200 <- map_df(.x = offsets,
- .f = ~slow_get_genre_artists("eurovision",
- limit = 50,
- offset = .x))
-```
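-
-You can try the same pattern without a Spotify account. The sketch below swaps in a made-up function, `fake_api()`, that stands in for the real API call (it is hypothetical and just returns a small table), and uses a short delay so it runs quickly.
-
-```{r}
-library(purrr) # already loaded if you loaded tidyverse
-
-# hypothetical stand-in for an API call: returns `limit` rows starting after `offset`
-fake_api <- function(offset, limit = 2) {
-  data.frame(id = offset + seq_len(limit))
-}
-
-# create a slow version that waits 0.1 seconds between calls
-slow_fake_api <- slowly(fake_api, rate = rate_delay(0.1))
-
-# run the slow function once for each offset and bind the rows together
-toy_results <- map_df(.x = c(0, 2, 4), .f = ~slow_fake_api(.x))
-toy_results
-```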
-
-
-```{r}
-euro_genre200 %>%
- select(name, popularity, followers.total) %>%
- arrange(desc(followers.total)) %>%
- datatable()
-```
-
-
-## By Track
-
-You can get even more info about a specific track if you know its Spotify ID. You can get this from an artist, album, or playlist table.
-
-```{r}
-# get the ID for Born This Way from the original album
-btw_id <- gaga %>%
- filter(track_name == "Born This Way",
- album_name == "Born This Way") %>%
- pull(track_id)
-```
-
-### Features
-
-Features are a list of summary attributes of the track. These are also included in the previous tables, so this function isn't very useful unless you are getting track IDs directly.
-
-```{r, eval = FALSE}
-btw_features <- get_track_audio_features(btw_id)
-```
-
-```{r, echo = FALSE}
-str(btw_features)
-```
-
-### Analysis
-
-The analysis gives you seven different tables of details about the track. Use the `names()` function to see their names and look at each object to see what information it contains.
-
-```{r, eval = FALSE}
-btw_analysis <- get_track_audio_analysis(btw_id)
-```
-
-```{r, echo = FALSE}
-names(btw_analysis)
-```
-
-* `meta` gives you a list of some info about the analysis.
-* `track` gives you a list of attributes, including `duration`, `loudness`, `end_of_fade_in`, `start_of_fade_out`, and `time_signature`. Some of this info was available in the previous tables.
-* `bars`, `beats`, and `tatums` are tables with the start, duration and confidence for each bar, beat, or tatum of music (whatever a "tatum" is).
-* `sections` is a table with the start, duration, loudness, tempo, key, mode, time signature for each section of music, along with confidence measures of each.
-* `segments` is a table with information about loudness, pitch and timbre of segments of analysis, which tend to be around 0.2 (seconds?)
-
-You can use this data to map a song.
-
-```{r}
-#| track-segment-map,
-#| fig.cap = "Use data from the segments table of a track analysis to plot loudness over time."
-ggplot(btw_analysis$segments, aes(x = start,
- y = loudness_start,
- color = loudness_start)) +
- geom_point(show.legend = FALSE) +
- scale_colour_gradient(low = "red", high = "purple") +
- scale_x_continuous(breaks = seq(0, 300, 30)) +
- labs(x = "Seconds",
- y = "Loudness",
- title = "Loudness Map for 'Born This Way'") +
- gaga_style
-```
-
-
-
-
diff --git a/app-twitter.qmd b/app-twitter.qmd
deleted file mode 100644
index b53f0709..00000000
--- a/app-twitter.qmd
+++ /dev/null
@@ -1,335 +0,0 @@
-# Twitter Data {#sec-twitter-data}
-
-This appendix takes a problem-based approach to demonstrate how to use tidyverse functions to summarise and visualise twitter data.
-
-```{r setup-app-h, message=FALSE}
-library(tidyverse) # data wrangling functions
-library(lubridate) # for handling dates and times
-```
-
-## Single Data File
-
-### Export Data
-
-You can export your organisation's twitter data from . Go to the Tweets tab, choose a time period, and export the last month's data by day (or use the files from the [class data](data/data.zip)).
-
-### Import Data
-
-```{r, message=FALSE}
-file <- "data/tweets/daily_tweet_activity_metrics_LisaDeBruine_20210801_20210901_en.csv"
-
-daily_tweets <- read_csv(file, show_col_types = FALSE)
-```
-
-### Select Relevant Data
-
-The file contains a bunch of columns about "promoted" tweets that will be blank unless your organisation pays for those. Let's get rid of them. We can use the select helper `starts_with()` to get all the columns that start with `"promoted"` and remove them by prefacing the function with `!`. Now there should be 20 columns, which we can inspect with `glimpse()`.
-
-```{r, message=FALSE}
-daily_tweets <- read_csv(file) %>%
- select(!starts_with("promoted")) %>%
- glimpse()
-```
-
-
-### Plot Likes per Day
-
-Now let's plot likes per day. The `scale_x_date()` function lets you format an x-axis with dates.
-
-```{r likes-per-day-plot, fig.cap="Likes per day."}
-ggplot(daily_tweets, aes(x = Date, y = likes)) +
- geom_line() +
- scale_x_date(name = "",
- date_breaks = "1 day",
- date_labels = "%d",
- expand = expansion(add = c(.5, .5))) +
- ggtitle("Likes: August 2021")
-```
-
-
-### Plot Multiple Engagements
-
-What if we want to plot likes, retweets, and replies on the same plot? We need to get all of the numbers in the same column and a column that contains the "engagement type" that we can use to determine different line colours. When you have data in different columns that you need to get into the same column, it's wide and you need to pivot the data longer.
-
-```{r}
-long_tweets <- daily_tweets %>%
- select(Date, likes, retweets, replies) %>%
- pivot_longer(cols = c(likes, retweets, replies),
- names_to = "engage_type",
- values_to = "n")
-
-head(long_tweets)
-```
-
-Now we can plot the number of engagements per day by engagement type by mapping the line colour to the `engage_type` column.
-
-```{r eng-per-day-plot, fig.cap="Engagements per day by engagement type."}
-ggplot(long_tweets, aes(x = Date, y = n, colour = engage_type)) +
- geom_line() +
- scale_x_date(name = "",
- date_breaks = "1 day",
- date_labels = "%d",
- expand = expansion(add = c(.5, .5))) +
- scale_y_continuous(name = "Engagements per Day") +
- scale_colour_discrete(name = "") +
- ggtitle("August 2021") +
- theme(legend.position = c(.9, .8),
- panel.grid.minor = element_blank())
-
-```
-
-
-## Multiple Data Files
-
-Maybe now you want to compare the data from several months. First, get a list of all the files you want to combine. It's easiest if they're all in the same directory, although you can use a pattern to select the files you want if they have a systematic naming structure.
-
-```{r}
-files <- list.files(
- path = "data/tweets",
- pattern = "daily_tweet_activity_metrics",
- full.names = TRUE
-)
-```
-
-Then use `map_df()` to iterate over the list of file paths, open them with `read_csv()`, and return a big data frame with all the combined data. Then we can pipe that to the `select()` function to get rid of the "promoted" columns.
-
-```{r, message=FALSE}
-all_daily_tweets <- purrr::map_df(files, read_csv) %>%
- select(!starts_with("promoted"))
-```
-
-Now you can make a plot of likes per day for all of the months.
-
-```{r}
-ggplot(all_daily_tweets, aes(x = Date, y = likes)) +
- geom_line(colour = "dodgerblue") +
- scale_y_continuous(name = "Likes per Day") +
- scale_x_date(name = "",
- date_breaks = "1 month",
- date_labels = "%B",
- expand = expansion(add = c(.5, .5))) +
- ggtitle("Likes 2021")
-
-```
-
-
-::: {.callout-note}
-Notice that we changed the date breaks and labels for the x-axis. `%B` is the date code for the full month name. See `?strptime` for all of the date codes.
-:::
-
-
-### Likes by Month
-
-If you want to plot likes by month, first you need a column for the month. Use `mutate()` to make a new column, using `lubridate::month()` to extract the month name from the `Date` column.
-
-Then group by the new `month` column and calculate the sum of `likes`. The `group_by()` function causes all of the subsequent functions to operate inside of groups, until you call `ungroup()`. In the example below, the `sum(likes)` function calculates the sum total of the `likes` column separately for each month.
-
-```{r}
-likes_by_month <- all_daily_tweets %>%
- mutate(month = month(Date, label = TRUE)) %>%
- group_by(month) %>%
- summarise(total_likes = sum(likes)) %>%
- ungroup()
-
-likes_by_month
-```
-
-
-A column plot might make more sense than a line plot for this summary.
-
-```{r likes-by-month-plot, fig.cap = "Likes by month."}
-ggplot(likes_by_month, aes(x = month, y = total_likes, fill = month)) +
- geom_col(color = "black", show.legend = FALSE) +
- scale_x_discrete(name = "") +
- scale_y_continuous(name = "Total Likes per Month",
- breaks = seq(0, 10000, 1000),
- labels = paste0(0:10, "K")) +
- scale_fill_brewer(palette = "Spectral")
-```
-
-
-::: {.callout-note .try}
-How would you change the code in this section to plot the number of tweets published per week?
-
-Hint: if the lubridate function for the month is `month()`, what is the function for getting the week likely to be?
-
-```{r, webex.hide="Summarise Data"}
-tweets_by_week <- all_daily_tweets %>%
- mutate(week = week(Date)) %>%
- group_by(week) %>%
- summarise(start_date = min(Date),
- total_tweets = sum(`Tweets published`)) %>%
- ungroup()
-```
-
-```{r, webex.hide="Plot Data"}
-ggplot(tweets_by_week, aes(x = start_date, y = total_tweets)) +
- geom_col(fill = "hotpink") +
- scale_x_date(name = "",
- date_breaks = "1 month",
- date_labels = "%B") +
- scale_y_continuous(name = "Total Tweets per Week")
-```
-
-:::
-
-
-## Data by Tweet
-
-You can also download your twitter data by tweet instead of by day. This usually takes a little longer to download. We can use the same pattern to read and combine all of the tweet data files.
-
-The `^` at the start of the pattern means that the file name has to start with this. This means it won't match the "daily_tweet..." files.
-
-```{r}
-tweet_files <- list.files(
- path = "data/tweets",
- pattern = "^tweet_activity_metrics",
- full.names = TRUE
-)
-```
-
-First, let's open only the first file and see if we need to do anything to it.
-
-```{r, message=FALSE}
-tweets <- read_csv(tweet_files[1])
-```
-
-If you look at the file in the Viewer, you can see that the `Tweet id` column is using scientific notation (`1.355500e+18`) instead of the full 18-digit tweet ID, which gives different IDs the same value. We won't ever want to *add* ID numbers, so it's safe to represent these as characters. Set up the map over all the files with the `col_types` specified, then get rid of all the promoted columns and add `month` and `hour` columns (reading the date from the `time` column in these data).
-
-```{r, warning=FALSE}
-ct <- cols("Tweet id" = col_character())
-all_tweets <- map_df(tweet_files, read_csv, col_types = ct) %>%
- select(!starts_with("promoted")) %>%
- mutate(month = lubridate::month(time, label = TRUE),
- hour = lubridate::hour(time))
-```
-
-### Impressions per Tweet
-
-Now we can look at the distribution of impressions per tweet for each month.
-
-```{r imp-month-plot, fig.cap="Impressions per tweet per month."}
-ggplot(all_tweets, aes(x = month, y = impressions, fill = month)) +
- geom_violin(show.legend = FALSE, alpha = 0.8) +
- scale_fill_brewer(palette = "Spectral") +
- scale_x_discrete(name = "") +
- scale_y_continuous(name = "Impressions per Tweet",
-                     breaks = c(0, 10^(1:6)),
- labels = c(0, 10, 100, "1K", "10K", "100K", "1M"),
- trans = "pseudo_log") +
- ggtitle("Distribution of Twitter Impressions per Tweet in 2021")
-```
-
-::: {.callout-note .try}
-The y-axis has been transformed to "pseudo_log" to show very skewed data more clearly (most tweets get a few hundred impressions, but a few can get thousands). See what the plot looks like if you change the y-axis transformation.
-:::
-
-### Top Tweet
-
-You can display Lisa's top tweet for the year.
-
-```{r, results='asis'}
-top_tweet <- all_tweets %>%
- slice_max(order_by = likes, n = 1)
-
-glue::glue("[Top tweet]({top_tweet$`Tweet permalink`}) with {top_tweet$likes} likes:
-
----------------------------
-{top_tweet$`Tweet text`}
----------------------------
-") %>% cat()
-```
-
-### Word Cloud
-
-Or you can make a word cloud of the top words they tweet about. (You'll learn how to do this in @sec-word-clouds).
-
-```{r, echo = FALSE, message=FALSE}
-library(tidytext)
-library(ggwordcloud)
-
-omitted <- c(stop_words$word, 0:9,
- "=", "+", "lt", "gt",
- "im", "id", "ill", "ive", "isnt",
- "doesnt", "dont", "youre", "didnt")
-
-words <- all_tweets %>%
- unnest_tokens(output = "word",
- input = "Tweet text",
- token = "tweets") %>%
- count(word) %>%
- filter(!word %in% omitted) %>%
- slice_max(order_by = n, n = 50, with_ties = FALSE)
-
-ggplot(words, aes(label = word, colour = word, size = n)) +
- geom_text_wordcloud_area() +
- scale_size_area(max_size = 17) +
- theme_minimal(base_size = 14) +
- scale_color_hue(h = c(100, 420), l = 50)
-```
-
-### Tweets by Hour
-
-To make a plot of tweets by hour, colouring the data by whether or not the sun is up, we can join data from a table of sunrise and sunset times by day for Glasgow (or [download the table for your region](https://www.schoolsobservatory.org/learn/astro/nightsky/sunrs_set){target="_blank"}).
-
-The `Day` column originally read in as a character column, so convert it to a date on import using the `col_types` argument.
-
-```{r}
-sun <- read_csv("data/sunfact2021.csv",
- col_types = cols(
- Day = col_date(format="%d/%m/%Y"),
- RiseTime = col_double(),
- SetTime = col_double()
- ))
-```
-
-Create a matching `Day` column for `all_tweets`, plus an `hour` column for plotting (the factor structure starts the day at 8:00), and a `tweet_time` column for comparing to the `RiseTime` and `SetTime` columns, which are decimal hours.
-
-Then join the `sun` table and create a `timeofday` column that equals "day" if the sun is up and "night" if the sun has set.
-
-```{r}
-sun_tweets <- all_tweets %>%
- select(time) %>%
- mutate(Day = date(time),
- hour = factor(hour(time),
- levels = c(8:23, 0:7)),
- tweet_time = hour(time) + minute(time)/60) %>%
- left_join(sun, by = "Day") %>%
- mutate(timeofday = ifelse(tweet_time>RiseTime &
- tweet_time
-
+
Applied Data Skills - 1 Intro to R and RStudio
@@ -133,184 +104,218 @@
-