forked from EmilHvitfeldt/smltar
# Data {#appendixdata}
We use four main text data sets throughout this book to demonstrate building features for machine learning and training models. These data sets include texts in different languages, of different lengths (short to long), and from time periods ranging from very recent to a few hundred years ago.
These text data sets are not overly difficult to read into memory and prepare for analysis; by contrast, in many text modeling projects, the data itself may be in any of a number of formats from an API to literal paper. Practitioners may need to use skills such as web scraping or connecting to databases to even begin their work.
## Hans Christian Andersen fairy tales {#hcandersen}
The **hcandersenr** [@R-hcandersenr] package includes the text of the 157 known fairy tales by the Danish author Hans Christian Andersen (1805–1875).
There are five different languages available, with:
- 156 fairy tales in English,
- 154 in Spanish,
- 150 in German,
- 138 in Danish, and
- 58 in French.
The package contains a data set for each language with the naming convention `hcandersen_**`,
where `**` is a two-letter language code.
Each data set comes as a dataframe with two columns, `text` and `book`, where the `text` variable has the text divided into strings of up to 80 characters and the `book` variable identifies the fairy tale each string belongs to.
The package also makes available a data set called `EK`, which includes information about the publication date, language of origin, and names in the different languages.
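Because each row holds only one short string, a common first step is to gather the rows of a tale back together. The following is a minimal sketch of that idea in base R, assuming the two-column layout described above; `toy_tales` and its contents are hypothetical stand-ins, not text from the package.

```r
# Toy stand-in for one of the hcandersen_** data frames (hypothetical rows,
# not the package's actual text): one row per short string, labeled by tale.
toy_tales <- data.frame(
  text = c("A soldier came marching", "along the high road.",
           "Far out in the ocean", "the water is blue."),
  book = c("Tale A", "Tale A", "Tale B", "Tale B"),
  stringsAsFactors = FALSE
)

# Collapse the strings of each tale back into a single string per tale.
full_text <- tapply(toy_tales$text, toy_tales$book, paste, collapse = " ")

full_text[["Tale A"]]
```

The same call should work on any dataframe with `text` and `book` columns, although for tokenization it is often fine to leave the rows split as they are.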
This data set is used in Chapters \@ref(tokenization), \@ref(stopwords), and \@ref(stemming).
## Opinions of the Supreme Court of the United States {#scotus-opinions}
The **scotus** [@R-scotus] package contains a sample of the Supreme Court of the United States' opinions.
The `scotus_sample` dataframe includes one opinion per row along with the year, case name, docket number, and a unique ID number.
The text has had minimal preprocessing and includes header information in the text field, as shown here:
```{r, echo=FALSE}
library(scotus)
cat(paste(stringr::str_wrap(head(strsplit(scotus_sample$text[42], "\n")[[1]])), collapse = "\n"))
```
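A preview like the one above can be built by splitting an opinion on newlines, keeping the first few lines, and wrapping them for display. The sketch below illustrates this with base R's `strwrap()` rather than **stringr**; `toy_opinion` is a hypothetical string invented for illustration, not a row of `scotus_sample`.

```r
# Hypothetical opinion text; real rows of scotus_sample$text are much longer.
toy_opinion <- paste(
  "No. 00",
  "Argued January 1, 1900",
  "Decided February 1, 1900",
  "Mr. Justice Example delivered the opinion of the Court.",
  "The question presented in this case is an illustrative one only.",
  "Further paragraphs of the opinion would follow here.",
  "This seventh line is not part of the preview.",
  sep = "\n"
)

# Split on newlines, keep the first six lines, and wrap each to 60 characters.
first_lines <- head(strsplit(toy_opinion, "\n", fixed = TRUE)[[1]], 6)
preview <- unlist(lapply(first_lines, strwrap, width = 60))

cat(preview, sep = "\n")
```

Inspecting the first lines this way is a quick check on whether header material should be stripped before modeling.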
This data set is used in Chapters \@ref(stemming), \@ref(mlregression), and \@ref(dllstm).
## Consumer Financial Protection Bureau (CFPB) complaints {#cfpb-complaints}
```{r echo=FALSE}
library(tidyverse)
complaints <- read_csv("data/complaints.csv.gz")
```
Consumers can submit complaints to the [United States Consumer Financial Protection Bureau (CFPB)](https://www.consumerfinance.gov/data-research/consumer-complaints/) about financial products and services; the CFPB sends the complaints to companies for response.
The data set of consumer complaints used in this book has been filtered to `r scales::comma(nrow(complaints))` complaints submitted to the CFPB after January 1, 2019, that include a consumer complaint narrative (i.e., some submitted text). Each observation has a `complaint_id`, various categorical variables, and a text column `consumer_complaint_narrative` containing the written complaints, for a total of `r scales::comma(ncol(complaints))` columns.
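The filtering described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code used to prepare the book's data: `toy_complaints` is hypothetical, and the column name `date_received` is assumed (only `complaint_id` and `consumer_complaint_narrative` are named in the text).

```r
# Hypothetical rows; `date_received` is an assumed column name.
toy_complaints <- data.frame(
  complaint_id = 1:4,
  date_received = as.Date(c("2018-12-30", "2019-01-15",
                            "2019-03-02", "2019-06-20")),
  consumer_complaint_narrative = c("Old complaint text.",
                                   "I was charged twice.", NA, ""),
  stringsAsFactors = FALSE
)

# Keep complaints that have some submitted text ...
has_narrative <- !is.na(toy_complaints$consumer_complaint_narrative) &
  nzchar(toy_complaints$consumer_complaint_narrative)
# ... and were submitted after January 1, 2019 (read here as strictly after).
is_recent <- toy_complaints$date_received > as.Date("2019-01-01")

filtered <- toy_complaints[has_narrative & is_recent, ]
nrow(filtered)
```

Only the second toy row survives both conditions: the first is too old, and the last two have no narrative.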
This data set is used in Chapters \@ref(embeddings) and \@ref(mlclassification).
## Kickstarter campaign blurbs {#kickstarter-blurbs}
```{r echo=FALSE}
kickstarter <- read_csv("data/kickstarter.csv.gz")
```
The crowdfunding site [Kickstarter](https://www.kickstarter.com/) gives people a platform to gather pledges to "back" their projects, such as films, music, comics, journalism, and more. When setting up a campaign, project owners submit a description or "blurb" for their campaign to tell potential backers what it is about. The data set of campaign blurbs used in this book [was scraped from Kickstarter](https://webrobots.io/kickstarter-datasets/); the blurbs used here for modeling are from `r min(kickstarter$created_at)` to `r max(kickstarter$created_at)`, with a total of `r scales::comma(nrow(kickstarter))` campaigns in the sample. For each campaign, we know its `state`, that is, whether or not it succeeded in meeting its crowdfunding goal.
This data set is used in Chapters \@ref(dldnn), \@ref(dllstm), and \@ref(dlcnn).