---
title: 'Day 3: CHILDES and childes-db'
author: "Mike Frank"
date: "2023-01-10"
output: html_document
---
In this R Markdown document, we'll dig into data from CHILDES using childes-db! To retrieve data, you'll need the `childesr` package, which is available from CRAN.
```{r eval=FALSE}
# run this only once to install childesr
install.packages("childesr")
```
```{r}
library(childesr)
library(tidyverse)
```
# Introducing `childesr`
As with `wordbankr`, the majority of what is contained in the `childesr` package is `get_X` functions for getting the various tables in the database.
```{r}
ls("package:childesr")
```
CHILDES is organized hierarchically into collections, which contain corpora, and each of these in turn contains individual transcript files.
Let's look at the available collections!
```{r}
get_collections()
```
And here are the available corpora:
```{r}
corpora <- get_corpora()
corpora |>
  filter(collection_name == "Eng-NA")
```
As you can see, CHILDES is an impressive resource with a LOT of different data sources. These data sources can be quite idiosyncratic. As an example, the "frogs" collection refers to transcripts of children telling "frog stories", narrating a wordless picture book, across different languages.
Here, we will focus on looking at North American English corpora, contained in the "Eng-NA" collection. Where appropriate, `get_X` commands in `childesr` let you download subsets of data by specifying which collection and/or corpus you want to query (much like how `wordbankr` lets you specify a language and instrument). This keeps you from downloading lots of irrelevant data.
```{r}
get_transcripts(collection = "Eng-NA")
```
We'll look at the famous Brown corpus (the one that started it all!). There are three children in this corpus. `get_transcripts` gives a list of files in this corpus, with the child's age and name.
```{r}
# X <- Y "assignment" or "put Y into X"
# X = Y: "parameter/argument" - "set parameter/argument X to Y for a function"
# X == Y: "equivalence" - "is X equal to Y?"
get_transcripts(corpus = "Brown")
```
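The three operators in the comment above are easy to mix up, so here is a minimal sketch on a made-up vector. The `toy_` names are invented for illustration and are not part of `childesr`.

```{r}
# `<-` assigns: put the vector on the right into the name on the left
toy_ages <- c(27, 28, 29)

# `=` inside a call sets an argument: here, the `x` argument of mean()
toy_mean <- mean(x = toy_ages)

# `==` tests equality element by element and returns TRUE/FALSE values
toy_is_28 <- toy_ages == 28

toy_mean
toy_is_28
```

The same `==` test is what goes inside `filter()`, while `=` is what you use when passing `corpus = "Brown"` to a `get_X` function.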
`get_participants` also lets you look at the children (and the adults) in the corpus. Note that many files have lots of people present.
```{r}
get_participants(corpus = "Brown")
```
# childes-db: Tabular formats for language data
You might not think that tables are a good format for corpus data, but it turns out that the same things that make tidy data useful for other purposes make it useful for dealing with language. The key is that we maintain tables in `childesr` at several levels: full utterances, word types (along with frequency counts), and tokens (individual words). Let's look at each.
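To make the three levels concrete, here is a rough sketch that builds a token table and a type table from a tiny made-up utterance table. The data and the `toy_` names are invented, and the real childes-db tables carry many more columns (speaker, transcript, age, and so on), so treat this only as an illustration of how the levels relate.

```{r}
library(dplyr)

# Toy utterance table (made-up data, not real CHILDES transcripts)
toy_utterances <- tibble(
  utterance_id = 1:3,
  gloss = c("the dog runs", "a dog", "the cat")
)

# Tokens: one row per word occurrence
toy_words <- strsplit(toy_utterances$gloss, " ")
toy_tokens <- tibble(
  utterance_id = rep(toy_utterances$utterance_id, lengths(toy_words)),
  gloss = unlist(toy_words)
)

# Types: one row per distinct word, with a frequency count
toy_types <- toy_tokens |>
  count(gloss, name = "count")

toy_tokens
toy_types
```

Three utterances become seven tokens, which in turn collapse to five types.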
First, let's look at utterances. Note that I've put an age restriction here (27 to 29 months); otherwise this command would take a long time to run, because we'd essentially be retrieving the entire Brown corpus!
```{r}
adam_utterances <- get_utterances(corpus = "Brown",
                                  age = c(27, 29),
                                  role = "target_child",
                                  target_child = "Adam")
adam_utterances |>
  filter(id == 1763944)
```
Sometimes we want whole utterances but sometimes we want to focus on specific words or groups of words. Let's look at the word type "dog", as spoken by the children in Brown.
```{r}
dog_types <- get_types(corpus = "Brown", type = "dog", role = "target_child")
sum(dog_types$count)
```
Here we have counts of "dog" for each transcript in which the word was spoken. Transcripts in which it never occurs are simply absent (no zero counts are returned).
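If you need the zeros (for example, to plot counts for every transcript), one option is to join the counts onto a full list of transcripts and fill in the gaps. This is a minimal sketch on made-up data. The `toy_` tables and their columns are assumptions for illustration, not the exact columns that `get_types` or `get_transcripts` return.

```{r}
library(dplyr)

# Toy data (made up): four transcripts, "dog" occurs in only two of them
toy_transcripts <- tibble(transcript_id = 1:4)
toy_dog_counts <- tibble(transcript_id = c(2L, 4L), count = c(3L, 1L))

# Keep every transcript, then turn the missing counts into explicit zeros
toy_dog_filled <- toy_transcripts |>
  left_join(toy_dog_counts, by = "transcript_id") |>
  mutate(count = coalesce(count, 0L))

toy_dog_filled
```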
Finally, let's get instances of "dog" (individual tokens) in Brown.
```{r}
get_tokens(corpus = "Brown", role = "target_child", token = "dog")
```
EXERCISE: You can also get variants of an individual word. Let's use this example to look at the development of the plural. We'll request singular and plural "dog" tokens by asking for "dog" as our stem. Of course, we'll also get some others.
```{r}
dogs <- get_tokens(corpus = "Brown",
                   role = "target_child",
                   token = "*",
                   stem = "dog")
dogs
```
Now let's use our group-by/summarize workflow to try to understand when the plural of "dog" emerges relative to the singular for each child. Make a tibble showing the first age at which the words "dog" and "dogs" are used for each child.
Hint: `filter`, `group_by`, `summarise`!
```{r}
dog_singulars <- dogs |>
  filter(part_of_speech == "n", gloss %in% c("doggie", "dog", "doggy")) |>
  group_by(target_child_name) |>
  summarise(min_age_sg = min(target_child_age))

dog_plurals <- dogs |>
  filter(part_of_speech == "n", gloss %in% c("doggies", "dogs")) |>
  group_by(target_child_name) |>
  summarise(min_age_pl = min(target_child_age))

# join on child name; delay = age at first plural minus age at first singular
left_join(dog_singulars, dog_plurals, by = "target_child_name") |>
  mutate(delay = min_age_pl - min_age_sg)

# alternate solution
dogs |>
  filter(gloss %in% c("dog", "dogs")) |>
  group_by(speaker_name, gloss) |>
  summarise(first_age = min(target_child_age))
```
# Computing speaker statistics
Many users of CHILDES are interested in measures of syntactic development. One of the most famous of these is the "mean length of utterance" (MLU). MLU can be computed in either morphemes (traditional) or words (easier).
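As a rough illustration of the word-based version, MLU in words is just the total number of words divided by the number of utterances. The sketch below computes it per speaker on made-up utterances. The speaker labels and the `toy_` names are invented, and splitting on spaces is a simplifying assumption that ignores the transcription conventions a real computation has to handle.

```{r}
library(dplyr)

# Toy utterances (made up) for two hypothetical speakers
toy_mlu_utterances <- tibble(
  speaker = c("child_a", "child_a", "child_a", "child_b", "child_b"),
  gloss = c("the dog runs", "a dog", "more", "doggie", "my doggie")
)

# Count words per utterance, then average within each speaker
toy_mlu <- toy_mlu_utterances |>
  mutate(num_words = lengths(strsplit(gloss, " "))) |>
  group_by(speaker) |>
  summarise(mlu_w = mean(num_words))

toy_mlu
```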
Another set of measures captures *lexical diversity*, the breadth of vocabulary that the child uses. In practice there are a number of measures, from the simple type:token ratio (which can be biased) to more complex ones like MTLD.
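To see why the type:token ratio can be biased, here is a small sketch on a made-up word sequence. `toy_ttr` is a helper defined here for illustration only, not a `childesr` function.

```{r}
# Type:token ratio = number of distinct words / total number of words
toy_ttr <- function(words) length(unique(words)) / length(words)

# Toy word sequence (made up): the same three words keep recurring
toy_ttr_words <- c("the", "dog", "the", "cat", "the", "dog", "the", "cat")

toy_ttr(toy_ttr_words[1:4])  # first half of the sample
toy_ttr(toy_ttr_words)       # full sample
```

The vocabulary is identical in both cases (three types), but the ratio drops from 0.75 to 0.375 simply because the sample is longer. This dependence on sample size is the kind of bias that measures like MTLD try to reduce.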
These statistics are a little complicated to compute, so they are cached in their own table, called `speaker_statistics`.
```{r}
brown_stats <- get_speaker_statistics(corpus = "Brown", role = "target_child")
brown_stats
```
EXERCISE: Using the speaker statistics table, plot MLU by child age, with color showing the different children.
```{r}
ggplot(brown_stats,
       aes(x = target_child_age,
           y = mlu_w,
           colour = target_child_name)) +
  geom_point() +
  geom_smooth()
```
# And/or counts
In our final exercise, we'll replicate the general patterns of development for "and" and "or" found by Jasbi et al. (2022).
EXERCISE: use `get_tokens` to 1) get all instances of "and" and "or" spoken by either the child or the child's mother in the Brown corpus and 2) in a separate dataframe get ALL the tokens spoken by the child and the mother (you'll use this to normalize the counts so as to be able to compare transcripts of different sizes).
```{r}
and_or_tokens <- get_tokens(...)
all_tokens <- get_tokens(...)
```
Now summarise the total number of tokens of each word for each speaker and transcript, and create normalized counts.
```{r}
and_or_counts <- and_or_tokens |>
# ...
all_token_counts <- all_tokens |>
# ...
# now join and normalize and/or counts by total counts
and_or_freqs <- left_join(...)
```
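If the normalization step is unclear, here is a sketch of the general count, join, and divide pattern on a tiny made-up token table. It is not the solution to the exercise: the `toy_` names, the single transcript, and the role labels are all assumptions for illustration, and your real tables will have more grouping columns (such as child name and age).

```{r}
library(dplyr)

# Toy token table (made up): one transcript, two speakers
toy_all_tokens <- tibble(
  transcript_id = 1L,
  speaker_role = c(rep("Mother", 5), rep("Target_Child", 4)),
  gloss = c("you", "and", "me", "or", "him", "dog", "and", "cat", "and")
)

# Count "and"/"or" per transcript, speaker, and word
toy_and_or_counts <- toy_all_tokens |>
  filter(gloss %in% c("and", "or")) |>
  count(transcript_id, speaker_role, gloss, name = "n")

# Count ALL tokens per transcript and speaker
toy_totals <- toy_all_tokens |>
  count(transcript_id, speaker_role, name = "total")

# Join and normalize: proportion of each speaker's tokens that are "and"/"or"
toy_and_or_freqs <- toy_and_or_counts |>
  left_join(toy_totals, by = c("transcript_id", "speaker_role")) |>
  mutate(prop = n / total)

toy_and_or_freqs
```

Dividing by each speaker's own total is what makes long and short transcripts comparable.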
Now let's plot the result:
```{r}
ggplot(and_or_freqs, aes(x = target_child_age, y = prop, col = speaker_role)) +
  geom_point() +
  geom_smooth() +
  facet_grid(gloss ~ target_child_name, scales = "free_y")
```
The pattern within children looks quite a lot like the pattern across children, especially for "and", though these three children don't lag quite as much on "or".