-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path02-workflow.Rmd
343 lines (238 loc) · 67.5 KB
/
02-workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
# (PART) The cloudspotter's toolkit {-}
# From corpora to clouds {#workflow}
The main goal of the methodological framework presented here is to explore semasiological structure from textual data.
The starting point is a corpus, i.e. a selection of texts, and one of the most tangible outputs is what we will call *clouds*: the visual representation of textual patterns as dense areas in a `r sc("2d")` scatterplot.
In this chapter we will explain how to generate clouds from the raw, seemingly indomitable ocean of a corpus.
First, we will describe how token-level vector space models are created: these are mathematical representations of the occurrences of a lexical item.
We will focus on context-counting models, but this is by no means the only viable path.
Other techniques, such as BERT [@BERT]^[See also @devries.etal_2019 for a Dutch version.], that can also generate vectors for individual instances of a word, could be used for the first stage of this workflow.
<!-- TODO explain why we don't use them -->
Section \@ref(vector-creation) will describe the process and the rationale without assuming a strong mathematical background for the reader, leaving the deeper technicalities to Section \@ref(formulae). In Section \@ref(params), we will break apart the workflow into the multiple choices that the researcher needs to make and that result in a potentially infinite number of models, while Section \@ref(pam) briefly presents a method to select a few representative models. Finally, Section \@ref(workflow-summary) summarizes the chapter.
## A cloud machine {#vector-creation}
At the core of vector space models, *aka* distributional models, we find the Distributional Hypothesis, which is often linked to Harris's observation that "difference of meaning correlates with difference of distribution" [-@harris_1954,156], but also to Firth's "You shall know a word by the company it keeps" [-@firth_1957a,11] and Wittgenstein's "the meaning of a word is its use in the language"^[The famous quote is preceded by an appropriate nuance: "For a *large* class of cases --- though not for all --- in which we employ the word 'meaning' it can be defined thus: the meaning of a word is its use in the language".
<!-- Proposition 43, which fully reads: "Man kann für eine *große* Klasse von Fällen der Benützung des Wortes 'Bedeutung' --- wenn auch nicht für *alle* Fälle seiner Benützung --- dieses Wort so erklären: Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache.". -->
] [-@wittgenstein_1958,20].
In other words, items that occur in similar contexts in a given corpus will be semantically similar, while those that occur in different contexts will be semantically different [@jurafsky.martin_2020, Ch. 6; @lenci_2018]. Crucially, this does not imply that we can describe an individual item with their distributional properties, but that comparing the distribution of two items can tell us something about their semantic relatedness [@sahlgren_2006,19].
```{r, qlvlfreqs, include = FALSE}
topfreqs <- round(qlvlcorp$head$freq / 1000000, 1)
top4 <- round(qlvlcorp$head$cumsum2[[4]]*100)
```
@firth_1957a inspired generations of corpus linguists to look at collocations as part of the semantic description of a lemma. The Birmingham school, pioneered by John Sinclair, used co-occurrence frequency information to describe a lexical item
<!-- Most often word forms, in English; see @sinclair_1991 [29] and; @stubbs_1995 [23-24]) -->
by the set of those context words most attracted to them. Due to the skewed distribution of word frequencies, known as Zipf's law, this attraction cannot be measured in terms of raw co-occurrence frequencies. For example, the most frequent lemma in the (Dutch) corpus used for this research, discarding punctuation, is *de* 'the (fem./masc.)', which occurs `r topfreqs[[1]]` million times. The second most frequent lemma, *van* 'from', occurs `r topfreqs[[2]]` milion times, and it is followed by *het* 'the (neutral)' and *een* 'a, an', with corresponding frequencies of `r topfreqs[[3]]` and `r topfreqs[[4]]` million times each. For every 100 words in the corpus, excluding punctuation, $`r top4`$ are one of these four words. Of the total of `r round(qlvlcorp[["total_types"]]/1000000, 1)` million different words, `r round(qlvlcorp$hapax_legomena/qlvlcorp$total_types*100, 1)`% are *hapax legomena*, i.e. they occur once, and 172 lemmas cover 50% of all the occurrences. As a consequence, co-occurrences with very frequent words are not as informative as those with less frequent words, and hence raw co-occurrence frequencies are transformed to measures of **association strength**, such as mutual information (see Section \@ref(pmi)) or t-score, among others [for an overview see @evert_2009; @gablasova.etal_2017].
In collocational studies, researchers typically set a threshold of association strength and only look at the context words that surpass it.
At their core, context-counting **vectors** are lists of association strength values. Each word is represented by its association strength to a long array of words that it might co-occur with, as shown in Table \@ref(tab:vec1). Unlike in collocation studies, low values --- or even lack of co-occurrence --- are not excluded, but used in the comparison with other words that might. Going back to the Firthian motto, a collocational study would describe me with the list of people that I talk to the most, whereas a distributional model would compare me to someone else based on who either of us talks to and how often we talk to them. The more people we have in common, the more similar we are, but people that neither of us talks to have no impact on the comparison.
Table \@ref(tab:vec1) shows small vectors representing the English nouns *linguistics*, *lexicography*, *research* and *chocolate*, as well as the adjective *computational*, with co-occurrence information obtained from the GloWbE (Global Word-based English) corpus. The values are their association strength `r sc("pmi")` with each of the lemmas in the columns: the higher the values, the stronger the attraction between the word in the row and the word in the column (See Section \@ref(pmi)). From a collocational perspective, *linguistics* is strongly attracted to both *language* and *English*, i.e. they occur very often in a span of 10 words from each other, considering their individual frequencies; it is less attracted to *word* and *to speak*, and does not co-occur with either *to eat* or *Flemish* within that window, in this corpus.
(ref:vec1) `r str_glue("Small example of type-level vectors, with {sc('pmi')} values based on a symmetric window of 10. Frequency data extracted from GloWbE.")`
```{r, vec1}
vex <- read_csv(here("assets", "vector-examples", "vectorexample.csv"), col_types = cols()) %>%
mutate(across(where(is.numeric), round, 2)) %>%
mutate(across(where(is.numeric), na2zero))
kable(vex,
caption = "(ref:vec1)",
booktabs = T) %>%
kable_paper(latex_options = c("scale_down")) %>%
footnote(general = r"(Part-of-speech is indicated after a slash: n = noun, j = adjective, v = verb)",
threeparttable = T)
```
Each row in Table \@ref(tab:vec1) is a vector coding the distributional information of the lemma it represents. By **lemma** we refer to the combination of a stem and a part of speech, e.g. *chocolate/n* covers *chocolate*, *chocolates*, *Chocolate*, etc.
<!-- This approach is different from the dominant tendency in computational linguistics, where vectors would represent word forms^[See @turney.pantel_2010 155 and @kiela.clark_2014 25. for a brief discussion, but more crucially: this is not always made explicit in papers on distributional methods. Curiously, Normally this is not even made explicit, but the few examples of lists of context words show that it is the case]. Identifying lemmas requires some corpus processing, namely stemming and part-of-speech tagging, but makes more sense from a lexicographical point of view (see Section \@ref(params) for a discussion). -->
<!-- Regardless, the vectors still represent forms, not meanings: the vector of *linguistics* codes co-occurrence frequency of *Linguistics* and *linguistics*, but that is not equal to the concept `r sc("linguistics")`. -->
<!-- More importantly, we must remember that the vectors represent a distributional profile of the lemma, which is automatically extracted from a corpus, not of a meaning [Cf. @bolognesi_2020 82]. -->
These vectors are meant to code the distributional behaviour of the linguistic forms they represent --- in this case lemmas ---, in order to operationalize the notion of distributional similarity and, consequently, model their meaning. For example, in Table \@ref(tab:vec1) the first two rows, representing *linguistics* and *lexicography*, are similar to each other: both words have a similar attraction to *language* and to *English*, even if the values for *word* and *to speak* are more different. More importantly, they are more similar to each other than to other rows in the table, which have lower values for those four columns and might even co-occur with *Flemish* and *to eat* as well.
The Distributional Hypothesis expresses the observation that words that are distributionally similar, like *linguistics* and *lexicography*, are semantically similar or related, whereas words that are distributionally different, like *linguistics* and *chocolate*, are semantically different or unrelated.
The rows in this table are **type-level vectors**: each of them aggregates over all the attestations of a given lemma in a given corpus to build an overall profile. As a result, it collapses the internal variation of the lemma, i.e. its different senses or semasiological structure. In order to uncover such information, we need to build vectors for the individual instances or **tokens**, relying on the same principle: items occurring in similar contexts will be semantically similar. For instance, we might want to model the three (artificial) occurrences of *study* in (@ex1) through (@ex3), where the target item is in bold and some context words are in italics.
(@ex1) Would you like to **study** *lexicography*?
(@ex2) They **study** this in *computational linguistics* as well.
(@ex3) I eat *chocolate* while I **study**.
Given that, at the aggregate level, a word can co-occur with thousands of different words, type-level vectors can include thousands of values. In contrast, token-level vectors can only have as many nonzero values as the individual window size comprises, which drastically reduces the chances of overlap between vectors. In fact, the three examples don't share any item other than the target. As a solution, inspired by @schutze_1998, (a selection of) the context words around the token is replaced with their respective type-level vectors [@heylen.etal_2012; @heylen.etal_2015; @depascale_2019].
Concretely, example (@ex1) is represented by the vector for its context word *lexicography*, that is, the second row in Table \@ref(tab:vec1); example (@ex2) by the sum of the vectors for *linguistics* (row 1) and *computational* (row 3); and example (@ex3) by the vector for *chocolate* (row 5). This not only solves the sparsity issue, ensuring overlap between the vectors, but also allows us to find similarity between (@ex1) and (@ex2) based on the similarity between the vectors for *lexicography* and *linguistics*. As we will see in Section \@ref(params), we can even use the association strength between the context words and the target type, i.e. *to study*, and give more weight to the context words that are more characteristic of the lemma we try to model.
The result of this procedure is a co-occurrence matrix like the one shown in Table \@ref(tab:tokens). Each row represents an instance of the target lemma, e.g. *to study*, and each column, a lemma occurring in the corpus^[Which lemmas in particular are a matter for Section \@ref(soc), but in any case, lemmas that do not co-occur with any of the context words of the tokens will have zeros in all the rows and therefore be dropped.]; the values are the (sum of the) association strength between the words that occur around the token, i.e. their **first-order context words**, and each of the words in the columns, i.e. the **second-order context words**. In addition, all negative and missing values value have been set to zero, due to the unreliability of negative `r sc("pmi")` values (see Section \@ref(pmi)).
(ref:tokens) `r str_glue("Small example of token-level vectors of three artificial instances of *to study*.")`
```{r, tokens}
tokens <- read_csv(here("assets", "vector-examples", "tokensexample.csv"), col_types = cols()) %>%
mutate(across(where(is.numeric), round, 2)) %>%
mutate(token_id = str_extract(target, "[0-9]"), target = str_replace(r"(study$_n$)", "n", token_id)) %>% select(-token_id)
tokens %>%
mutate(across(where(is.numeric), pmi2positive)) %>%
kable(caption = "(ref:tokens)",
booktabs = T,
escape = F) %>%
kable_paper(latex_options = c("scale_down"))
```
The next step in the workflow is to compare the items to each other. We can achieve this by computing cosine distances between the vectors (see Section \@ref(cosine) for the technical description). The resulting distance matrix, shown in Table \@ref(tab:tokdists), tells us how different each token is to itself, which takes the minimum value of 0, and to each of the other tokens, with a maximum value of 1. We can see that (@ex1) and (@ex2) are very similar to each other, because they co-occur with similar context words, i.e. *linguistics* and *lexicography*, but drastically different from (@ex3), which was modelled based on *chocolate*. The specific selection of context words is crucial: if we had selected *computational* but not *lexicography* to model (@ex2), it would have resulted in a larger difference with (@ex1). The series of choices that we can make and that have been made for this research project are discussed in Section \@ref(params).
(ref:tokdists) `r str_glue("Cosine distance matrix between the three artificial instances of *to study*.")`
```{r, tokdists}
tokdists <- tokens %>%
mutate(across(where(is.numeric), pmi2positive)) %>%
data.frame(row.names = "target") %>%
as.matrix() %>% t() %>%
lsa::cosine()
kable(round(1-tokdists, 2),
caption = "(ref:tokdists)",
booktabs = T, escape = F) %>%
kable_paper()
```
Table \@ref(tab:tokdists) is small and simple, but what if we had hundreds of tokens? The more items we compare to one another, the larger and more complex the distance matrix becomes. In order to interpret it, we need more stages of processing. On the one hand, dimensionality reduction techniques such as `r sc('mds')`, t-`r sc('sne')` and `r sc('umap')`, which will be discussed in Section \@ref(dim-reduction), offer us a way of visualizing the distances between all the models by projecting them to a `r sc("2d")` space. We can then represent each model into a scatterplot, like in the plots of Figure \@ref(fig:cloud1), where each point represents a token, and their distances in `r sc("2d")` space approximate their distances in the multidimensional space of the co-occurrence matrix. Visual analytics, such as the tool described in Chapter \@ref(nephovis), can then help us explore the scatterplot to figure out how tokens are distributed in space, why they form the groups they form, etc.
Word Sense Disambiguation
<!-- TODO add references -->
makes use of clustering algorithms to extract clusters of similar tokens from their models. The idea behind it is that, if distributional similarity correlates with semantic similarity, groups of similar tokens should share the same sense and have a different sense from other groups of tokens. In Chapter \@ref(semantic-interpretation) we will see to what degree this assumption holds in this data and with these methods.
The final step in our workflow is, then, the combination of dimensionality reduction and clustering, which results in the right plot of Figure \@ref(fig:cloud1): by means of dimensionality reduction, tokens are located in a scatterplot so that distributional similarities are approximated as spatial similarities, and groups of similar tokens are assigned different colours.
In previous research, which did not integrate clustering procedures in this manner, the term *cloud* was used to refer to a full model [@heylen.etal_2015; @wielfaert.etal_2019; @depascale_2019; @montes.heylen_2022]. In this study, instead, *cloud* will refer to each of the clusters, identified by colours in the scatterplot.
(ref:cloud1) `r sc("2d")` representation of Dutch *hachelijk* 'dangerous/critical'.
```{r, cloud1, fig.cap = '(ref:cloud1)'}
plot_grid(
plotNude(d$hachelijk$medoidCoords[[1]]$coords),
plotBasic(d$hachelijk$medoidCoords[[1]]))
```
## The chemistry of cloud making {#formulae}
A typical vector space model is an item-by-feature matrix: its rows code items, its columns code features, and its cells code information related to the frequency with which the items and features co-occur. The first distributional models counted the occurrences of words in documents and represented them in word-by-document matrices; the models described here are token-by-feature matrices, in which the rows are attestations of a lexical item and the features are second-order co-occurrences, i.e. context words of the context words of the token. @turney.pantel_2010 offer an overview of different kinds of matrices, based on the items modelled and the features used to describe them.
Besides matrices, vector space models can be tensors, which are generalisations of matrices for more dimensions and can allow for more complex interactions, e.g. subject-verb-object triples in @vandecruys.etal_2013; see also @lenci_2018.
Models can be based on co-occurrence counts, as is the case in these studies, or on machine-learning algorithms trained to predict the context around a word or fill in an empty slot given a few words around it. These context-predicting models use the weights of their neural networks as features in the vectorial representations of the words they predict. A number of papers have explored which kind of models work best for different tasks, with uncertain results [@baroni.etal_2014; @levy.etal_2015].
As explained before, such context-predicting models will not be explored in this dissertation, although their integration would be interesting for further avenues of research.
The workflow described in the previous section relies on mathematical principles to obtain linguistic patterns from a (mostly) raw corpus. A full understanding of the formulae that underlie each step is not necessary to grasp the gist of this methodology, but it is required for an appropriate implementation.
In this section, we will take a deeper look into the technical aspects the machinery behind the process, in particular association strengths, similarity metrics, dimensionality reduction techniques and clustering algorithms.
### Association strength: PMI {#pmi}
The distribution of words in a corpus follows a power law: a few items are extremely frequent, and most of the items are extremely infrequent. Association measures transform raw frequency information to measure the attraction between two items while taking into account the relative frequencies with which they occur. They typically manipulate, in different ways, the frequency of the node $f(n)$, the frequency of its collocate $f(c)$, their frequency of co-occurrence $f(n,c)$ and the size of the corpus $N$. @evert_2009 and @gablasova.etal_2017 offer an overview of how different measures are computed and used in corpus linguistics; @kiela.clark_2014
<!-- And anyone else? -->
compare measures used in distributional models.
In the studies discussed here, I will only use **(positive) pointwise mutual information**, or `r sc("(p)pmi")` [@church.hanks_1989], one of the most popular measures both in collocation studies and distributional semantics [@bullinaria.levy_2007; @kiela.clark_2014; @jurafsky.martin_2020; @lapesa.evert_2014].
Its formula is shown in equation \@ref(eq:pmi), where $p(n) = \frac{f(n)}{N}$, i.e. the proportion of occurrences in the corpus that correspond to $n$.
\begin{equation}
I(n, c) = \log \frac{p(n,c)}{p(n)p(c)} = \log \left( \frac{f(n,c)}{f(n)f(c)} N \right)
(\#eq:pmi)
\end{equation}
<!-- $$PMI(n, c) = \log \frac{p(n, c)}{p(n)p(n)}$$ -->
Negative `r sc("pmi")` values tend to be unreliable, so positive `r sc("pmi")` or `r sc("ppmi")` is used, in which the negative `r sc("pmi")` values are turned to zeros [@bullinaria.levy_2007; @kiela.clark_2014; @depascale_2019; @jurafsky.martin_2020,109].
Furthermore, `r sc("pmi")` is known for its bias towards infrequent events: when either $p(n)$ or $p(c)$ is very low, `r sc("pmi")` tends to be very high. In collocation studies, this bias may be counteracted by combining `r sc("pmi")` filters with other measures that favour frequent co-occurrences, such as t-scores or log-likelihood ratio [@mcenery.etal_2010]. In distributional semantics, the accuracy of models that rely on `r sc("ppmi")` seems not to be affected by the issue presented by this bias;
moreover, in these studies any lemma with $f(n) < 217$, i.e. occurring less than once every two million tokens, was excluded, to avoid too sparse, uninformative vectors.
### Similarities and distances: cosine {#cosine}
After obtaining the token-by-feature matrices, the distances between the vectors must be computed. Typically, the implementations for the dimensionality reduction and clustering can take the item-by-feature matrices as input and compute the distances under-the-hood, but they do not necessarily offer the option of computing our distance measure of choice, cosine.
Cosine is a measure of similarity between vectors $\mathbf{v}$ and $\mathbf{w}$ and is defined in equation \@ref(eq:cosine); it coincides with the normalised dot product of the vectors [@jurafsky.martin_2020,105].
\begin{equation}
\mathrm{cosine}(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{\left|\mathbf{v}\right|\left|\mathbf{w}\right|} = \frac{\sum\limits_{i=1}^N v_iw_i}{\sqrt{\sum\limits_{i=1}^N v_i^2}\sqrt{\sum\limits_{i=1}^N w_i^2}}
(\#eq:cosine)
\end{equation}
For positive values, e.g. when using `r sc("ppmi")`, the cosine similarity ranges between 0 and 1: it will be 1 between identical vectors and 0 for orthogonal vectors, which do not share nonzero dimensions, like study$_1$ and study$_3$ in Table \@ref(tab:tokens). Cosine is sensitive to the angle between the vectors, and not to their magnitude: the similarity between study$_1$ and a vector created by multiplying all the cells in study$_1$ by any constant will still be 1.
Cosine similarity is the most common metric in distributional models [@jurafsky.martin_2020,105] and has been shown to outperform other measures, especially when combined with `r sc("ppmi")` [@kiela.clark_2014; @lapesa.evert_2014; @bullinaria.levy_2007]^[@kiela.clark_2014 also recommend a Correlation similarity.]. One of the ways in which it is used is for semantic similarity tasks: the nearest neighbours of an item are extracted, by selecting the vectors with highest cosine similarity to the target vector. In these studies, similarities are usually transformed to distances by inverting the scale ($\mathrm{cosine}_{\mathrm{dist}} = 1- \mathrm{cosine}_{\mathrm{sim}}$), so that identical vectors --- and each vector to itself --- have a cosine distance of 0 and orthogonal vectors have a cosine distance of 1, as shown in Table \@ref(tab:tokdists).
Before applying dimensionality reduction or clustering algorithms, the cosine distances were further transformed with the aim of giving more weight to short distances, i.e. nearest neighbours, and decreasing the impact of long distances. For each token vector $\mathbf{v}$ with $n$ dimensions, we define the transformed vector $\mathbf{v}_{\mathrm{transformed}}$ as $\mathbf{v}_{\mathrm{transformed}_i} = \log (1 + \log rank(\mathbf{v})_i)$ for each $i$, with $1 \le i \le n$, and where $rank(\mathbf{v})_i$ is the similarity rank of the $i$th value in $\mathbf{v}$. For example, if originally we have the distances $\mathbf{v} = [0, 0.2, 0.8, 0.3]$, the rank transformation returns $rank(\mathbf{v}) = [1, 2, 4, 3]$, which after the first logarithm transformation becomes $[0, 0.693, 1.39, 1.099]$ and, after the second transformation, $\mathbf{v}_{\mathrm{transformed}} = [0, 0.52, 0.86, 0.74]$. On the one hand, the magnitude of the distance is not as important as its ranking among the nearest neighbours. On the other, the lower the ranking, the smaller the impact: the difference between the final values for ranks 1 and 2 is larger than between ranks 2 and 3.
The new matrix, where each row $\mathbf{v}$ has been replaced with its $\mathbf{v}_{\mathrm{transformed}}$, is converted to euclidean distances.
While cosine distances are used to measure the similarity between token-level vectors, euclidean distances will be used to compare two vectors of the same token across models, and thus compare models to each other. Concretely, let's say we have two matrices, $\mathbf{A}$ and $\mathbf{B}$, which are two models of the same sample of tokens, built with different parameter settings, and we want to know how similar they are to each other, i.e. how much of a difference those parameter settings make. Their values are already transformed cosine distances. A given token $i$ has a vector $\mathbf{a}_i$ in matrix $\mathbf{A}$ and a vector $\mathbf{b}_i$ in matrix $\mathbf{B}$. For example, $i$ could be example (@ex2) above, and its vector in $\mathbf{A}$ is based on the co-occurrence with *computational* and *linguistics*, as shown in Table \@ref(tab:tokens), while its vector in $\mathbf{B}$ is only based on *computational*. The euclidean distance between $\mathbf{a}_i$ and $\mathbf{b}_i$ is computed with the formula shown in equation \@ref(eq:eucli). After running the same comparison for each of the tokens, the distance between the models $\mathbf{A}$ and $\mathbf{B}$ is then computed as the mean of those tokenwise distances across all the tokens modelled by both: $d(\mathbf{A},\mathbf{B}) = \frac{\sum_{i=1}^nd(\mathbf{a}_i, \mathbf{b}_i)}{N}$. Alternatively, the distances between models could come from procrustes analysis^[Run with `vegan::procrustes()` [@R-vegan]], like @wielfaert.etal_2019 do, which has the advantage of returning a value between 0 and 1. However, this method is much faster and returns comparable results.
\begin{equation}
d(\mathbf{a}_i, \mathbf{b}_i) = \sqrt{\sum\limits_{i=j}^n(a_j-b_j)^2}
(\#eq:eucli)
\end{equation}
### Dimensionality reduction for visualization: t-SNE {#dim-reduction}
Dimensionality reduction algorithms try to reduce the number of dimensions of a high-dimensional entity while retaining as much information as possible. In distributional models, they have two main applications: one that reduces thousands of dimensions to a few hundred, and one that reduces them to only two.
The first application of dimensionality reduction, with techniques like `r sc("svd")` (Singular Value Decomposition), is meant to deal with the sparsity of high-dimensional vectors. Due to the frequency distribution discussed above, many words never occur in the vicinity of each other, resulting in many zeros in their context-counting representations and therefore inflated differences between the vectors. In particular, techniques like Latent Semantic Analysis [@landauer.dumais_1997] are based on the observation that the dimensions obtained from this process are semantically interpretable.
<!-- <!-- NOTE add authors, Sahlgren, Van De Cruys... -->
<!-- <!-- @vandecruys_2008 tried both SVD and non-negative matrix factorization on type-level BOW-based models with whole paragraphs as windows but they did not bring any improvement. -->
It could also be used for token-level spaces, but the comparisons discussed in @depascale_2019 [246] indicate that they don't necessarily perform better than non reduced spaces.
<!-- Also check Van De Cruys' work, some LSA? Remember: Kiela+Clark do not look into this-->
Both dimensionality reduction techniques and neural networks are suggested as ways of condensing very long, sparse vectors [@jurafsky.martin_2020; @bolognesi_2020].
We will not go into the technical aspects because these techniques have not been implemented in the studies described here. Instead, we have compared vectors of different lengths based on other selection methods for the second-order features. Combining them with `r sc("svd")` is a possible avenue for future comparisons.
The second application of dimensionality reduction is used for visualization purposes. A token-by-feature matrix can be understood as a multidimensional space: each of the columns is a dimension of space and the values of cells are the coordinates of the items in each of these dimensions. That is why we can use cosine distances, which measures angles: if we draw a vector from the origin (zero in all dimensions) to the point with those coordinates, it diverges from other vectors with a given angle that grows wider as the vectors diverge, leading to larger cosine distances. We can mentally picture or even draw positions, vectors and angles in up to 3 dimensions, but distributional models have hundreds if not thousands of dimensions. These applications of dimensionality reduction, then, are built to project the distances between items in the multidimensional space to euclidean distances in a low-dimensional space that we can visualize. The different implementations can receive the token-by-feature matrix as input, but will not typically compute cosine distances between the items, so the distance matrix is provided as input instead.
The literature tends to go for either multidimensional scaling (`r sc("mds")`) or t-stochastic neighbour embeddings (t-`r sc("sne")`); recently, an interesting alternative called `r sc("umap")` has been introduced, which I'll discuss shortly.
The first option, `r sc("mds")`, is an ordination technique, like principal components analysis (`r sc("pca")`). It has been used for decades in multiple areas [e.g. @cox.cox_2008]; its most relevant application for this case, non-metric multidimensional scaling, was developed by @kruskal_1964. It tries out different low-dimensional configurations aiming to maximize the correlation between the pairwise distances in the high-dimensional space and those in the low-dimensional space: items that are close together in one space should stay close together in the other, and items that are far apart in one space should stay far apart in the other.
The output from `r sc("mds")` can be evaluated by means of the stress level, i.e. the complement of the correlation coefficient: the smaller the stress, the better the correlation between the measures.
Unlike `r sc("pca")`, however, the dimensions are not meaningful *per se*; two different runs of `r sc("mds")` may result in plots that mirror each other while representing the same thing. Nonetheless, the R implementation `vegan::metaMDS()` [@R-vegan] rotates the plot so that the horizontal axis represents the maximum variation.
In cognitive linguistics literature both metric [@koptjevskaja-tamm.sahlgren_2014; @hilpert.correiasaavedra_2017; @hilpert.flach_2020]
and non-metric `r sc("mds")` [@heylen.etal_2012; @heylen.etal_2015; @perek_2016; @depascale_2019] have been used.
The second technique, t-`r sc("sne")` [@Rtsne2008; @Rtsne2014], has also been incorporated in cognitive distributional semantics [@perek_2018; @depascale_2019].
It is also popular in computational linguistics [@smilkov.etal_2016; @jurafsky.martin_2020]; in R, it can be implemented with `Rtsne::Rtsne()` [@R-Rtsne].
The algorithm is quite different from `r sc("mds")`: it transforms distances into probability distributions and relies on different functions to approximate them. Moreover, it prioritises preserving local similarity structure rather than the global structure: items that are close together in the high-dimensional space should stay close together in the low-dimensional space, but those that are far apart in the high-dimensional space may be even farther apart in low-dimensional space.
Compared to `r sc("mds")`, we obtain nicer, tighter clouds (see Figure \@ref(fig:drviz)), but the distance between them is less interpretable: even if we trust that tokens that are very close to each other are also similar to each other in the high-dimensional space, we cannot extract meaningful information from the distance *between* these groups.
In addition, it would seem that points that are far away in a high-dimensional space might show up close together in the low-dimensional space [@oskolkov_2021].
In contrast, Uniform Manifold Approximation and Projection, or `r sc("umap")` [@mcinnes.etal_2020], penalizes this sort of discrepancies. It would be an interesting avenue for further research, but a test on the current data did not reveal substantial improvements between t-`r sc("sne")` and `r sc("umap")` that would warrant the replacement of the technique within the duration of this project (see Figure \@ref(fig:drviz) for an example with default parameters. In other models, differences include longer shapes). Other known advantages such as increased speed were not observed in the small samples under consideration --- in fact, the R implementation of `r sc("umap")` [@R-umap] was even slower.
(ref:drviz) Two `r sc("2d")` representations of the same model of *hachelijk* 'dangerous/critical': `r nameModel(names(d$hachelijk$medoidCoords)[[1]])`. Non-metric `r sc("mds")` on the top left, t-`r sc("sne")` to its right and `r sc("umap")` at the bottom. Colours indicate `r sc("hdbscan")` clusters.
```{r, drviz, fig.cap = '(ref:drviz)', fig.height = 6}
hach_mds <- read_tsv(here("assets", "hachelijk.mds.tsv"), col_types = cols())
hach_umap <- read_tsv(here("assets", "hachelijk.umap.tsv"), col_types = cols())
top_grid <- plot_grid(
plotBasic(hach_mds),
plotBasic(d$hachelijk$medoidCoords[[1]]),
labels = c("NMDS", "t-SNE"), label_fontfamily = "Bookman Old Style")
plot_grid(
top_grid, plotBasic(hach_umap),
labels = c("", "UMAP"),
label_fontfamily = "Bookman Old Style", ncol = 1, hjust = -4)
```
Unlike `r sc("mds")`, t-`r sc("sne")` requires setting a parameter called **perplexity**, which roughly indicates how many neighbours the preserved local structure should cover. Low values of perplexity lead to numerous small groups of items, while higher values of perplexity return more uniform, round configurations [@wattenberg.etal_2016]. I have explored perplexity values of 10, 20, 30 and 50, and for this dataset 30 --- the default value in the R implementation --- has proved to be the most stable and meaningful. Unless otherwise stated, the figures in this text --- including Figure \@ref(fig:cloud1) --- will illustrate t-`r sc("sne")` token-level representations with perplexity of 30. To represent distances between models, instead, non-metric `r sc("mds")` is used (only in Section \@ref(nepho1)).
For both `r sc("mds")` and t-`r sc("sne")` we need to state the desired number of dimensions before running the algorithm --- for visualization purposes, the most useful choice is 2. Three dimensions are difficult to interpret if projected on a `r sc("2d")` space, such as a screen or paper [@card.etal_1999; @wielfaert.etal_2019].
<!-- I think UMAP didn't need to state it beforehand? CHECK -->
As we mentioned before, the dimensions themselves are meaningless, hence no axes or axis ticks will be included in the plots. However, the scales of both coordinates are kept fixed: given three points $a=(1, 1.5)$, $b=(1, 0.5)$ and $c=(0, 1.5)$, the distance between $a$ and $b$ (1 unit along the $x$-axis) will be the same as the distance between $a$ and $c$ (1 unit along the $y$-axis).
### Clustering: HDBSCAN {#hdbscan}
In word sense disambiguation tasks, the vectorial representations of different attestations are clustered into groups of similar tokens. There is a variety of clustering algorithms, appropriate for different kinds of data and structures. I will not offer an overview of the options, but only describe the techniques used in these studies. This section is dedicated to `r sc("hdbscan")`, the algorithm that returns the coloured clusters in Figures \@ref(fig:cloud1) and \@ref(fig:drviz). Section \@ref(pam) will discuss `r sc("pam")`, which will be uses to select representative models.
Hierarchical Density-Based Spatial Clustering of Applications with Noise, `r sc("hdbscan")` for the friends [@campello.etal_2013], is a clustering algorithm, i.e. a procedure to identify groups of similar items that are different from other groups. Unlike its better-known cousins, it does not try to place *all* the items in the sample in different groups, but instead assumes that the dataset might be noisy and that the items may have various degrees of membership to their respective clusters. In addition, as a density-based algorithm, it tries to discriminate between dense areas, i.e. groups of elements that are very similar to each other, from sparse areas, i.e. larger distances between the elements.
<!-- For that purpose, given a minimum number of points $minPts$, it establishes a value $\varepsilon$ (epsilon) for each of the elements, defined as the minimum radius that covers the $minPts - 1$ nearest neighbours of the element. If $minPts$ is set to 8, the $\varepsilon$ of item $x$ is given by the radius around $x$ in which we can find the 7 items that are most similar to it. The smaller the $\varepsilon$, the denser the area in which $x$ is located. -->
In `r sc("hdbscan")`, the density of the area in which we find a point $a$ is estimated by calculating its
core distance $core_{k}(a)$, which is the distance to its $k$ nearest neighbour, $k$ being a parameter $minPts - 1$.
This measure is at the base of the mutual reachability distance, shown in equation \@ref(eq:mreach), which is used to compute a new distance matrix for a single-linkage hierarchical clustering algorithm. As a result, the items are organised in a hierarchical tree, from which clusters are selected based on the $minPts$ requirement and their densities. A related notion to $core_k(a)$ is $\varepsilon$, which is defined as the radius around a point in which $minPts - 1$ can be found.
\begin{equation}
d_{mreach}(a,b) = \max(core_{k}(a), core_{k}(b), d(a,b))
(\#eq:mreach)
\end{equation}
In `r sc("dbscan")`, we need to set both $minPts$ and a $\varepsilon$ threshold; the procedure is different, but its result is equivalent to cutting the hierarchical tree from `r sc("hdbscan")` at a fixed $\varepsilon$, so that the items above that threshold are discarded as noise, and those below it are grouped into their respective clusters. In contrast, its hierarchical version, `r sc("hdbscan")`, implements variable thresholds to maximize the stability of the clusters, and therefore only requires us to input $minPts$^[For a friendly description of how the algorithm works, I recommend @mcinnes.etal_2016 or even their conference presentations in YouTube.].
In R, the algorithm can be implemented with `dbscan::hdbscan()` [@R-dbscan]. Its input can be an item-by-feature matrix or, like in this case, a distance matrix. The output includes, among other things, the cluster assignment, with noise points assigned to a cluster 0, membership probability values, which are core distances normalized per cluster, and $\varepsilon$ values, which can be used as an estimate of density.
## Making it your own: parameter settings {#params}
Building models implies making a number of choices, from the source of the data and the unit of analysis, to the definition of what counts as context, to the techniques and parameters for visualization and clustering. Making these decisions explicit is crucial: on the one hand, they are necessary to interpret the models themselves, but on the other, they are essential for reproducibility.
For each of the 32 lemmas studied in this project, 200-212 models were created, resulting from the combination of parameter settings meant to define the first-order and second-order contexts. Other choices have been kept fixed across all models in this study, for various reasons, among which are practicality and best performance in the literature. The parameter space is virtually infinite, and exploring even more variations did would have increased the number of models exponentially and made the kind of thorough, qualitative descriptions performed here infeasible. Admittedly, some of the variable parameters could have remained fixed, and some of the fixed parameters could have been varied. Such paths remain open for future projects.
In this section, I will discuss these decisions: both the ones that have remained fixed across all the studies and the variations that characterize the multiple models under study. Fixed decisions are not specified in the names of the models; variable parameters, which distinguish models from each other, are coded in their names. When mentioned in further sections, they will be described in three parts: first-order parameters (Section \@ref(foc)), `PPMI` (Section \@ref(pmisel)) and second-order parameters (Section \@ref(soc)). The values of the parameter settings will be set in `monospace`.
### Fixed decisions {#corpus}
First, the analyses presented in this dissertation were performed on a corpus of Dutch and Flemish newspapers: the mode is written and the genre, journalistic. Called the *QLVLNewsCorpus* [@depascale_2019,30], it combines parts of the Twente Nieuws Corpus of Netherlandic Dutch [@ordelman.etal_2007] and the yet unpublished Leuven Nieuws Corpus. It comprises articles published between 1999 and 2004, belonging to popular and quality sources for both regions in equal proportion^[The newspapers include *Het Laatste Nieuws*, *Het Nieuwsblad*, *De Standaard* and *De Morgen* as Flemish sources and *Algemeen Dagblad*, *Het Parool*, *NRC Handelsblad* and *De Volkskrant* as Netherlandic sources.] and amounting to a total of 520 million tokens, including punctuation. The corpus was lemmatized and tagged with part-of-speech and dependency relations with Alpino [@vannoord_2006].
Second, the unit of analysis, the **lemma**, was defined as a combination of stem and part-of-speech^[The corpus was not lemmatized by stemmed.]. This applies to items at all levels: the definition of a target, the first-order context features and the second-order features; co-occurrence frequencies and association strength measures are always computed with the lemma as unit. Both distributional models and some traditions in collocation research may use word forms instead^[See @turney.pantel_2010 [155] and @sahlgren_2008 [47-48] for a discussion and @kiela.clark_2014 [25] for performance comparisons.]. On the one hand, stemming and tagging add a layer of processing and interpretation to the text; on the other, word forms of the same lemma tend to behave in different ways. From a lexicographic and lexicological perspective, however, it makes sense to use a lemma as a unit. It is the head of dictionary entries and a more typical unit of linguistic analysis. Furthermore, the (mis)match between word forms and lemmas strongly depends on the language under study: in languages like Spanish, French, Japanese and Dutch, verbs can take many more different forms than in English; conversely, Mandarin lacks morphological variation or even spaces between what could count as words. Concretely, the word form *hoop* in Dutch can correspond to the noun meaning either 'hope' or 'heap', or the verb meaning 'to hope', which can also take other forms such as *hopen*, *hoopt*, *hoopte* and *gehoopt* depending on person, number and tense. Our interest, from a lexicological perspective, lies more in line with studying the behaviour of the noun *hoop* and its meanings, than in conflating the noun with one of verbal forms of the homographic verb.
In that respect, a practical note is in order. The target items under study will be represented with dictionary forms in italics, followed by their approximate English translations in single quotation marks: e.g. *hoop* 'hope/heap', *heilzaam* 'healthy/beneficial', *herstructureren* 'to restructure'. Context words might be represented in figures with the stem and part-of-speech combination used by the lemmas, e.g. *word/verb*, but when mentioned in text the part-of-speech will be excluded, e.g. the passive auxiliary *word*. The English translations will belong to the same part-of-speech as the Dutch term in italics and be as unambiguous as possible. When the Dutch term and its English translations are written in the same way, no translation will be included, e.g. *journalist*.
Third, the context words at both first-order and second-order can, in principle, have any part of speech --- except for punctuation --- and must have a minimum relative frequency of 1 in 2 million (absolute frequency of 227) after discarding punctuation from the token count in the full *QLVLNewscorpus*. There are 60533 such lemmas in the corpus.
Finally, as described in Section \@ref(pmi), attraction between types were measured with `r sc("ppmi")`, computed on the full co-occurrence matrix, i.e. across the full corpus, based on a symmetric window of 4 tokens to either side, including punctuation; see @turney.pantel_2010 and @kiela.clark_2014 for alternatives. Token-level vectors are made by adding the type-level vectors of its context words.^[Alternatively, they could be multiplied or averaged, but the results were not all that different.]
For vector comparison, cosine distances were used and then transformed, as explained in Section \@ref(cosine). The transformed cosine distances were used both as input for visualization techniques and the clustering algorithm. Both non-metric `r sc("mds")` and t-`r sc("sne")` with perplexity values of 10, 20, 30 and 50 were explored, but the analyses discussed in the second part of the dissertation are based on the output from solutions with perplexity 30. Clustering was performed with `r sc("hdbscan")` setting $minPts = 8$.
### First-order selection parameters {#foc}
The immediate context of a token is the **first order context**: therefore, first-order parameters are those that influence which elements in the immediate environment of the token will be included in modelling said token. This was performed in two stages: one dependent on whether syntactic information was used, discussed in this section, and one independent of it, shown in Section \@ref(pmisel).
The decisions were based on a mix of literature [e.g. @kiela.clark_2014], tradition within the Nephological Semantics project, linguistic intuition and generalisations over the annotation of the concordance lines. As we will see in Chapter \@ref(dataset), the manual annotation procedure included selecting words in the context of each token that were the most helpful for the disambiguation. The window spans and dependency information of these chosen context words were used to inform some of the decisions below.
In a first stage, the main distinction is made between models based on bag-of-words (`BOW`), i.e. that do not care about word order or syntactic relationship, and those based on dependency (i.e. syntactic) information. Within the former group, models may vary based on whether sentence boundaries were respected, the length of the window size, and part-of-speech filters. The latter group includes models that select context words based on the distance between them and their target in terms of syntactic relationships (`(LEMMA)PATH` models), and models that find the context word that match specific, predefined templates (`(LEMMA)REL`). Each of these parameters will be described in more detail below.
The first split in `BOW` models distinguishes between those that include words outside the sentence of the target (`nobound`) and those that do not (`bound`). The goal was to make the models more comparable to dependency-based models, which by definition only include words in the same sentence as the target. However, models that only differ with respect to this parameter tend to be extremely similar.
More relevant is the window size: models can select context words on a symmetric window of 3, 5, or 10 tokens to either side of the target, including punctuation. Window sizes are typically larger for token-level models than for type-level models [e.g. @schutze_1998; @depascale_2019], but, at the same time, the great majority of the context words selected in the annotation were within the span of 3 words to either side. In practice, such a small span tends to be too restrictive.
Finally, some models refine their first-order selection with part-of-speech filters: `lex` models only include common nouns, adjectives, verbs and adverbs, while `all` models do not implement any restrictions. The selection defined for `lex` was the result of some trial and errors, but could use more refinement for future studies, e.g. expanding the lexical set to proper names, pronouns or only certain prepositions. Moreover, it could be useful to distinguish between modal verbs and auxiliaries, on one side, and other kinds of verbs, information that is not coded in the part-of-speech tags used in this corpus. In practice, `all` models tend to behave similarly to dependency-based models, while `lex` tends to be redundant with `r sc('ppmi')`-based selection, which will be described later.
Bag-of-words models will be indicated by a sequence of three values pointing to these three parameters: e.g. `bound5all` indicates a model that respects boundaries, with a window span of 5 words to each side and no part-of-speech filter.
The distinction between `BOW` and dependency-based models doesn't depend so much on which context words are selected but on how tailored the selection is to the specific
tokens. For example, a closed-class element like a preposition may be distinctive of particular usage patterns in which a term might occur. However, such a frequent, multifunctional word could easily occur in the immediate raw context of the target without actually being related to it. Unfortunately, just narrowing the window span doesn't solve the problem, since it would also drastically reduce the number of context words available for the token and for any other token in the model.
In contrast, we might also be interested in context words that are tightly linked to the target in syntactic terms but separated by many other words in between, but widening the window to include them would imply too much noise for this token and for any other token in the model.
A dependency-based model, instead, will only include context words in a certain syntactic relationship to the target, regardless of the number of words in between from a `BOW` perspective.
To exemplify, let's look at (@exgeschiedenis), where *herhalen* 'to repeat', in bold, is the target, and the items in italics where captured by a `PATH` model.
(@exgeschiedenis) `r readDutch("@exgeschiedenis")`
`r readTranslation("@exgeschiedenis")`
The `PATH` models count the steps between a target and all the words syntactically related to it and base the selection according to that distance. A one-step dependency path is either the head of the target or its direct dependent (the parent or the child, in kinship terms): in the case of (@exgeschiedenis) this includes the reflexive pronoun *zich* and the modifying adverb *werkelijk* 'really', which depend directly on it *herhalen* 'to repeat', as well as the modal *mocht*, on which the target depends.
A two-step dependency path is either the head of the head of the target (grandparent), the dependent of its dependent (grandchild), or its sibling. In (@exgeschiedenis) this includes the subject *geschiedenis* 'history', because it is linked to the target through the modal, and *Als* 'if'. All `PATH` models include the features in a one-step or two-step path from the target.
A three-step dependency path is either the head of the head of the head of the target (great-grandparent), the sibling of the head of its head (great-aunt), the dependent of the dependent of its dependent (great-grandchild), or the dependent of a sibling (niece). In (@exgeschiedenis) this corresponds to *de* 'the', which depends on *geschiedenis* 'history', and *geteld* 'counted', which *als* 'if' depends on. `PATHselection2` models do not include the three-steps path, and none of the `PATH` models include context words beyond these steps. The threshold was set based on the most frequent syntactic distance between the lemmas from the case studies and the context words selected as relevant for disambiguation. Next to `selection2`, `PATH` models take two more formats. While `selection3` models include context words up to 3 steps away from the target, `PATHweight` models also incorporate the distance information and give more weight to context words that are more directly closely to the target in the syntactic path.
Finally, `REL` models base their selection on specific, predefined patterns. For these purpose, templates tailored to the parts of speech of the target were designed, based on the relationships between the annotated types and the context words selected as most informative during the annotation process. The most restrictive model, `RELgroup1`, selects the following patterns:
- For nouns: `r lrel$nouns[[1]]`
- For adjectives: `r lrel$adjectives[[1]]`
- For verbs: `r lrel$verbs[[1]]`
It is typically too restrictive: for many lemmas, it is responsible for the loss of a large proportion of tokens which do not have context words that match these patterns, while the remaining tokens often have only one or two context words left. The `RELgroup2` models expand the selection as follows:
- For nouns: `r lrel$nouns[[2]]`
- For adjectives: `r lrel$adjectives[[2]]`
- For verbs: `r lrel$verbs[[2]]`
Finally, nouns also have a `RELgroup3` setting that incorporates the following relations:
- `r lrel$nouns[[3]]`
All the first-order parameters procure filters to select the context words in the environment of each token that will be used to model it. Alternatively, dependency information could have been included as a feature or dimension. For example, instead of selecting *zich* 'itself' as context word of the token in (@exgeschiedenis) based on its bag-of-word distance, part-of-speech filter or dependency relation to the target, we could use `(zich, se)` i.e. "has *zich* as reflexive subject" as a first-order feature. Its type-level vector then would have information on all the other verbs that take *geschiedenis* 'history' as its subject. For technical and practical reasons, this was not implemented in the studies discussed here, but would be a fruitful path for further research.
In the remainder of this dissertation, `BOW` will be used to refer to all bag-of-words based models, as opposed to the dependency-based models; `PATH` and `REL` will also be umbrella terms for the models that use the different kinds of dependency-based selection, and more specific terms, e.g. `PATHweight` will be used for finer grained distinctions.
### PPMI selection and weighting {#pmisel}
The `PPMI` parameter^[I use verbatim to refer to this parameter for two reasons. First, because, like `PATH`, it is also a value: the parameter itself is the association-strength-based filtering or weighting, but it was fixed to `r sc("ppmi")`. Second, this way it is easier to distinguish the parameter setting from `r sc("ppmi")` as a measure. It is not the best idea, but so it is coded into the current names of the models.] is taken outside the set of first-order parameters because it applies to both `BOW` and dependency-based models, although it also affects the selection of first order context words. The rationale behind it is that words in the vicinity of the target token, regardless of their part-of-speech and distance, are not equally informative of the meaning of the target. For example, in (@exgeschiedenis) *geschiedenis* 'history' and *zich* 'itself' are more informative of the meaning of *herhalen* 'to repeat' than *werkelijk* 'really' or *als* 'if'. Association strength measures like `r sc("ppmi")` could then be used to give more influence to the more informative context words; indeed, given a symmetric windowsize of 4 for the `r sc("ppmi")` computation in the *QLVLNewsCorpus*, the `r sc("ppmi")` of *geschiedenis* 'history' and *zich* 'itself' with *herhalen* 'to repeat' are 3.79 and 1.97 respectively, while the values for *werkelijk* 'really' and *als* 'if' are 0.06 and 0.112.
@heylen.etal_2015 weight the contribution of each context word by their `r sc("ppmi")` with the target, and @depascale_2019 adds `r sc("ppmi")` and `r sc("llr")` (log-likelihood ratio) thresholds to the selection of context words. However, these measures are meant to represent the relationship *between* types, not to distinguish between senses of the same type: a context word may be indicative of a sense of a word and yet not be particularly attracted to the word as a whole. An example is the English verb *to go*, which due to its high frequency does not have a strong attraction to the noun *church*, and yet is necessary to distinguish the specific sense of 'religious service' in *to go to church*.
For that reason, models can take three different settings in relation to the `PPMI` parameter: `weight`, `selection` and `no`. Both `weight` and `selection` apply an additional filter to the output from the first-order parameters and only select the context words with a positive `r sc('pmi')` with the target. They are distinct from the `PPMIno` models, which do not apply such thresholds. The difference between the first two is that `weight` also multiplies the type-level vector of each context word by their `r sc('pmi')` with the target, giving words that are more strongly associated to the target type a greater impact in the final vector of the target. The three settings are applied to each of the models resulting from the first-order combinations, with one exception: `PATHweight` models do not combine with `PPMIweight`.
### Second-order selection {#soc}
The selection of second-order features influences the shape of the vectors: how the selected first-order features are represented. Next to the fixed window size and association measure used to calculate the values of the vectors, there are two variable parameters. First, a part-of-speech filter may be applied. When its value is `nav`, second-order features are extracted from a pool of 13771 nouns, adjectives and verbs used in @depascale_2019^[The selection was originally made to ensure an unbiased regional distribution of the vectors for the lectometric studies performed in @depascale_2019.]. The alternative, `all`, applies no further filters. Second, we might reduce the length of the vector, i.e. the number of second-order features. One of the values, `5000`, selects the 5000 most frequent features from the pool remaining after the part-of-speech filter. Pilot studies have also explored models with 10000 dimensions, but they are very similar to the ones with 5000 dimensions.^[@kiela.clark_2014 discourage using vectors with more than `r format(50000, big.mark = ",")` dimensions.] The other value for the vector length is `FOC`, which stands for "first-order context", and it uses the union of first-order context words for all tokens as second-order dimensions. As a consequence, the second-order dimensions are tailored to the context of the sample, not necessarily so frequent, and their numbers remain in the hundreds, rarely surpassing 1500. In practice, there is not much of a difference between models with different second-order parameters, except for `5000all` models, which tend to perform the worst. Examination of the distance matrices between the type-level vectors of the context words reveals that the cosine distances between all of them are really large, probably due to the sparseness of the vectors. I that sense, it would be interesting to compare `r sc("svd")` matrices based on the `5000` models with the already smaller (and presumably denser) `FOC` models.
## The chosen ones: PAM {#pam}
The multiple variable parameters return a large number of models: 212 for each of the nouns --- because of the additional `REL` templates --- and 200 for verbs and adjectives. As we will see in Chapter \@ref(nephovis), we can combine distances between the models with dimensionality reduction techniques to represent the similarities between the models on a `r sc("2d")` space. In addition, if we only wanted to evaluate the models in relation to the manual annotation, we could rank the accuracy of their clustering solutions. However, if we want to understand the qualitative effect of the parameter settings on the modelling, and especially if we do not consider the manual annotation as a ground truth, we need to examine clouds individually, and it is not feasible for a human to look at each of the hundreds of models of each lemma.
One approach for an efficient exploration of the parameter space is to identify the settings that make the greater differences between models. For example, if we see that models with different `PPMI` settings are more different from each other than models with different vector-length settings, we would prioritize looking at models that differ on the former parameter, setting the latter to a constant value. Unfortunately, the quantitative effect of parameters is not so straightforward. First, the parameters that tend to make a big difference in the modelling include the choice between dependency and `BOW` and, within it, both window size and part-of-speech filters, as well as the distinction between `REL` and `PATH`. The resulting combinations are still too numerous to examine simultaneously (see Chapter \@ref(nephovis)). Second, the relevant parameters interact with each other: `PPMI` often makes little difference among `lex` models --- it tends to be redundant, since the open-class items captured by `lex` tend to have higher `r sc("ppmi")` --- but it makes a greater difference among `all` or dependency-based models. Finally, the various parameter settings do not have the same impact within each lemma, so they have to be revised for each of the lemmas under study.
The approach based on the quantitative effect of parameter settings on the distances between models does reduce the number of models to examine, but not to a great degree. Given the limited number of models that we can look at simultaneously while still making sense of them --- around 8 or 9 --- and the need to cover multiple combinations of these strong parameters, we would still need to look at four or five partially overlapping sets of 8-9 models per lemma. For example, a set of 9 models could be generated by taking `bound3` and `bound10` models with `PPMIweight` and `FOCnav` second-order vectors, in order to look at the effect of part-of-speech filter with little window-size variation, `PATH` and `REL`. Then, `PPMIweight` could be switched for `PPMIno` to look at the effect in the new conditions, resulting in 9 other models. If the effect is indeed different, which is likely, a different set of 8-9 models could then be generated with different values of `PPMI`, while keeping the part-of-speech to a constant value. These groups are not maximally different from each other: due to the interaction between parameters, many models are extremely similar, and a proper qualitative description becomes challenging. Moreover, a given set of models could reveal a pattern that was not captured in a previous set of models, and the researcher might want to go back and look for it.
An alternative approach is to use a clustering algorithm that, next to selecting groups of similar models, identifies the models that represent each of the clusters. **Partition Around Medoids**, or `r sc("pam")` [@kaufman.rousseeuw_1990], implemented in R with `cluster::pam()` [@R-cluster], does exactly that. Unlike `r sc('hdbscan')` and other clustering algorithms, it requires us to set a number of clusters beforehand, and then tries to find the organization that maximizes internal similarity within the cluster and distances with other clusters.
For our purpose, we have settled for 8 medoids for each lemma. The number is not meant to achieve the best clustering solutions --- no number could be applied to all the lemmas with equal success, given their variability in the differences between the models. The goal, instead, is to have a set of models that is small enough to visualize simultaneously (on a screen, in reasonable size) and big enough to cover the variation across models. For some lemmas, there might not be that much variation, and the **medoids**, i.e. the representative models, might be redundant with each other. However, as long as we can cover (most of) the visible variation across models and the medoids are reasonably good representatives of the models in their corresponding clusters, the method is fulfilling its goal.
The representativeness of medoids for the lemmas studied here has been tested in different ways.
<!-- Silhouette values --- a common measure of cluster quality --- have been computed, but they are not really relevant (nor high, I'm afraid). -->
We don't require the clusters of models to be different from each other, as long as the medoids represent them properly. Instead, the priority was to check for patterns within the models represented by each medoid, e.g. in terms of accuracy towards annotated senses. For example, if a medoid tends to group senses together very well (measured for example with `kNN` and `SIL`, as explained in Chapter \@ref(shapes) applied to clustering solutions), the models it represents have similar tendencies as well.
More importantly, different patterns previously identified in the plots while exploring the models with the first approach were looked for in the medoid selection, to corroborate that the medoids covered at least as much variation as the more time- and energy-consuming approach. All such patterns were found. In addition, small random samples within each cluster of models were visually scanned --- but not thoroughly examined --- to assess their similarity to their representative medoid. In the great majority of the cases the comparison was satisfactory. This has a wonderful effect on the visual exploration, because it lets us focus on 8-9 models that are quite different from each other instead of multiple sets of models with less variation. Visually, the medoids approach is more informative and less tiresome.
As a result from these explorations, the qualitative analyses will be based on medoids: representative models selected by `r sc("pam")`. While this is a clustering algorithm, in order to avoid confusion with clusters of tokens, which take centre stage, I will avoid referring to the clusters of models as such --- or, if I do, I will specify that they are clusters *of models*. The preferred name will be "the models represented by the medoid". Given that the only clustering algorithm used on the tokens is `r sc("hdbscan")`, *medoid* will always refer to a representative model.
## Summary {#workflow-summary}
The process through which token-level vector space models and the clouds studied here in particular are created, takes a number of transformative steps. In this chapter we have broken down this process and detailed the layers of mathematical and linguistic processing lying between the raw corpus and the final clouds. Next to an overall description of the workflow, the technical background of the most important aspects was introduced in some detail. Afterwards, I explained the parameter settings that characterize the models analyzed in this project.
Choices have been made and alternatives have been suggested: the path taken here was one out of so many possible alternatives. In fact, at the core of this research project is the exploration of alternatives, the investigation of the effect of the variable parameters on the final linguistic representation, and the search for clues, guidelines, a recipe for the clouds we seek. This exploration combines quantitative techniques --- the heart of the process of cloud creation --- with qualitative analyses meant to describe what and how the clouds are really modelling.
By combining the vector representations with visualization techniques and/or clustering algorithms, we can make sense of patterns that would otherwise escape us. Visual analytics provides us with tools to explore the output in comfortable, intuitive --- but sometimes deceiving --- ways. In the next chapter, we will look at the two visualization tools developed within the larger project of Nephological Semantics to enable and support these qualitative analyses.