# Visualization tools {#nephovis}
Clouds are the prime matter of this study. They are condensed, information-rich representations of patterns found in a corpus and should, according to the Distributional Hypothesis, tell us something about the meaning of the words under examination. But they don't tell us anything by themselves: we need to develop and implement tools to extract this information. Chief among these tools is a web-based visualization tool [@montes.qlvl_2021], originally developed by Thomas Wielfaert within the Nephological Semantics project [see @wielfaert.etal_2019] and then continued by myself^[The GitHub repository is linked to Zenodo, so that the released versions can be stored and identified with a `r sc("doi")`. Unfortunately, even though the foundations of the code were laid by Thomas Wielfaert, because of how the current repository came to be, he has no history as a contributor and is therefore not listed as an author in the tool's citation.]. In this chapter we will present its rationale and the features it offers, as an elaboration of @montes.heylen_2022.
Section \@ref(nepho-overview) will offer an overview of the rationale behind the tool and the minimal path that a researcher could take through its levels. Sections \@ref(nepho1) through \@ref(nepho3) will zoom in on each of the levels, describing the current features and those that are still waiting on our wish list. Section \@ref(shiny) follows with the description of a ShinyApp [@R-shiny]: an extension^[Currently available at https://marianamontes.shinyapps.io/Level3/] to the third level of the visualization with additional features tailored to exploring the relationship between the `r sc("2d")` representations and the `r sc("hdbscan")` output. Finally, we conclude with a summary in Section \@ref(nepho-summary).
## Flying through the clouds {#nepho-overview}
The visualization tool described here, which I will call *NephoVis*, was written in Javascript, making heavy use of the [`r sc("d3")`.js](https://d3js.org) library, which was designed for beautiful web-based data-driven visualization [@bostock.etal_2011]. The `r sc("d3")` library allows the designer to link elements on the page, such as circles in an `r sc("svg")`, dropdown buttons and titles, to data structures such as arrays and data frames, and manipulate the visual elements based on the values of the linked data items. In addition, it offers handy functions for scaling and mapping, i.e. to fit the relatively arbitrary ranges of the coordinates to pixels on a screen, or to map a colour palette^[While `r sc("d3")` offers a variety of useful colour palettes, the visualization currently relies on a --- slightly adapted --- colorblind-friendly scale by @okabe.ito_2002. Most of the figures in this dissertation use the same palette by default, via the R package `colorblindr` [@R-colorblindr].] to a set of categorical values.
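The scaling and mapping logic can be illustrated outside `r sc("d3")` as well. The following Python sketch mimics the behaviour of `d3.scaleLinear` and `d3.scaleOrdinal` (the function names, ranges and colour values here are purely illustrative, not the tool's actual code):

```python
def linear_scale(domain, rng):
    """Map values from `domain` to `rng`, like d3.scaleLinear():
    arbitrary coordinate ranges become pixel positions."""
    (d0, d1), (r0, r1) = domain, rng
    return lambda x: r0 + (x - d0) / (d1 - d0) * (r1 - r0)

def ordinal_scale(categories, palette):
    """Map each categorical value to a colour, like d3.scaleOrdinal()."""
    return dict(zip(categories, palette)).get

# Fit arbitrary NMDS coordinates in [-2, 2] onto an 800-pixel-wide SVG
to_px = linear_scale((-2.0, 2.0), (0, 800))
# Two illustrative Okabe-Ito colours for a binary parameter
colour = ordinal_scale(["bound", "nobound"], ["#E69F00", "#56B4E9"])
```

A coordinate of 0 then lands on pixel 400, the middle of the plot, and every `bound` model is drawn in the same orange.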
As we have seen in Chapter \@ref(workflow), the final output of the modelling procedure is a `r sc("2d")` representation of distances between tokens, which can be visualized as a scatterplot. Crucially, we are not only interested in exploring individual models, but in comparing a range of models generated by variable parameters. Section \@ref(cosine) discussed a procedure to measure the distance between models, which can be provided as input for non-metric `r sc("mds")`, and Section \@ref(pam) presented the technique used to select representative models, or medoids. As a result, we have access to the following datasets for each of the lemmas:
- A distance matrix between models.
- A data frame with one row per model, the `r sc("nmds")` coordinates based on the distance matrix, and columns coding the different variable parameters or other pieces of useful information, such as the number of modelled tokens.
- A data frame with one row per token, `r sc("2d")` coordinates for each of their models and other information such as sense annotation (see Chapter \@ref(dataset)), country, type of newspaper, selection of context words and concordance line.
- A data frame with one row per first-order context word and useful frequency information.
In practice, the data frame for the tokens is split into multiple data frames with coordinates corresponding to different dimensionality reduction algorithms, such as `r sc("nmds")` and t-`r sc("sne")` with different perplexity values, and another data frame for the rest of the information. In addition, one of the most recent features of the visualization tool is the possibility of comparing an individual token-level model with the representation of the type-level modelling of its first-order context words. However, this feature is still under development within NephoVis and can be better explored in the ShinyApp extension (Section \@ref(shiny)).
In order to facilitate the exploration of all this information, NephoVis is organized in three levels, following Shneiderman's Visual Information Seeking Mantra: “Overview first, zoom and filter, then details-on-demand” [-@shneiderman_1996,97]. The core of the tool is the interactive, zoomable scatterplot, but its goal and functionality are adapted to each of the three levels.
In Level 1 the scatterplot represents the full set of models and allows the user to explore the quantitative effect of different parameter settings and to select a small number of models for detailed exploration in Level 2.
This second level shows multiple token-level scatterplots --- one for each of the selected models --- and therefore offers the possibility to compare the shape and organization of the groups of tokens across different models. By selecting one of these models, the user can examine it in Level 3, which focuses on only one at a time. @shneiderman_1996's mantra underlies both the flow across levels and the features within them: each level is a zoomed-in, filtered version of the level before it; the individual plots in Levels 1 and 3 are literally zoomable; and in all cases it is possible to select items for more detailed inspection. Finally, a number of features --- tooltips and pop-up tables --- show details on demand, such as the names of the models in Level 1 and the context of the tokens in the other two levels.
(ref:nepho-index) Portal of [https://qlvl.github.io/NephoVis/](https://qlvl.github.io/NephoVis/) as of July 2021.
```{r, nepho-index, fig.cap = "(ref:nepho-index)", out.width="100%"}
cloud_foto("nephovis-index")
```
Currently, [https://qlvl.github.io/NephoVis/](https://qlvl.github.io/NephoVis/) hosts the portal shown in Figure \@ref(fig:nepho-index), which eventually leads the user to the Level 1 page for the lemma of their choice^[By knowing the lemma, it is possible to go directly to the Level 1 page by replacing *lemma* in https://qlvl.github.io/NephoVis/level1.html?type=lemma with the corresponding name of the lemma, e.g. *heffen*.], shown in Figure \@ref(fig:nepho1-basic) and described in more detail in Section \@ref(nepho1). By exploring the scatterplot of models, the user can look for structure in the distribution of the parameters on the plot.
For example, colour-coding may reveal that models with nouns, adjectives, verbs and adverbs as first-order context words (`lex`) are very different from those without strong filters for part-of-speech, because mapping these values to colours reveals distinct groups in the plot. In contrast, mapping the sentence boundaries restriction (`bound`/`nobound`) might result in a mix of dots of different colours, like a fallen bag of `r sc("m&m")`'s, meaning that the parameter makes little difference. Depending on whether the user wants to compare models similar to or different from each other, or which parameters they would like to keep fixed, they will use individual selection or the buttons to choose models for Level 2. The
`r fontawesome::fa("filter")` **Select medoids**
button^[Incorporating this feature is less scalable than the dropdown menus or even the checkbox buttons; it works with the current pipeline, but is not so straightforward to adapt to new data that does not follow the exact same pipeline. Making the features flexible enough to allow for missing data frames is one of the items on the wish list. Ideally, future versions will also implement algorithms such as `r sc("pam")` to compute on the fly.] quickly identifies the predefined medoids. By clicking on the
`r fontawesome::fa("arrow-alt-circle-down")` **LEVEL 2** button,
Level 2 is opened in a new tab, as shown in Figure \@ref(fig:nepho2-basic).
In Level 2, the user can already compare the shapes that the models take in their respective plots, the distribution of categories like sense labels, and the number of lost tokens. If multiple dimensionality reduction techniques have been used, the
`r fontawesome::fa("list-ul")` **Switch solution** button allows the user to select one and watch the models readjust to the new coordinates in a short animation. In addition, the
`r fontawesome::fa("sticky-note")` **Distance matrix** button offers a heatmap of the pairwise distances between the selected models.
Section \@ref(nepho2) will explore the most relevant features that aid the comparison across models, such as brushing sections of a model to find the same tokens in different models and accessing a table with frequency information of the context words co-occurring with the selected tokens. Either by clicking on the name of a model or through the
`r fontawesome::fa("arrow-alt-circle-down")` **Go to model** dropdown menu, the user can access Level 3 and explore the scatterplot of an individual model. As Section \@ref(nepho3) will show, Level 2 and Level 3, both built around token-level scatterplots, share a large number of functionalities. The difference lies in the possibility of examining features particular to a model, such as reading annotated concordance lines highlighting the information captured by the model or selecting tokens based on the words that co-occur with them. In practice, the user would switch back and forth between Level 2 and Level 3: between comparing a number of models and digging into particular models.
(ref:nepho1-basic) [Level 1 for *heffen* 'to levy/to lift'](https://qlvl.github.io/NephoVis/level1.html?type=heffen).
```{r, nepho1-basic, fig.cap = "(ref:nepho1-basic)", out.width="100%"}
cloud_foto("nephovis-level1-base")
```
(ref:nepho2-basic) Level 2 for the medoids of *heffen* 'to levy/to lift'.
```{r, nepho2-basic, fig.cap = "(ref:nepho2-basic)", out.width="100%"}
cloud_foto("nephovis-level2-base")
```
Before going into the detailed description of each level, a note is in order. As already mentioned in Section \@ref(dim-reduction), the dimensions resulting from `r sc("nmds")` --- used in all levels --- and t-`r sc("sne")` --- used in levels 2 and 3 --- are not meaningful. In consequence, there are no axes or axis ticks in the plots. However, the units are kept constant within each plot: one unit on the $x$-axis has the same length in pixels as one unit on the $y$-axis within the same scatterplot; this equality, however, does not hold across plots. Finally, the code is open-source and available at https://github.com/qlvl/NephoVis.
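The equal-unit constraint within a plot amounts to deriving a single pixels-per-unit factor from whichever axis is more constrained. A minimal sketch of that idea (the function name and values are illustrative, not the tool's implementation):

```python
def equal_unit_scales(xs, ys, width, height):
    """Return (x -> px, y -> px) functions that share one
    pixels-per-unit factor, so that a unit step is the same
    length in pixels on both axes of the scatterplot."""
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    # The more constrained axis dictates the common factor.
    ppu = min(width / (x1 - x0), height / (y1 - y0))
    return (lambda x: (x - x0) * ppu), (lambda y: (y - y0) * ppu)
```

Because `ppu` depends on each plot's own coordinate ranges, the factor differs between plots, which is why the equality of units does not carry over across scatterplots.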
## Level 1 {#nepho1}
The protagonist of Level 1 is an interactive zoomable scatterplot where each glyph, by default a steel blue wye ("Y"), represents one model. This scatterplot aims to represent the similarity between models as coded by the `r sc("nmds")` output and allows the user to select the models to inspect according to different criteria. Categorical variables (e.g. whether sentence boundaries are used) can be mapped to colours and shapes, as shown in Figure \@ref(fig:nepho1-colours), and numerical variables (e.g. number of tokens in the model) can be mapped to size.
(ref:nepho1-colours) [Level 1 for *heffen* 'to levy/to lift'](https://qlvl.github.io/NephoVis/level1.html?type=heffen); the plot is colour-coded with first-order part-of-speech settings; `NA` stands for the dependency-based models.
```{r, nepho1-colours, fig.cap = "(ref:nepho1-colours)", out.width="100%"}
cloud_foto("nephovis-level1-colours")
```
A selection of buttons on the left panel, as well as the legends for colour and shape, can be used to filter models with a certain parameter setting. These options are generated automatically by reading the columns in the data frame of models and interpreting column names starting with `foc_` as representing first-order parameter settings, and those starting with `soc_` as second-order parameter settings. Different settings of the same parameter interact with an `OR` relationship, since they are mutually exclusive, while settings of different parameters combine in an `AND` relationship. For example, by clicking on the grey `bound` and `lex` buttons on the bottom left, only `BOW` models with part-of-speech filter and sentence boundary limits^[Notice that `bound` itself, while a `BOW` parameter value, also includes the dependency-based models, since they are automatically limited to sentence boundaries.] will be selected. By clicking on both `lex` and `all`, all `BOW` models are selected, regardless of the part-of-speech filter, but dependency-based models (for which part-of-speech is not relevant) are excluded. A counter above, circled in Figure \@ref(fig:nepho1-selection), keeps track of the number of selected items, since Level 2 only allows up to 9 models for comparison^[The original design, found in http://tokenclouds.github.io/LeTok/, allowed for larger selections; only 9 models would be actually shown in Level 2, but it would also be possible to remove some of them and make room for the models left on the waiting list. This makes sense when models are selected individually and in a particular order, i.e. by clicking on them, but not so much for selections based on other criteria that we want to explore simultaneously.]. This procedure is meant to aid a selection based on relevant parameters, as described in Section \@ref(pam).
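The `OR`-within-a-parameter, `AND`-across-parameters logic can be sketched in a few lines of Python (the column names and rows below are illustrative; NephoVis itself does this in Javascript on the models' data frame):

```python
def select_models(models, selections):
    """Filter model rows: values of the same parameter combine
    with OR, different parameters combine with AND.
    `models` is a list of dicts; `selections` maps a parameter
    (column) to the set of values the user has clicked."""
    return [m for m in models
            if all(m.get(param) in values
                   for param, values in selections.items())]

models = [  # illustrative rows; real columns are read from the data frame
    {"foc_context": "BOW", "foc_pos": "lex", "foc_win": 5},
    {"foc_context": "BOW", "foc_pos": "all", "foc_win": 10},
    {"foc_context": "dependency", "foc_pos": None, "foc_win": None},
]

# `lex` OR `all` within foc_pos, AND restricted to BOW models:
picked = select_models(models, {"foc_context": {"BOW"},
                                "foc_pos": {"lex", "all"}})
```

Here `picked` contains both `BOW` models but excludes the dependency-based model, mirroring the behaviour of clicking `lex` and `all` together.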
In Figure \@ref(fig:nepho1-selection), instead, the `r fontawesome::fa("filter")` **Select medoids** button was used to quickly capture the medoids obtained from `r sc("pam")`.
Models can also be manually selected by clicking on the glyphs that represent them.
(ref:nepho1-selection) [Level 1 for *heffen* 'to levy/to lift'](https://qlvl.github.io/NephoVis/level1.html?type=heffen) with medoids highlighted.
```{r, nepho1-selection, fig.cap = "(ref:nepho1-selection)", out.width="100%"}
cloud_foto("nephovis-level1-selection")
```
## Level 2 {#nepho2}
Level 2 shows an array of small scatterplots, each of which represents a token-level model. The glyphs, by default steel blue circles, stand for individual tokens, i.e. attestations of the chosen lemma in a given sample. The original code for this level was inspired by [Mike Bostock's brushable scatterplot matrix](https://bl.ocks.org/mbostock/3213173), but it is not a scatterplot matrix itself, and its current implementation is somewhat different.
The dropdown menus on the sidebar (Figure \@ref(fig:nepho2-basic)) read the columns in the data frame of variables, which can include any sort of information for each of the tokens, such as sense annotation, sources, number of context words in a model, concordance lines, etc. Categorical variables can be used for colour- and shape-coding, as shown in Figure \@ref(fig:nepho2-colour), where the senses of the chosen lemma are mapped to colours; numerical variables, such as the number of context words selected by a given lemma, can be mapped to size. Note that the mapping will be applied equally to all the selected models: the current code does not allow for variables --- other than the coordinates themselves --- to adapt to the specific model in each scatterplot. That is the purview of Level 3.
Before further examining the scatterplots, a small note should be made about the distance matrix mentioned above. The heatmap corresponding to the medoids of *heffen* 'to levy/to lift' is shown in Figure \@ref(fig:nepho2-heatmap).
The `r sc("nmds")` representation in Level 1 tried to find patterns and keep the relative distances between the models as faithful to their original positions as possible, but such a transformation always loses information. Given a restricted selection of models, however, the actual distances can be examined and compared more easily. A heatmap maps the range of values to the intensity of the colours, making patterns of similar/different objects easier to identify. For example, Figure \@ref(fig:nepho2-heatmap) shows that the sixth medoid is very different from all the other medoids except the seventh, and that the second medoid is quite different from all the others except the first. Especially when the model selection followed a criterion based on strong parameter settings, e.g. keeping `PPMI` constant to look at the interaction between window size and part-of-speech filters, such a heatmap could reveal patterns that are slightly distorted by the dimensionality reduction in Level 1 and even hard to pinpoint from visually comparing the scatterplots. But even with the medoid selection, which aims to find representatives that are maximally different from each other (or at least that are the core elements of maximally different *groups*), the heatmap can show whether some medoids are drastically *more* different, or conversely, similar to others.
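The mapping behind such a heatmap is simple: rescale the observed distances onto a colour-intensity range, so that the largest distance gets the most intense colour. A minimal Python sketch of that rescaling (the function name and 0--255 range are illustrative):

```python
def heatmap_intensities(dist, low=0, high=255):
    """Rescale a square distance matrix to colour intensities:
    the smallest distance maps to `low`, the largest to `high`."""
    flat = [d for row in dist for d in row]
    dmin, dmax = min(flat), max(flat)
    span = (dmax - dmin) or 1  # guard against a constant matrix
    return [[round(low + (d - dmin) / span * (high - low)) for d in row]
            for row in dist]

# Toy 2x2 example: distances 0, 1 and 2 become intensities 0, 128, 255
cells = heatmap_intensities([[0, 1], [1, 2]])
```

Unlike the `r sc("nmds")` projection, this transformation is monotonic and lossless up to rounding, which is why the heatmap can recover contrasts that the scatterplot distorts.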
As a reference, the heatmap is particularly useful to check hypotheses about the visual similarity of models. For example, unlike with *heffen* 'to levy/to lift' in Figure \@ref(fig:nepho2-colour), if we colour-code the medoids of *haten* 'to hate' with the manual annotation (Figure \@ref(fig:nepho2-haten)), all the models look equally messy. As we will see below, we can brush over sections of the plot to see if, at least, the tokens that are close together in one medoid are also close together in another (spoiler alert: not the case). The heatmap of distances confirms that not all models are equally different from each other, but indeed, each of them is messy in its own particular way.
(ref:nepho2-heatmap) Heatmap of distances between medoids of *heffen* 'to levy/to lift'.
```{r, nepho2-heatmap, fig.cap = "(ref:nepho2-heatmap)", out.width="100%"}
cloud_foto("nephovis-level2-heatmap")
```
(ref:nepho2-haten) 2D representation of medoids of *haten* 'to hate', colour-coded with senses, next to the heatmap of distances between models.
```{r, nepho2-haten, fig.cap = "(ref:nepho2-haten)", out.width="100%"}
cloud_foto("nephovis-level2-haten", "png")
```
Next to the colour-coding, Figure \@ref(fig:nepho2-colour) also illustrates how hovering over a token shows the corresponding identifier^[The identifier of a token includes four main pieces of information separated by slashes. The first two, stem and part-of-speech (*hef* and *verb* in the example), indicate the target lemma. The third section points to the filename from which the token was extracted. The filenames from this corpus have at least three sections split by underscores: the name of the newspaper (*De Volkskrant*), the date of publication in `r sc("yyyy-mm-dd")` format (2001-06-21) and the number of the article, among those harvested for the corpus (36). The final part points to the index of the token in the article including punctuation: in this case, the word form *hieven* (third person plural preterite of *heffen*, '(they) lifted') around which the concordance line is built is the 163rd token in its file.] and concordance line. Figure \@ref(fig:nepho2-selection), on the other hand, showcases the brush-and-link functionality. By brushing over a specific model, the tokens found in that area are highlighted and the rest are made more transparent. Such a functionality is missing from Level 1, but is also available in Level 3. Level 2 enhances the power of this feature by selecting the same tokens in the rest of the models, whatever area they occupy. Thus, we can see whether tokens that are close together in one model are still close together in a different model, which is especially handy in more uniform plots, like the one for *haten* 'to hate' in Figure \@ref(fig:nepho2-haten). Figure \@ref(fig:nepho2-selection) reveals that the tokens selected in the second medoid are, indeed, quite well grouped in the other five medoids around it, with different degrees of compactness. It also highlights two glyphs on the right margin of the bottom right plot.
In Level 2, this margin gathers all the tokens that were selected for modelling but were lost by the model in question due to lack of context words. In this case medoid 6, with a combination of `bound3lex` and `PPMIselection`, is extremely selective, and for a few tokens no context words could be captured.
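The token identifiers described in the footnote above follow a fixed pattern that is easy to decompose. A sketch of the parsing, in Python (the example identifier string is reconstructed from the description and is illustrative, not a verbatim corpus entry):

```python
def parse_token_id(token_id):
    """Split an identifier of the form stem/pos/filename/index.
    The filename itself encodes newspaper, publication date
    (YYYY-MM-DD) and article number, separated by underscores
    (naming scheme assumed from the description)."""
    stem, pos, filename, index = token_id.split("/")
    newspaper, date, article = filename.split("_")[:3]
    return {"stem": stem, "pos": pos, "newspaper": newspaper,
            "date": date, "article": int(article), "index": int(index)}

info = parse_token_id("hef/verb/volkskrant_2001-06-21_36/163")
```

From this single string, the tooltip can thus recover the lemma, the source article and the exact position of the token within it.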
(ref:nepho2-colour) Level 2 for the medoids of *heffen* 'to levy/to lift', colour-coded with categories from manual annotation. Hovering over a token shows its concordance line.
```{r, nepho2-colour, fig.cap = "(ref:nepho2-colour)", out.width="100%"}
cloud_foto("nephovis-level2-colour-and-tooltip", "png")
```
(ref:nepho2-selection) Level 2 for the medoids of *heffen*, colour-coded with categories from manual annotation. Brushing over an area in a plot selects the tokens in that area and their positions in other models.
```{r, nepho2-selection, fig.cap = "(ref:nepho2-selection)", out.width="100%"}
cloud_foto("nephovis-level2-selection")
```
In any given model, we expect tokens to be close together because they share a context word, and/or because their context words are distributionally similar to each other: their type-level vectors are near neighbours. Therefore, when inspecting a model, we might want to know which context word(s) pull certain tokens together, or why tokens that we expect to be together are far apart instead. For individual models, this can be best achieved via the ShinyApp described in Section \@ref(shiny), but NephoVis also includes features to explore the effect of context words, such as frequency tables.
In Level 2, while comparing different models, the frequency table has one row per context word and one or two columns per selected model, e.g. the medoids. Such a table is shown in Figure \@ref(fig:nepho2-table).
The columns in this table are all computed by NephoVis itself based on lists of context words per token per model. Next to the column with the name of the context word, the default table shows two columns called "total" and two per model, headed by the corresponding number and either a "+" or a "-" sign. The "+" columns indicate how many *of the selected tokens* in that model co-occur with the word in the row; the "-" columns indicate the number of non-selected tokens that co-occur with the word. The "total" columns indicate, respectively, the number of selected or non-selected tokens for which that context word was captured by at least one model.
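The computation of the "+" and "-" columns can be summarized in a short Python sketch (the data structure and model/token names are illustrative; NephoVis computes this in Javascript from the per-model lists of captured context words):

```python
def frequency_table(captured, selected):
    """Build the per-model '+'/'-' counts of Level 2's frequency table.
    `captured` maps model -> {token_id: set of context words captured
    by that model for that token}; `selected` is the set of
    highlighted token ids."""
    table = {}
    for model, per_token in captured.items():
        for token, words in per_token.items():
            for w in words:
                cell = table.setdefault(w, {}).setdefault(model, [0, 0])
                cell[0 if token in selected else 1] += 1
    return table  # table[word][model] == [plus, minus]

caps = {"m1": {"t1": {"glas"}, "t2": {"glas"}, "t3": {"glas"}},
        "m2": {"t1": {"glas"}, "t2": set(), "t3": {"glas"}}}
tbl = frequency_table(caps, selected={"t1", "t2"})
```

In this toy example, `glas` gets "+2/-1" under the first model but only "+1/-1" under the second, because the stricter second model missed the co-occurrence in one of the selected tokens.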
Here it is crucial to understand that, when it comes to distributional modelling, a **context word** is not simply a word that can be found in the concordance line of the token, but an item captured by a given model. Therefore, a word can be a context word in a model, but be excluded by a different model with stricter filters. For example, the screenshot^[The full picture is impractical to include in a printed text; it is recommended to explore the tool interactively instead.] in Figure \@ref(fig:nepho2-table) gives us a glimpse of the frequency table corresponding to the tokens selected already in Figure \@ref(fig:nepho2-selection). The most frequent context word for the 31 selected tokens, i.e. the first row of the table, is the noun *glas* 'glass', which is used in expressions such as *een glas heffen op iemand* 'to toast for someone, lit. to lift a glass on someone'. The columns for models 1 and 2 show that *glas* 'glass' was captured by those models for all 31 selected tokens. In column 3, however, the positive column reads 29, which indicates that the model missed the co-occurrence of *glas* 'glass' in two of the tokens. The names on top of the plots reveal that the first two models have a window size of 10, while the third restricts it to 5, meaning that in the two missed tokens *glas* 'glass' occurs 6 to 10 slots away from the target. These are most likely the orange tokens a bit far to the right of the main highlighted area in the third plot. Finally, in the fourth model, which is hidden behind the table, *glas* 'glass' is missed in one of the 31 tokens but captured in 2 tokens that were excluded from the selection. If we scrolled the table we would see that this is a `PATHweight` model: the missed co-occurrence must be within the `r sc("bow")` window span but too far in the dependency path, while the two captured co-occurrences in the "-" column must be within three steps of the dependency path but beyond the `r sc("bow")` window span of 10.
This useful frequency information is available for all the context words that are captured by at least one model in any of the selected tokens. In addition, the **Select information** dropdown menu gives access to a range of transformations based on these frequencies, such as odds ratio, Fisher Exact and cue validity.
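Two of these transformations can be sketched from the "+"/"-" counts alone. The definitions below are the textbook ones and are given for illustration; the exact formulas and corrections used by the tool may differ (the Haldane correction of adding 0.5 to every cell is an assumption here, used simply to avoid division by zero):

```python
def cue_validity(plus, minus):
    """Proportion of the word's co-occurring tokens that are in the
    selection: an estimate of P(selected | word)."""
    return plus / (plus + minus)

def odds_ratio(plus, minus, n_selected, n_rest):
    """Odds ratio of the 2x2 table (selected vs. not, with the word
    vs. without it), with 0.5 added to every cell (Haldane
    correction, assumed) so that empty cells do not break it."""
    a, b = plus + 0.5, minus + 0.5                  # with the word
    c, d = n_selected - plus + 0.5, n_rest - minus + 0.5  # without it
    return (a * d) / (b * c)
```

A context word that co-occurs with many selected tokens and few others gets a high cue validity and an odds ratio well above 1, flagging it as a likely driver of the cluster.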
(ref:nepho2-table) Level 2 for the medoids of *heffen* 'to levy/to lift', and frequency table of the context words co-occurring with the selected tokens across models.
```{r, nepho2-table, fig.cap = "(ref:nepho2-table)", out.width="100%"}
cloud_foto("nephovis-level2-table")
```
The layout of Level 2, showing multiple plots at the same time and linking the tokens across models, is a fruitful source of information, but it has its limits. To exploit more model-specific information, we go to Level 3.
## Level 3 {#nepho3}
Level 3 of the visualization tool shows a zoomable, interactive scatterplot in which each glyph, by default a steel blue circle, represents a token, i.e. an attestation of the target lexical item. An optional second plot has been added to the right, in which each glyph, by default a steel blue star, represents a first-order context word, and the coordinates derive from applying the same dimensionality reduction technique on the type-level cosine distances between the context words.
The name of the model, coding the parameter settings, is indicated on the top, followed by information on the dimensionality reduction technique. Like in the other two levels, it is possible to map colours and shapes to categorical variables, e.g. sense labels, and sizes to numerical variables, e.g. number of available context words, and the legends are clickable, allowing the user to quickly select the items with a given value.
Figure \@ref(fig:nepho3-base) shows what Level 3 looks like if we access it by clicking on the name of the second model in Figure \@ref(fig:nepho2-selection). Colour-coding and selection are transferred between the levels, so we can keep working on the same information if we wish to do so. Conversely, we could change the mappings and selections on Level 3, based on model-specific information, and then return to Level 2 (and refresh the page) to compare the result across models. For example, if the frequency table in Figure \@ref(fig:nepho2-table) had shown us that *glas* 'glass' was also captured in tokens outside our selection, or if we had reason to believe that not all of the selected tokens co-occurred with *glas* 'glass' in this model, we could input `glas/noun` in the `Features in model` field in order to select all the tokens for which *glas* 'glass' was captured in the model, and only those. Or maybe we would like to find the tokens in which *glasje* 'small glass' occurs, but we are not sure how they are lemmatized, so we can input `glasje` in the `Context words` field to find the tokens that include this word form in the concordance line, regardless of whether its lemma was captured by the model^[Admittedly, the names of the fields can be confusing and should probably be changed. Both fields work with partial regex matches, but `Features in model` looks in the list of captured context words, which is a list of lemmas, while `Context words` performs the search on the concordance line, i.e. word forms, regardless of whether the model captured them.].
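The difference between the two search fields comes down to what the regex is matched against. A Python sketch (the data layout, field behaviour and example tokens are illustrative reconstructions of the description, not the tool's code):

```python
import re

def features_in_model(tokens, pattern):
    """`Features in model`: match the regex against the lemmas
    *captured by the model* for each token."""
    rx = re.compile(pattern)
    return [t for t, data in tokens.items()
            if any(rx.search(lemma) for lemma in data["model_cws"])]

def context_words(tokens, pattern):
    """`Context words`: match the regex against the word forms of
    the concordance line, whether or not the model captured them."""
    rx = re.compile(pattern)
    return [t for t, data in tokens.items() if rx.search(data["ctxt"])]

tokens = {  # illustrative data
    "t1": {"model_cws": {"glas/noun"}, "ctxt": "hieven het glas"},
    "t2": {"model_cws": set(),         "ctxt": "hief een glasje"},
}
```

With this toy data, searching `glas/noun` in the first field returns only the token whose model captured the lemma, while searching `glasje` in the second field finds the token by its raw word form even though its model captured nothing.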
In sum, (groups of) tokens can be selected in different ways, either by searching words, inputting the id of the token, clicking on the glyphs or brushing over the plots.^[The (beta) feature of the type-level plot on the right side also enables token selection by clicking on the co-occurring context words (and *vice versa*) but this is still under development.] Given such a selection, clicking on `r fontawesome::fa("sticky-note")` **Open frequency table** will call a pop-up table with one row per context word, a column indicating in how many of the selected tokens it occurs, and more columns with pre-computed information such as `r sc("pmi")` (see Figure \@ref(fig:nepho3-table)). These values can be interesting if we would like to strengthen or weaken filters for a smarter selection of context words.
Like Level 2, Level 3 also offers the concordance line of a token when hovering over it. But unlike Level 2, the concordance can be tailored to the specific model on focus, as shown in Figure \@ref(fig:nepho3-base). The visualization tool itself does not generate a tailored concordance line for each model, but finds a column in the data frame that starts with `_ctxt` and matches the beginning of the name of the model to identify the relevant format. A similar system is used to find the appropriate list of context words captured by the model for each token. In these tailored concordances, the selected context words are shown in boldface and, for `PPMIweight` models such as the one shown in Figure \@ref(fig:nepho3-base), their `r sc("ppmi")` values with the target, e.g. *heffen*, are shown in superscript.
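The column lookup can be sketched as a prefix match. The exact naming scheme of the `_ctxt` columns is assumed here from the description (function name and column names are illustrative):

```python
def tailored_column(columns, model_name, prefix="_ctxt"):
    """Among columns starting with `prefix`, pick the one whose
    remaining part is the beginning of the model name, to identify
    the concordance format relevant to that model (naming scheme
    assumed from the description)."""
    for col in columns:
        if not col.startswith(prefix):
            continue
        suffix = col[len(prefix):].lstrip(".")
        if model_name.startswith(suffix):
            return col
    return None  # no tailored concordance for this model

cols = ["_ctxt.bound5lex", "_ctxt.nobound10all", "sense"]
match = tailored_column(cols, "bound5lexPPMIweight")
```

Because the match is on a name prefix rather than a fixed mapping, new models following the same naming convention pick up their tailored concordances without any change to the tool.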
(ref:nepho3-base) Level 3 for the second medoid of *heffen* 'to levy/to lift': `r nameModel(names(d$heffen$medoidCoords[2]))` with some selected tokens. Hovering over a token shows tailored concordance line.
```{r, nepho3-base, fig.cap = "(ref:nepho3-base)", out.width="100%"}
cloud_foto("nephovis-level3-base", "png")
```
(ref:nepho3-table) Level 3 for the second medoid of *heffen* 'to levy/to lift': `r nameModel(names(d$heffen$medoidCoords[2]))`. The frequency table gives additional information on the context words co-occurring with the selected tokens.
```{r, nepho3-table, fig.cap = "(ref:nepho3-table)", out.width="100%"}
cloud_foto("nephovis-level3-table")
```
As we have seen throughout this chapter, the modelling pipeline returns a wealth of information that requires a complex visualization tool to make sense of it and exploit it efficiently. The Javascript tool described up to now, NephoVis, was developed and used by the same people within the Nephological Semantics project, but is meant to be deployed to a much broader audience that could benefit from its multiple features. It can still grow, and its [open-source code](https://github.com/qlvl/NephoVis/) makes it possible for anyone to adapt it and develop it further. Nevertheless, for practical reasons, an extension was developed in a different language: R. The dashboard described in the next section elaborates on some ideas originally conceived for NephoVis and is particularly tailored to explore the relationship between the t-`r sc("sne")` solutions and the `r sc("hdbscan")` clusters on individual medoids.
## ShinyApp {#shiny}
The visualization tool discussed in this section was written in R with the `shiny` library [@R-shiny], which provides R functions that return `r sc("html")`, `r sc("css")` and Javascript for interactive web-based interfaces. The interactive plots have been rendered with `plotly` [@R-plotly]. Unlike NephoVis, this tool requires an R server to run, so it is hosted on `shinyapps.io` instead of a static GitHub page^[This code is also freely available at https://github.com/montesmariana/Level3.]. It takes the form of a dashboard, shown in Figure \@ref(fig:shiny-basic), with a few tabs, multiple boxes and dropdown menus to explore different lemmas and their medoids. All the functionalities are described on the About page of the dashboard, so only the most relevant features will be described and illustrated here.
The sidebar of the dashboard offers a range of controls. Next to the choice between viewing the dashboard and reading the documentation, two dropdown menus offer the available lemmas and their medoids, by number. Selecting an item makes the full dashboard adapt to return the appropriate information, including the name of the model in the orange header at the top. The bottom half of the sidebar gives us control over the definition of relevant context words in terms of minimum frequency, recall and precision, which will be explained below.
(ref:shiny-basic) Starting view of the [ShinyApp dashboard](https://marianamontes.shinyapps.io/Level3/), extension of Level 3.
```{r, shiny-basic, fig.cap = "(ref:shiny-basic)", out.width="100%"}
cloud_foto("shinyapp-basic")
```
The main tab, **t-SNE**, contains four collapsible boxes: the blue ones focus on tokens, while the green ones focus on first-order context words. The top boxes (Figure \@ref(fig:shiny-tooltips)) show t-`r sc("sne")` representations (perplexity 30) of tokens and their context words respectively, as we would find on Level 3 of NephoVis. However, the differences with NephoVis are crucial.
First, the colours match pre-computed `r sc("hdbscan")` clusters ($minPts = 8$) and cannot be changed; in addition, the transparency of the tokens reflects their $\varepsilon$. The goal of this dashboard is, after all, to combine the `r sc("2d")` visualization and the `r sc("hdbscan")` clustering for a better understanding of the models. This functionality is not currently available in NephoVis because, unlike sense tags, the clustering solution is a model-dependent categorical variable^[The current code is not suited to adapt the automatic selection of categorical variables to model-dependent ones, and adding the clustering solution for each of the models would clutter the list of categorical variables.].
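The mapping from $\varepsilon$ to transparency can be sketched as a linear rescaling (a Python sketch of the general idea; the opacity range, and the exact mapping used by the dashboard, are assumptions):

```python
def eps_to_alpha(eps_values, min_alpha=0.25, max_alpha=1.0):
    """Map each token's HDBSCAN epsilon to an opacity: tokens that only
    join their cluster at a large epsilon (i.e. in sparser regions) are
    drawn more transparently. The alpha range is illustrative."""
    lo, hi = min(eps_values), max(eps_values)
    if hi == lo:  # all tokens equally dense: full opacity
        return [max_alpha] * len(eps_values)
    span = max_alpha - min_alpha
    return [max_alpha - (e - lo) / (hi - lo) * span for e in eps_values]

print(eps_to_alpha([0.0, 1.0, 2.0]))  # [1.0, 0.625, 0.25]
```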
Second, the type-level plot does not use stars but the lemmas of the context words themselves. More importantly, they are matched to the `r sc("hdbscan")` clusters based on the measures of frequency, precision and recall. In short, only context words that can be deemed relevant for the definition or characterization of a cluster are clearly visible and assigned the colour of the cluster they represent best; the rest of the context words are faded into the background. A radio button on the sidebar offers the option to highlight context words that are "relevant" for the noise tokens as well.
Third, the tooltips offer different information from NephoVis: the list of captured context words in the case of tokens, and the relevance measures as well as the nearest neighbours of the context word in the type-level plot. For example, on the left side of Figure \@ref(fig:shiny-tooltips) we see the same token-level model shown in Figure \@ref(fig:nepho3-base). Hovering over one of the tokens in the bottom left light blue cluster, we can see the list of context words that the model captures for it: the same we could have seen in bold in the NephoVis rendering by hovering over the same token. Among them, *glas/noun* 'glass' is highlighted, because it is the only one that surpasses the relevance thresholds we have set. On the right side of the figure, i.e. the type-level plot, we can see the similarities between the context words that surpass these thresholds for any cluster, and hovering over one of them provides us with additional information. In the case of *glas/noun* 'glass', the first line reports that it represents 31 tokens in the light blue `r sc("hdbscan")` cluster, with a recall of 0.94, i.e. it co-occurs with 94% of the tokens in the cluster, and a precision of 1, i.e. it only co-occurs with tokens in that cluster. Below we see a list of the nearest neighbours, that is, the context words most similar to it at type level, and their cosine similarity. The fact that the similarity with its nearest neighbour is 0.77 (in a range from 0 to 1) is worrisome.
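The three measures reported in the tooltip can be reconstructed from their definitions (a Python sketch with invented toy data; the dashboard itself computes them in R):

```python
def relevance(token_contexts, clusters, word, cluster):
    """Frequency, recall and precision of a context word for a cluster:
    recall    = share of the cluster's tokens that co-occur with the word;
    precision = share of the word's tokens that belong to the cluster."""
    with_word = {t for t, ctx in token_contexts.items() if word in ctx}
    in_cluster = {t for t, c in clusters.items() if c == cluster}
    freq = len(with_word & in_cluster)
    recall = freq / len(in_cluster) if in_cluster else 0.0
    precision = freq / len(with_word) if with_word else 0.0
    return freq, recall, precision

# Toy data: the context words captured for each token, and its cluster.
ctx = {"t1": {"glas/noun"}, "t2": {"glas/noun", "bier/noun"}, "t3": {"bier/noun"}}
cl = {"t1": 1, "t2": 1, "t3": 2}
print(relevance(ctx, cl, "glas/noun", 1))  # (2, 1.0, 1.0)
```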
(ref:shiny-tooltips) Top boxes of the **t-SNE** tab of the [ShinyApp dashboard](https://marianamontes.shinyapps.io/Level3/), with active tooltips.
```{r, shiny-tooltips, fig.cap = "(ref:shiny-tooltips)", out.width="100%"}
cloud_foto("shinyapp-tooltips", "png")
```
The two bottom boxes of the tab show, respectively, the concordance lines with highlighted context words and information on cluster and sense, and a scatterplot mapping each context word to its precision, recall and frequency in each cluster. The darker lines inside the plot are a guide towards the threshold: in this case, relevant context words need a minimum precision or recall of 0.5, but if the thresholds were modified the lines would move accordingly. The colours indicate the cluster the context word represents, and the size its frequency in it, also reported in the tooltip. Unlike in the type-level plot above, here we can see whether context words co-occur with tokens from different clusters. Figure \@ref(fig:shiny-bottom) shows the right-side box next to the top token-level box. When one of its dots is clicked, the tokens co-occurring with that context word (regardless of their cluster) will be highlighted in the token-level plot, and the table of concordance lines will be filtered to the same selection of tokens.
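The thresholding that separates relevant from faded context words can be sketched as follows (a Python sketch; the 0.5 defaults follow the example in the text, while the default minimum frequency is a hypothetical value, since all three thresholds are user-configurable in the sidebar):

```python
def is_relevant(freq, recall, precision,
                min_freq=3, min_recall=0.5, min_precision=0.5):
    """A context word counts as relevant for a cluster when it co-occurs
    with enough of its tokens and clears at least one of the two quality
    thresholds (precision OR recall)."""
    return freq >= min_freq and (recall >= min_recall or precision >= min_precision)

# glas/noun from the running example: frequency 31, recall 0.94, precision 1.
print(is_relevant(31, 0.94, 1.0))  # True
```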
(ref:shiny-bottom) Token-level plot and bottom plot of context words in the **t-SNE** tab of the [ShinyApp dashboard](https://marianamontes.shinyapps.io/Level3/), with one context word selected.
```{r, shiny-bottom, fig.cap = "(ref:shiny-bottom)", out.width="100%"}
cloud_foto("shinyapp-selection")
```
The first tab of this dashboard is an extremely useful tool to explore the `r sc("hdbscan")` clusters, their (mis)match with the t-`r sc("sne")` representation and the role of the context words. In addition, the **HDBSCAN structure** tab provides information on the proportion of noise per medoid and the relationship between $\varepsilon$ and sense distribution in each cluster. Finally, the **Heatmap** tab illustrates the type-level distances between the relevant context words, ordered and coloured by cluster, as shown in Figure \@ref(fig:shiny-heatmap). In some cases it confirms the patterns found in the type-level plot; in others, as in this model, it shows that most of the context words are extremely different from each other, forming no clear patterns. This is a typical result in `5000all` models like the one shown here and tends to lead to bad token-level models as well.
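The computation behind such a heatmap (pairwise cosine distances between relevant context words, ordered by cluster) can be sketched as follows (a Python sketch of the general idea, with toy vectors; the app itself builds and renders the matrix in R):

```python
from math import sqrt

def cosine_distance(u, v):
    """1 minus the cosine similarity of two (non-zero) type-level vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def heatmap_matrix(vectors, cluster_of):
    """Pairwise distance matrix over context words, with rows and columns
    ordered by the cluster each word represents."""
    words = sorted(vectors, key=lambda w: cluster_of[w])
    return words, [[cosine_distance(vectors[a], vectors[b]) for b in words]
                   for a in words]

# Toy type-level vectors and cluster assignments:
vecs = {"glas/noun": [1.0, 0.0], "wet/noun": [0.0, 1.0], "bier/noun": [1.0, 1.0]}
cl = {"glas/noun": 1, "wet/noun": 2, "bier/noun": 1}
words, mat = heatmap_matrix(vecs, cl)
print(words)  # ['glas/noun', 'bier/noun', 'wet/noun']
```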
(ref:shiny-heatmap) Heatmap of type-level distances between relevant context words in the [ShinyApp dashboard](https://marianamontes.shinyapps.io/Level3/).
```{r, shiny-heatmap, fig.cap = "(ref:shiny-heatmap)", out.width="100%"}
cloud_foto("shinyapp-heatmap")
```
## Summary {#nepho-summary}
In this chapter two visualization tools for the exploration of token-level distributional models have been described. Both are open-source, web-based and interactive. They were developed within the Nephological Semantics projects at KU Leuven and constitute the backbone of the research described in this dissertation.
Data visualization can be beautiful and contribute to successful communication, but its main goal is to provide insight [@card.etal_1999]. Indeed, these tools have provided a valuable interface to an otherwise inscrutable mass of data.
NephoVis offers an informative path from the organization of models to the organization of tokens, representing abstract differences generated by complicated algorithms as intuitive distances between points on a screen. Selecting different kinds of models and moving back and forth between different levels of granularity is just a click away and incorporates various sources of information simultaneously: find all models with window size of 5, look at them side by side, zoom in on the prettiest one, read a token, read the token next to it, find out its sense annotation, go back to the selection of models... Abstract corpus-based similarities between instances of a word, and between *ways* of representing these similarities (i.e. the models) become tangible, colourful clouds on a screen.
Most of the points discussed in the second part of this dissertation would have been simply impossible if it were not for these tools. Hopefully, they will prove at least half as valuable in future research projects.