One source of data for empirical research in linguistics is corpora: large collections of machine-readable texts. They contain authentic usage by many people, allowing us to sample how language is used within a community, e.g. which expressions are used and what meanings they take. However, the different meanings of a word are not normally coded in the text, and identifying them requires manual annotation, which is time-consuming and resource-intensive. This PhD investigates the possibility of applying a computational technique, namely distributional modelling, to descriptive semantic research. This technique represents words or instances of words as numbers derived from their behaviour in corpora, following the Distributional Hypothesis: words that occur in similar contexts have similar meanings. The full procedure results in 2D scatterplots where each point is an occurrence of a word, and points that are close to each other occur in similar contexts and thus have, according to the Distributional Hypothesis, similar meanings. The different shapes that emerge in these scatterplots are called clouds.
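The idea of representing occurrences as numbers can be illustrated with a minimal sketch. It is not the pipeline used in this PhD: the toy sentences, the stopword list, the use of raw co-occurrence counts and the choice of cosine similarity are all assumptions made for illustration, and the final projection to 2D is left out.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy occurrences of the target word "bank" (not real corpus data).
occurrences = [
    "she deposited the money at the bank this morning",
    "the bank approved the loan and the money arrived",
    "they sat on the grassy bank of the river",
    "the river overflowed its bank after the storm",
]
# Assumed stopword list: function words are ignored as context features.
STOPWORDS = {"the", "at", "this", "and", "on", "of", "its", "after", "she", "they"}


def context_vector(sentence, target="bank"):
    """Count the content words that co-occur with the target in one occurrence."""
    words = sentence.split()
    return Counter(w for w in words if w != target and w not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# One vector per occurrence, then all pairwise similarities.
vectors = [context_vector(s) for s in occurrences]
similarity = [[cosine(u, v) for v in vectors] for u in vectors]

for row in similarity:
    print(" ".join(f"{value:.2f}" for value in row))
```

In this toy example the two financial occurrences share a context word, as do the two river occurrences, so each pair is more similar internally than across pairs. A dimensionality reduction step applied to the corresponding distances would then place each occurrence as a point in a 2D scatterplot.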
As part of the investigation, we have developed an interactive visualization tool for thoroughly exploring the results, with two questions in mind: (1) how different settings impact the results and (2) to what extent meaning can be modelled with this technique. The first conclusion is that each setting affects each word differently, and no configuration gives the best result across the board. The second conclusion is that points that are close together occur in similar contexts but may not have the same meaning, while points that are far apart because they occur in different contexts may still have the same meaning. Nevertheless, the visualization tool offers a way of exploring the results on different case studies and extracting semantic information that goes beyond the identification of different meanings.