Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts, Schick et al., 2019
Paper, Tags: #embeddings
Learning embeddings for rare words is a hard problem because of sparse context information. Mimicking: given embeddings learned by a standard algorithm, a model is first trained to reproduce embeddings of frequent words from their surface form and then used to compute embeddings for rare words. This method only works if a word's meaning can be at least partially predicted from its form.
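A minimal sketch of the mimicking setup, under simplifying assumptions (the form encoder here just averages character-n-gram embeddings; names and hyperparameters are illustrative, not the paper's exact configuration): train the form model to reproduce the pretrained embeddings of frequent words, then apply it to rare or unseen words.

```python
import torch
import torch.nn as nn

class FormModel(nn.Module):
    """Toy surface-form encoder: averages character-n-gram embeddings.
    Sizes and the exact encoder are assumptions, not the paper's setup."""
    def __init__(self, n_ngrams, emb_dim):
        super().__init__()
        self.ngram_emb = nn.Embedding(n_ngrams, emb_dim)

    def forward(self, ngram_ids):                 # (batch, max_ngrams)
        return self.ngram_emb(ngram_ids).mean(dim=1)

def mimicking_loss(form_model, ngram_ids, target_embeddings):
    """Squared distance to the pretrained embedding of a frequent word;
    at test time the trained form model embeds rare words directly."""
    predicted = form_model(ngram_ids)
    return ((predicted - target_embeddings) ** 2).sum(dim=-1).mean()
```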
In this paper we introduce attentive mimicking: the mimicking model is given access not only to a word’s surface form, but also to all available contexts and learns to attend to the most informative and reliable contexts for computing an embedding.
Schick and Schütze (2019) introduce the form-context model and show that joint learning from both surface form and context leads to better performance.
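A rough sketch of the form-context combination as I read it: a learned gate decides how much weight to put on the context embedding versus the form embedding. The exact parameterization (additional linear maps, how the gate is computed) differs in the original model; this only shows the general shape.

```python
import torch
import torch.nn as nn

class FormContextGate(nn.Module):
    """Gated combination of a form embedding and a context embedding.
    The scalar gate alpha is predicted from both inputs; details are a
    simplification of the form-context model."""
    def __init__(self, emb_dim):
        super().__init__()
        self.gate = nn.Linear(2 * emb_dim, 1)

    def forward(self, v_form, v_context):          # both: (batch, emb_dim)
        alpha = torch.sigmoid(self.gate(torch.cat([v_form, v_context], dim=-1)))
        return alpha * v_context + (1 - alpha) * v_form
```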
Often only a few context words provide valuable information about a word's meaning, yet models treat all contexts the same. In our paper, instead of using all contexts equally, we learn, by way of self-attention, to focus on a subset of especially informative and reliable contexts; we call this architecture attentive mimicking.
It produces high-quality embeddings for rare and medium-frequency words by attending to the most informative contexts.
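A sketch of the attention step, under the assumption that each context is first reduced to a bag-of-words vector and then weighted by how consistent it is with the other contexts (the paper's exact scoring function may differ from the bilinear score used here):

```python
import torch
import torch.nn as nn

class AttentiveContextPooling(nn.Module):
    """Weights each context by a self-attention score so that informative,
    mutually consistent contexts dominate the pooled context embedding.
    Simplified sketch: bilinear pairwise scores, averaged per context,
    then normalized with a softmax."""
    def __init__(self, emb_dim):
        super().__init__()
        self.W = nn.Linear(emb_dim, emb_dim, bias=False)

    def forward(self, context_vectors):             # (n_contexts, emb_dim)
        scores = context_vectors @ self.W(context_vectors).t()   # pairwise scores
        weights = torch.softmax(scores.mean(dim=1), dim=0)        # one weight per context
        return (weights.unsqueeze(1) * context_vectors).sum(dim=0)
```

The pooled vector can then be fed into the gated form-context combination above in place of a plain average over all contexts.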
Limitations:
- For processing contexts, it uses a simple bag-of-words (BoW) model, making poor use of the available information
- It combines form and context in a shallow fashion.