Learning Semantic Representations for Novel Words: Leveraging Both Form and Context, Schick and Schütze, 2018
Paper, Tags: #nlp, #embeddings
Currently there are two approaches to learning embeddings for novel words:
- Learning an embedding from the novel word's surface form (subword n-grams); a sketch of this follows below.
- Learning an embedding from the contexts in which it occurs.
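To make the form-based approach concrete, here is a minimal fastText-style sketch in Python/numpy: the novel word's embedding is the average of its character n-gram vectors. The `ngram_vectors` lookup and the n-gram range are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams with boundary markers (fastText-style),
    so prefixes and suffixes stay distinguishable from word-internal n-grams."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def form_embedding(word, ngram_vectors, dim=300):
    """Embed a novel word as the average of its n-gram vectors.
    `ngram_vectors` is a hypothetical dict: n-gram -> np.ndarray of shape (dim,)."""
    vecs = [ngram_vectors[ng] for ng in char_ngrams(word) if ng in ngram_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

With such a lookup in hand, `form_embedding("unemployable", ngram_vectors)` produces a vector even though the full word was never seen during training.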
One problem with word embeddings is that they need many observations of a word before the embedding becomes reliable, so they struggle with small corpora and infrequent words. And because models are trained with a fixed vocabulary, they can't deal with out-of-vocabulary (OOV) words at inference time. Some solutions include:
- using subword information, exploiting what can be extracted from the word's surface form
  - e.g. unemployable -> un, employ, able
- using context information, inferring embeddings for novel words from the words surrounding them (sketched after this list)
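A minimal additive sketch of the context-based approach: average the embeddings of all words that co-occur with the novel word across its observed contexts. The `word_vectors` lookup is again a hypothetical stand-in for the given embedding set.

```python
import numpy as np

def context_embedding(contexts, word_vectors, dim=300):
    """Embed a novel word as the average vector of its surrounding words.
    `contexts` is a list of tokenized sentences containing the novel word
    (the novel word itself should be excluded, since it has no vector);
    `word_vectors` is a hypothetical dict: token -> np.ndarray of shape (dim,)."""
    vecs = [word_vectors[tok]
            for sentence in contexts
            for tok in sentence
            if tok in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

The more contexts are available, the more this average converges toward a stable estimate, which is exactly why purely context-based estimates are weak for very rare words.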
The paper leverages both sources of information and shows that combining them yields large gains in embedding quality. As input it only requires a set of pre-trained word embeddings and an unlabeled corpus. A sketch of one way to combine the two estimates follows.
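One plausible combination, roughly in the spirit of the paper's gated form-context idea, is a learned gate that decides how much to trust the context estimate versus the form estimate. The gating function and its parameters (`w`, `b`) here are illustrative assumptions; in practice they would be trained to mimic the gold embeddings of frequent words, for which both estimates and a reliable target vector exist.

```python
import numpy as np

def combined_embedding(v_form, v_context, w, b):
    """Gated combination of the two estimates:
        alpha = sigmoid(w . [v_form; v_context] + b)
        v     = alpha * v_context + (1 - alpha) * v_form
    `w` has shape (2 * dim,) and `b` is a scalar; both are learned."""
    z = np.dot(w, np.concatenate([v_form, v_context])) + b
    alpha = 1.0 / (1.0 + np.exp(-z))  # how much weight the context estimate gets
    return alpha * v_context + (1.0 - alpha) * v_form
```

Intuitively, the gate can lean on the surface form when few contexts are available and shift toward the context estimate as evidence accumulates.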