Word embeddings are trained on unlabeled data and then applied to a downstream supervised task. But the supervised task corpus often contains out-of-vocabulary (OOV) words that the embeddings were never trained on:
- Names
- Domain-specific jargon
- Foreign words
- Rare morphological derivations
- Nonce words
- Nonstandard orthography
- Typos and other errors
Common OOV handling techniques:
- None: randomly initialize a vector for each OOV word (see the sketch after this list)
- One UNK to rule them all: map every OOV word to a single shared UNK vector
- Add a subword model during word-embedding training (e.g. fastText's character n-grams)
- Char2Tag: add a subword layer to the supervised task model, so OOVs benefit from a co-trained character model
- MIMICK: use subword units as inputs and the pre-trained word vectors as the training target
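The first two baselines amount to a lookup rule when building the embedding matrix for the supervised task. Here is a minimal sketch; the vocabulary, dimensionality, and vectors are illustrative assumptions, not real pre-trained data:

```python
# Sketch of the two simplest baselines: an OOV word either gets its own
# randomly initialized vector or is collapsed onto a single shared UNK vector.
# All names (pretrained, task_vocab, dim) are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
pretrained = {"cat": rng.normal(size=dim), "dog": rng.normal(size=dim)}  # stand-in for real vectors
task_vocab = ["cat", "dog", "covfefe", "xylograph"]                      # supervised-task vocabulary

unk = rng.normal(scale=0.1, size=dim)        # single shared UNK vector

def lookup(word, strategy="random"):
    """Return a vector for `word`; OOVs handled per `strategy`."""
    if word in pretrained:
        return pretrained[word]
    if strategy == "random":                 # baseline 1: per-word random init
        return rng.normal(scale=0.1, size=dim)
    return unk                               # baseline 2: one UNK to rule them all

emb_matrix = np.stack([lookup(w, strategy="unk") for w in task_vocab])
print(emb_matrix.shape)                      # (4, 50)
```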
MIMICK is a post-hoc OOV-extension step applied to pre-trained embeddings before the downstream task: a character-level model is trained to reproduce the pre-trained vectors of in-vocabulary words, then used to generate vectors for OOV words. It is a powerful technique for low-resource scenarios, and especially useful for morphologically rich languages and languages with large character vocabularies.
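A minimal sketch of the MIMICK idea in PyTorch: train a character-level BiLSTM to regress onto the pre-trained vectors of in-vocabulary words, then run it on an OOV word to synthesize a vector from its characters. The toy character vocabulary, dimensions, and random "pre-trained" targets are assumptions for illustration; real usage would load actual word2vec/GloVe vectors and follow the paper's architecture details.

```python
import torch
import torch.nn as nn

char_vocab = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}  # 0 = padding
emb_dim = 50      # dimensionality of the pre-trained word embeddings (assumed)

class Mimick(nn.Module):
    def __init__(self, n_chars, char_dim=20, hidden=64, out_dim=emb_dim):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, out_dim))

    def forward(self, char_ids):                      # char_ids: (batch, max_len)
        h, _ = self.lstm(self.char_emb(char_ids))     # (batch, max_len, 2*hidden)
        return self.proj(h[:, -1, :])                 # last time step, for simplicity

def encode(word, max_len=12):
    ids = [char_vocab.get(c, 0) for c in word[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Toy "pre-trained" vectors used as regression targets; real usage loads word2vec/GloVe.
pretrained = {"cat": torch.randn(emb_dim), "cats": torch.randn(emb_dim), "dog": torch.randn(emb_dim)}

model = Mimick(n_chars=len(char_vocab) + 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
words = list(pretrained)
X = torch.tensor([encode(w) for w in words])
Y = torch.stack([pretrained[w] for w in words])

for _ in range(100):                                  # fit the mimicking objective (L2 loss)
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), Y)
    loss.backward()
    opt.step()

# At test time, an OOV word gets a vector composed from its characters.
oov_vec = model(torch.tensor([encode("doggo")]))
print(oov_vec.shape)                                  # torch.Size([1, 50])
```

The synthesized OOV vectors are simply added to the embedding table of the downstream model; the rest of the pipeline is unchanged.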