# Mimicking Word Embeddings using Subword RNNs, Pinter et al., 2017

Paper, Code, Tags: #nlp, #embeddings

Word embeddings are trained on unlabeled data and then applied to a downstream supervised task. But the supervised task corpus often contains out-of-vocabulary (OOV) words:

- Names
- Domain-specific jargon
- Foreign words
- Rare morphological derivations
- Nonce words
- Nonstandard orthography
- Typos and other errors

Common OOV handling techniques:

- None: just random initialization
- One UNK to rule them all
- Add a subword model during word-embedding (WE) training
- Char2Tag: add a subword layer to the supervised task, so OOVs benefit from the co-trained character model
- MIMICK: use subword units as inputs and the pre-trained vectors as targets

MIMICK is a post-hoc OOV-extension step: trained once against the pre-trained vectors, it can then generate an embedding for any OOV word in a downstream task. It is a powerful technique for low-resource scenarios, and especially good for morphologically rich languages and scripts with large character vocabularies.
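
The model is a character bi-LSTM whose final states feed an MLP, trained to minimize the distance to each in-vocabulary word's pre-trained vector. Below is a minimal sketch of the idea in PyTorch (hypothetical code with made-up dimensions, not the authors' implementation; MSE stands in for the paper's squared-Euclidean objective):

```python
import torch
import torch.nn as nn

class Mimick(nn.Module):
    """Char bi-LSTM mapping a word's spelling into the pre-trained embedding space."""
    def __init__(self, n_chars, char_dim=20, hidden_dim=50, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)
        # MLP from the concatenated final forward/backward states to the target space
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, emb_dim),
        )

    def forward(self, char_ids):            # char_ids: (batch, max_word_len)
        x = self.char_emb(char_ids)
        _, (h, _) = self.lstm(x)             # h: (2, batch, hidden_dim)
        return self.mlp(torch.cat([h[0], h[1]], dim=-1))

# Training sketch: fit on (spelling, pre-trained vector) pairs from the original
# vocabulary; the random tensors below are stand-ins for real data.
model = Mimick(n_chars=128)
opt = torch.optim.Adam(model.parameters())
char_ids = torch.randint(0, 128, (32, 10))   # a batch of in-vocabulary words
targets = torch.randn(32, 100)               # their pre-trained embeddings
loss = nn.functional.mse_loss(model(char_ids), targets)
opt.zero_grad(); loss.backward(); opt.step()
```

At test time, feeding the character sequence of any unseen word yields a vector in the same space as the original embeddings, so it can be dropped straight into the downstream model's lookup table.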