# Mimicking Word Embeddings using Subword RNNs, Pinter et al., 2017

Paper, Code, Tags: #nlp, #embeddings

Word embeddings are trained on unlabeled data and then applied to a downstream supervised task. But the supervised task corpus often contains out-of-vocabulary (OOV) words:

- Names
- Domain-specific jargon
- Foreign words
- Rare morphological derivations
- Nonce words
- Nonstandard orthography
- Typos and other errors

Common OOV handling techniques:

- None: just random initialization
- One UNK to rule them all
- Add a subword model during word-embedding (WE) training
- Char2Tag: add a subword layer to the supervised task, so OOVs benefit from the co-trained character model
- MIMICK: use subword units as inputs and the pre-trained vectors as targets

MIMICK is a post-hoc OOV-extension step: trained once against the pre-trained vectors, it can then generate an embedding for any OOV word in a downstream task. It is a powerful technique for low-resource scenarios, and especially good for morphologically rich languages and scripts with large character vocabularies.
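
The model is a character bi-LSTM whose final states feed an MLP, trained to minimize the distance to each in-vocabulary word's pre-trained vector. Below is a minimal sketch of the idea in PyTorch (hypothetical code with made-up dimensions, not the authors' implementation; MSE stands in for the paper's squared-Euclidean objective):

```python
import torch
import torch.nn as nn

class Mimick(nn.Module):
    """Char bi-LSTM mapping a word's spelling into the pre-trained embedding space."""
    def __init__(self, n_chars, char_dim=20, hidden_dim=50, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)
        # MLP from the concatenated final forward/backward states to the target space
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, emb_dim),
        )

    def forward(self, char_ids):            # char_ids: (batch, max_word_len)
        x = self.char_emb(char_ids)
        _, (h, _) = self.lstm(x)             # h: (2, batch, hidden_dim)
        return self.mlp(torch.cat([h[0], h[1]], dim=-1))

# Training sketch: fit on (spelling, pre-trained vector) pairs from the original
# vocabulary; the random tensors below are stand-ins for real data.
model = Mimick(n_chars=128)
opt = torch.optim.Adam(model.parameters())
char_ids = torch.randint(0, 128, (32, 10))   # a batch of in-vocabulary words
targets = torch.randn(32, 100)               # their pre-trained embeddings
loss = nn.functional.mse_loss(model(char_ids), targets)
opt.zero_grad(); loss.backward(); opt.step()
```

At test time, feeding the character sequence of any unseen word yields a vector in the same space as the original embeddings, so it can be dropped straight into the downstream model's lookup table.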