Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, Kudo, 2018
Paper, Tags: #nlp, #embeddings
The question addressed in this paper is whether it is possible to harness segmentation ambiguity as noise to improve the robustness of neural machine translation (NMT).
We present a regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. We show that subword regularization achieves significant improvements over the method using a single subword sequence.
Also, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model.
To predict the target subword, a common choice is an RNN architecture. NMT is trained with standard maximum likelihood estimation, maximizing the log-likelihood of a given parallel corpus.
We can't optimize the marginal log-likelihood exactly because the number of possible segmentations increases exponentially with the sentence length. Instead, we approximate it with a finite number of k sequences sampled from P(x|X) and P(y|Y) respectively (k = 1 in practice).
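Written out (my reconstruction of the paper's formulation; notation approximate), the standard objective and its marginalized counterpart are

$$
\mathcal{L}(\theta) = \sum_{s=1}^{|D|} \log P(\mathbf{y}^{(s)} \mid \mathbf{x}^{(s)}; \theta),
\qquad
\mathcal{L}_{\text{marginal}}(\theta) = \sum_{s=1}^{|D|} \mathbb{E}_{\substack{\mathbf{x} \sim P(\mathbf{x} \mid X^{(s)}) \\ \mathbf{y} \sim P(\mathbf{y} \mid Y^{(s)})}} \big[ \log P(\mathbf{y} \mid \mathbf{x}; \theta) \big],
$$

and the sampled approximation with k draws per sentence pair is

$$
\mathcal{L}_{\text{marginal}}(\theta) \approx \frac{1}{k} \sum_{s=1}^{|D|} \sum_{j=1}^{k} \log P(\mathbf{y}_j \mid \mathbf{x}_j; \theta),
\qquad \mathbf{x}_j \sim P(\mathbf{x} \mid X^{(s)}),\; \mathbf{y}_j \sim P(\mathbf{y} \mid Y^{(s)}),
$$

with k = 1 and a fresh sample drawn at every parameter update.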
BPE is based on a greedy and deterministic symbol replacement, which can't provide multiple segmentations with probabilities.
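A toy illustration (mine, not from the paper) of why BPE is deterministic: merges are applied greedily in a fixed priority order, so a given input always maps to exactly one segmentation and no probabilities are involved.

```python
# Minimal sketch of BPE encoding with a fixed, ordered merge list (illustrative,
# not the reference implementation).
def bpe_encode(word, merges):
    """merges: list of (left, right) pairs, highest priority first."""
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, best_i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges left
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols

merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
print(bpe_encode("hello", merges))  # always ['hello'] -- no alternative outputs
```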
The unigram LM assumes that each subword occurs independently, and therefore the probability of a subword sequence is the product of the subword occurrence probabilities.
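Concretely, for a segmentation x = (x_1, ..., x_M):

$$
P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i), \qquad \forall i\ x_i \in \mathcal{V}, \qquad \sum_{x \in \mathcal{V}} p(x) = 1,
$$

where $\mathcal{V}$ is a predetermined vocabulary; the most probable segmentation of a sentence can then be found with the Viterbi algorithm.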
If the vocabulary is given, we can estimate the subword occurrence probabilities via the EM algorithm, but in a realistic setting the vocabulary itself is not known in advance. It is built with an iterative algorithm that starts from a large seed vocabulary and progressively prunes it down to the desired size (see Section 3.2 of the paper).
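A toy sketch (my own, not the paper's code) of the EM updates when the vocabulary is fixed: the E-step computes expected subword counts with forward-backward sums over the segmentation lattice, and the M-step re-normalizes them. The full procedure in Section 3.2 additionally interleaves this with vocabulary pruning.

```python
from collections import defaultdict

def em_step(sentences, probs, max_len=8):
    """One EM iteration: expected subword counts via forward-backward over the
    segmentation lattice, then re-normalized into new probabilities."""
    counts = defaultdict(float)
    for s in sentences:
        n = len(s)
        # fwd[i] = total probability of all segmentations of s[:i]
        fwd = [0.0] * (n + 1); fwd[0] = 1.0
        for i in range(1, n + 1):
            for j in range(max(0, i - max_len), i):
                w = s[j:i]
                if w in probs:
                    fwd[i] += fwd[j] * probs[w]
        # bwd[i] = total probability of all segmentations of s[i:]
        bwd = [0.0] * (n + 1); bwd[n] = 1.0
        for i in range(n - 1, -1, -1):
            for j in range(i + 1, min(n, i + max_len) + 1):
                w = s[i:j]
                if w in probs:
                    bwd[i] += probs[w] * bwd[j]
        if fwd[n] == 0.0:
            continue  # sentence cannot be segmented with this vocabulary
        # E-step: expected count of each subword occurrence
        for i in range(n):
            for j in range(i + 1, min(n, i + max_len) + 1):
                w = s[i:j]
                if w in probs:
                    counts[w] += fwd[i] * probs[w] * bwd[j] / fwd[n]
    # M-step: re-normalize expected counts into probabilities
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

vocab = {"h": 0.2, "e": 0.2, "l": 0.2, "o": 0.2, "he": 0.1, "ll": 0.1}
for _ in range(5):
    vocab = em_step(["hello", "hell"], vocab)
print(sorted(vocab.items(), key=lambda kv: -kv[1]))
```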
The unigram LM is more flexible than BPE because it's based on a probabilistic LM and can output multiple segmentations with their probabilities.
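For instance, the SentencePiece implementation (also by Kudo) exposes exactly this. A quick sketch, assuming a unigram model has already been trained and saved as `unigram.model` (placeholder name), with method names as in the library's Python README:

```python
# Sketch using the sentencepiece library (pip install sentencepiece); assumes a
# unigram model was trained beforehand, e.g. via
#   spm.SentencePieceTrainer.Train('--input=corpus.txt --model_prefix=unigram '
#                                  '--vocab_size=8000 --model_type=unigram')
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("unigram.model")  # placeholder path

text = "Hello world."
# n-best segmentations (available because the unigram LM scores each candidate)
for pieces in sp.nbest_encode_as_pieces(text, 5):
    print(pieces)
# one segmentation sampled from P(x|X); -1 = sample over all hypotheses,
# 0.1 = smoothing parameter alpha
print(sp.sample_encode_as_pieces(text, -1, 0.1))
```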
We virtually augment training data with on-the-fly subword sampling, which helps improve the accuracy and robustness of NMT models.
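A rough sketch (mine, not the paper's code) of how that on-the-fly sampling sits around the training loop, again assuming pre-trained SentencePiece unigram models for source and target (file names are placeholders); the NMT model and optimizer code are omitted:

```python
# On-the-fly subword sampling: instead of segmenting the corpus once up front,
# draw a fresh segmentation for every sentence pair each time it is served to
# the trainer.
import sentencepiece as spm

sp_src = spm.SentencePieceProcessor(); sp_src.load("src_unigram.model")
sp_tgt = spm.SentencePieceProcessor(); sp_tgt.load("tgt_unigram.model")

def sampled_pairs(parallel_corpus, alpha=0.1):
    """Yield one freshly sampled (source, target) segmentation per pair (k = 1)."""
    for raw_src, raw_tgt in parallel_corpus:
        x = sp_src.sample_encode_as_pieces(raw_src, -1, alpha)
        y = sp_tgt.sample_encode_as_pieces(raw_tgt, -1, alpha)
        yield x, y  # feed these to the NMT training step

# Re-iterating the generator each epoch means the same raw pair is usually
# segmented differently from one epoch to the next, which is the "virtual"
# data augmentation described above.
```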