We encode rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via units smaller than words, such as names, compounds, cognates and loanwords.
Translation is an open-vocabulary problem.
For word-level NMT models, the translation of out-of-vocabulary (OOV) words has been addressed through a back-off to a dictionary look-up. But this makes assumptions that often don't hold in practice: there isn't always a 1-to-1 correspondence between source and target words.
We want to operate on the level of subword units and model open-vocabulary translation in the NMT network itself, without requiring a back-off model for rare words.
We also adapt byte pair encoding (BPE), a compression algorithm, to the task of word segmentation. BPE allows an open vocabulary to be represented through a fixed-size vocabulary of variable-length character sequences.
The translation of some words is transparent: they are translatable even when they are novel to the translator, based on a translation of known subword units such as morphemes or phonemes. Word categories whose translation is potentially transparent:
- Named entities: Barack Obama
- Cognates and loanwords: claustrophobia
- Morphologically complex words: words containing multiple morphemes, formed via compounding, affixation or inflection, can be translated by translating the morphemes separately: Sonnensystem
Our hypothesis is that a segmentation of rare words into appropriate subword units is sufficient to allow the neural network to learn transparent translations, and to generalize this knowledge to translate and produce unseen words.
Instead of the algorithm's classic use of merging frequent pairs of bytes, we merge characters or character sequences.
First, we initialize the symbol vocabulary with the character vocabulary and represent each word as a sequence of characters plus a special end-of-word symbol, which allows us to restore the original tokenization after translation.
Then, we iteratively count all symbol pairs and replace each occurrence of the most frequent pair ('A', 'B') with a new symbol ('AB').
Each merge operation produces a new symbol that represents a character n-gram. Frequent character n-grams (or even whole words) are eventually merged into a single symbol, so BPE requires no shortlist. The final symbol vocabulary size equals the size of the initial vocabulary plus the number of merge operations, which is a hyperparameter of the algorithm.
We don't consider pairs that cross word boundaries. The algorithm can be run on the dictionary extracted from a text, with each word being weighted by its frequency.
Minimal algorithm:
```python
# minimal BPE learning, from arXiv:1508.07909
import re
import collections

def get_stats(vocab):
    """Count the frequency of every adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the given symbol pair with the merged symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5,
         'l o w e r </w>': 2,
         'n e w e s t </w>': 6,
         'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
```
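Running this toy example (in Python 3, where dictionary insertion order breaks ties between equally frequent pairs), the first merges printed are ('e', 's'), ('es', 't') and ('est', '</w>'), reflecting the frequent suffix 'est' in 'newest' and 'widest'.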
In practice, we increase efficiency by indexing all pairs, and updating data structures incrementally.
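A hedged sketch of that incremental variant (the function name and data structures below are illustrative, not the paper's reference implementation): alongside the pair counts we keep an index from each pair to the words containing it, so a merge only revisits those words instead of recounting the whole vocabulary.

```python
import collections

def learn_bpe_incremental(vocab, num_merges):
    # vocab maps space-separated symbol strings to frequencies, as above.
    words = [(word.split(), freq) for word, freq in vocab.items()]
    stats = collections.defaultdict(int)   # pair -> total frequency
    index = collections.defaultdict(set)   # pair -> ids of words containing it
    for wid, (symbols, freq) in enumerate(words):
        for pair in zip(symbols, symbols[1:]):
            stats[pair] += freq
            index[pair].add(wid)
    merges = []
    for _ in range(num_merges):
        if not stats:
            break
        best = max(stats, key=stats.get)
        merges.append(best)
        for wid in list(index[best]):      # only words that contain 'best'
            symbols, freq = words[wid]
            # remove this word's old pair counts
            for pair in zip(symbols, symbols[1:]):
                stats[pair] -= freq
                if stats[pair] <= 0:
                    del stats[pair]
                index[pair].discard(wid)
            # merge every occurrence of 'best' inside the word
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            words[wid] = (merged, freq)
            # add the word's new pair counts
            for pair in zip(merged, merged[1:]):
                stats[pair] += freq
                index[pair].add(wid)
    return merges
```

On the toy vocabulary, learn_bpe_incremental(vocab, 10) yields the same merge operations as the loop above, up to tie-breaking among equally frequent pairs.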
The difference from other compression algorithms, such as Huffman encoding, is that our symbol sequences are still interpretable as subword units, and the network can generalize to translate and produce new words on the basis of these subword units.
At test time, we first split words into sequences of characters, then apply the operations learned by the minimal algorithm to merge the characters into larger, known symbols. In the example, the OOV 'lower' would be segmented into 'low er'.
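A minimal sketch of that test-time procedure, assuming 'merges' is the ordered list of merge operations recorded during learning (the function name is illustrative): each learned operation is simply replayed over the character sequence.

```python
def apply_bpe(word, merges):
    # Split the word into characters plus the end-of-word marker used above.
    symbols = list(word) + ['</w>']
    # Replay the learned merge operations in the order they were learned.
    for pair in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# e.g. apply_bpe('lower', merges) segments the OOV word into symbols
# that were all seen during BPE learning.
```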
We evaluate two methods for applying BPE: learning two independent encodings, one for the source and one for the target vocabulary, or learning a single encoding on the union of the two vocabularies (joint BPE). The former is more compact in terms of text and vocabulary size, and gives a stronger guarantee that each subword unit has been seen in the training text of the respective language, while the latter improves consistency between the source and target segmentation.
If we apply BPE independently, the same name might be segmented differently in the two languages, making it harder for the neural network to learn a mapping between the subword units.
For language pairs with different alphabets, we transliterate the Russian vocabulary into Latin characters to learn the joint BPE encoding, then transliterate the learned BPE merge operations back into Cyrillic.
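A hedged sketch of joint BPE, reusing get_stats and merge_vocab from the minimal algorithm above; 'source_vocab' and 'target_vocab' are illustrative dicts in the same space-separated-symbol format, and the transliteration step for differing alphabets is omitted.

```python
def learn_joint_bpe(source_vocab, target_vocab, num_merges):
    # Pool both vocabularies; a word occurring on both sides adds its frequencies.
    joint = {}
    for v in (source_vocab, target_vocab):
        for word, freq in v.items():
            joint[word] = joint.get(word, 0) + freq
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(joint)           # from the minimal algorithm above
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        joint = merge_vocab(best, joint)   # from the minimal algorithm above
        merges.append(best)
    return merges
```

The single merge list is then applied to both the source and the target text, which is what keeps the two segmentations consistent.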
We aim to answer two empirical questions:
- Can we improve the translation of rare and unseen words in NMT by representing them via subword units?
- Which segmentation into subword units performs best in terms of vocabulary size, text size and translation quality?
(see paper)
The English→Russian examples show that the subword systems are capable of transliteration. However, transliteration errors do occur, either due to ambiguous transliterations, or because of inconsistent segmentations between source and target text, which make it hard for the system to learn a transliteration mapping.
NMT systems are capable of open-vocabulary translation by representing rare and unseen words as sequences of subword units. This is simpler and more effective than using a back-off translation model.
We introduce a variant of byte pair encoding for word segmentation, which is capable of encoding open vocabularies with a compact symbol vocabulary of variable-length subword units.
OOV and rare in-vocabulary words are translated poorly by our baseline NMT system, and reducing the vocabulary size of subword models can improve performance. Future work: learning the optimal vocabulary size for a translation task, which we expect to depend on the language pair and the amount of training data.