Paper, Tags: #nlp
To the best of our knowledge, the literature doesn't contain a direct evaluation of the impact of tokenization on language model pretraining.
We analyze differences between BPE and unigram LM tokenization, and we find that the unigram LM method can recover subword units that more strongly align with underlying morphology, in addition to avoiding several shortcomings of BPE stemming from its greedy construction procedure.
Then we compare the fine-tuned task performance of identical transformer masked LMs pretrained with these tokenizations. We find that the unigram LM tokenization method consistently matches or outperforms BPE.
Critically, a pretrained language model’s subword vocabulary cannot be altered: any downstream application of these models must tokenize input or generate output using the original subword vocabulary, making the choice of tokenization a particularly significant decision.
In the unigram LM method, tokens are pruned from the vocabulary based on unigram LM perplexity. The vocabularies resulting from BPE and ULM are heavily overlapping, but we find that ULM produces more plausible subword units in long words.
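As a quick way to reproduce this kind of comparison, the SentencePiece library implements both methods. A minimal sketch, assuming a plain-text training corpus at the placeholder path `corpus.txt` and an illustrative vocabulary size; the actual segmentations will depend on the corpus:

```python
import sentencepiece as spm

# Train a BPE model and a unigram LM model on the same corpus with the same
# vocabulary size. "corpus.txt" is a placeholder path; 8000 is illustrative.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
ulm = spm.SentencePieceProcessor(model_file="tok_unigram.model")

# Compare how the two tokenizers segment a long, morphologically complex word.
word = "unquestionably"
print("BPE:       ", bpe.encode(word, out_type=str))
print("Unigram LM:", ulm.encode(word, out_type=str))
```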
The WordPiece algorithm closely resembles BPE, but instead of merging the most frequent token bigram, it scores each potential merge based on the likelihood of a LM trained on a version of the corpus incorporating that merge. Estimating the LM parameters for every potential merge is computationally very expensive, so the authors employ aggressive heuristics to reduce the number of merges considered. The implementation isn't public, so the method can't be directly compared to the others.
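Since the WordPiece trainer itself isn't public, the following is only a hypothetical sketch of the kind of likelihood-motivated scoring the description suggests: it ranks candidate merges by pointwise mutual information over a toy pre-tokenized corpus, as a stand-in for the actual criterion.

```python
import math
from collections import Counter

def score_merges(tokenized_corpus):
    """Rank candidate merges by pointwise mutual information, a rough
    stand-in for a likelihood-based merge criterion (the real WordPiece
    implementation is not public)."""
    unigrams = Counter(tok for seq in tokenized_corpus for tok in seq)
    bigrams = Counter(
        (seq[i], seq[i + 1])
        for seq in tokenized_corpus
        for i in range(len(seq) - 1)
    )
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {
        pair: math.log(
            (count / n_bi)
            / ((unigrams[pair[0]] / n_uni) * (unigrams[pair[1]] / n_uni))
        )
        for pair, count in bigrams.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# BPE would merge the most *frequent* pair; this instead favors pairs that
# co-occur more often than their unigram frequencies predict.
corpus = [["l", "o", "w", "e", "r"], ["l", "o", "w", "e", "s", "t"], ["n", "e", "w", "e", "r"]]
print(score_merges(corpus)[:3])
```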
The unigram LM method (Kudo, 2018), in contrast to the bottom-up construction process of BPE and WordPiece, begins with a superset of the final vocabulary, pruning it to the desired size (see algorithm in paper). Unigram LM tokenization takes the vocabulary V and unigram LM parameters θ and performs Viterbi inference to decode the segmentation with maximum likelihood under θ.
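A minimal sketch of this Viterbi decoding step, using a toy vocabulary of hypothetical subword log-probabilities in place of the learned V and θ:

```python
import math

def viterbi_segment(word, log_probs):
    """Return the segmentation of `word` with maximum likelihood under a
    unigram LM: best[i] holds the best log-prob of word[:i], back[i] the
    split point that achieved it."""
    n = len(word)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in log_probs and best[j] + log_probs[piece] > best[i]:
                best[i] = best[j] + log_probs[piece]
                back[i] = j
    # Recover the best segmentation by following the back-pointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Toy vocabulary with hypothetical log-probabilities.
theta = {"un": -2.0, "like": -2.5, "ly": -2.2, "unlike": -5.5,
         "u": -6.0, "n": -6.0, "l": -6.0, "i": -6.0, "k": -6.0, "e": -6.0, "y": -6.0}
print(viterbi_segment("unlikely", theta))  # (['un', 'like', 'ly'], -6.7)
```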
Recognizable affixes appear much more frequently in the ULM tokenization. Because the BPE tokenization is constructed greedily according to frequency, common affixes (and punctuation) are frequently absorbed into other tokens. The BPE vocabulary still includes these affixes, but when they are encountered during tokenization, they are almost always merged into larger units.
For example, when three tokens almost always occur as a group, BPE must first merge one pair before incorporating the third token; this leaves an intermediate token in the vocabulary that will only rarely occur on its own. The unigram LM method avoids this pathology by considering all potential subwords in a top-down fashion during vocabulary selection.
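A toy greedy merge loop makes the pathology concrete: on a corpus where "a b c" almost always co-occurs, the intermediate merge "ab" ends up in the vocabulary even though it rarely surfaces on its own. This is a sketch of the standard BPE merge procedure on made-up data, not the paper's code.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair.
    Each merge adds a new token to the vocabulary."""
    sequences = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(
            (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
        )
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the merge everywhere it occurs.
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, sequences

# "a", "b", "c" almost always appear together, but BPE must still create
# the intermediate token "ab" before it can reach "abc".
corpus = [["a", "b", "c"]] * 20 + [["a", "x"], ["b", "y"]]
merges, segmented = bpe_merges(corpus, 2)
print(merges)  # [('a', 'b'), ('ab', 'c')] -> "ab" lingers in the vocabulary
```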
The unigram LM tokenization tends to have longer subword units than BPE. This is closer to the length distribution of gold-standard English morphs, which have a mean length of approximately 6 characters.
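To check the length claim on one's own data, a quick sketch (assuming the two SentencePiece models trained in the earlier example and a placeholder sample file) compares mean piece lengths:

```python
import sentencepiece as spm

def mean_piece_length(model_file, lines):
    """Average character length of the subword pieces a model produces,
    ignoring SentencePiece's word-boundary marker (U+2581)."""
    sp = spm.SentencePieceProcessor(model_file=model_file)
    pieces = [p.lstrip("\u2581") for line in lines for p in sp.encode(line, out_type=str)]
    pieces = [p for p in pieces if p]
    return sum(len(p) for p in pieces) / len(pieces)

# "sample.txt" is a placeholder; the models are the ones trained above.
with open("sample.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
print("BPE mean length:       ", mean_piece_length("tok_bpe.model", lines))
print("Unigram LM mean length:", mean_piece_length("tok_unigram.model", lines))
```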