SentencePiece normalization #79

ZJaume · 2023-03-06T16:01:14Z

On the topic of what else SentencePiece does: it has some features that may replace the pre-post-process.sh scripts. By default it applies NFKC normalization, and this can be customized. The default normalization already covers part of what preprocess.sh does, for example:

echo "２" | spm_encode --model isen.student.base/vocab.spm

▁2
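To see why the full-width digit comes out as a plain "2", the same kind of mapping can be reproduced with Python's standard library. This is only a sketch of NFKC itself: as far as I know the SentencePiece default is its own NFKC-based rule table, so it may differ from `unicodedata` in some details. The sample strings are made up for illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (compatibility decomposition + composition)."""
    return unicodedata.normalize("NFKC", text)


# Illustrative samples only: full-width digit, full-width letters, a ligature.
samples = ["２", "ＡＢＣ", "ﬁ"]

for s in samples:
    # Show the original and normalized forms with their code points.
    before = " ".join(f"U+{ord(c):04X}" for c in s)
    after = " ".join(f"U+{ord(c):04X}" for c in nfkc(s))
    print(f"{s!r} ({before}) -> {nfkc(s)!r} ({after})")
```

If the preprocessing script only does this kind of width and compatibility folding, it is likely redundant with what the tokenizer already applies.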

If users need to add or change normalization rules, they can take one of the rule files from https://github.com/google/sentencepiece/tree/master/data, modify it, pass it to the spm_train step, and forget about separate preprocessing.
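A rough sketch of what modifying a rule file could look like. It assumes the TSV format used by the files in that data directory: space-separated hexadecimal code points for the source sequence, a tab, then the code points of the replacement. The helper name `rule_line`, the extra rules, and the file names in the comments are hypothetical and not taken from this repository.

```python
def rule_line(source: str, target: str) -> str:
    """Format one normalization rule as '<src hex code points>\\t<target hex code points>'."""
    src = " ".join(format(ord(c), "X") for c in source)
    tgt = " ".join(format(ord(c), "X") for c in target)
    return f"{src}\t{tgt}"


# Hypothetical extra rules: curly quotes to straight quotes, ellipsis to three dots.
extra_rules = [
    ("\u201c", '"'),
    ("\u201d", '"'),
    ("\u2026", "..."),
]

lines = [rule_line(src, tgt) for src, tgt in extra_rules]
for line in lines:
    print(line)

# Assumed workflow (file names and vocabulary size are placeholders):
#   1. copy one of the .tsv rule files from the SentencePiece data directory,
#   2. append the lines printed above to the copy, e.g. custom_rules.tsv,
#   3. train with the custom rules:
#        spm_train --input=corpus.txt --model_prefix=vocab \
#                  --vocab_size=32000 --normalization_rule_tsv=custom_rules.tsv
```

Since the rules are stored in the trained model, spm_encode would then apply them at inference time without a separate preprocessing step.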
