This page includes instructions for reproducing results from the paper Scaling Neural Machine Translation (Ott et al., 2018).
Model | Description | Dataset | Download |
---|---|---|---|
`transformer.wmt14.en-fr` | Transformer (Ott et al., 2018) | WMT14 English-French | model: download (.tar.bz2) <br> newstest2014: download (.tar.bz2) |
`transformer.wmt16.en-de` | Transformer (Ott et al., 2018) | WMT16 English-German | model: download (.tar.bz2) <br> newstest2014: download (.tar.bz2) |
First download the preprocessed WMT'16 En-De data provided by Google.
Then:
```bash
TEXT=wmt16_en_de_bpe32k
mkdir -p $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT

fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.tok.clean.bpe.32000 \
    --validpref $TEXT/newstest2013.tok.bpe.32000 \
    --testpref $TEXT/newstest2014.tok.bpe.32000 \
    --destdir data-bin/wmt16_en_de_bpe32k \
    --nwordssrc 32768 --nwordstgt 32768 \
    --joined-dictionary \
    --workers 20
```
Next, train a big Transformer model on this data:

```bash
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --fp16
```
Note that the `--fp16` flag requires that you have CUDA 9.1 or greater and a Volta GPU or newer.
If you want to train the above model with big batches (assuming your machine has 8 GPUs):

- add `--update-freq 16` to simulate training on 8x16=128 GPUs
- increase the learning rate; 0.001 works well for big batches
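Putting those two changes together, a big-batch variant of the training command above might look like the following sketch (same data, architecture, and remaining flags as before):

```bash
# Big-batch training on a single 8-GPU machine: gradient accumulation via
# --update-freq 16 simulates 8 x 16 = 128 GPUs, so each update sees roughly
# 3584 tokens/GPU x 8 GPUs x 16 accumulation steps ~ 459k tokens.
fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 16 \
    --fp16
```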
Once the model is trained, translate the test set with:

```bash
fairseq-generate \
    data-bin/wmt16_en_de_bpe32k \
    --path checkpoints/checkpoint_best.pt \
    --beam 4 --lenpen 0.6 --remove-bpe
```
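If you prefer to save the generation output and score it separately, `fairseq-score` can compute BLEU from extracted hypothesis and reference files. The snippet below is a sketch: the file names (`gen.out`, `gen.out.sys`, `gen.out.ref`) are placeholders, and it assumes the usual `fairseq-generate` log format where hypotheses appear on `H-` lines and references on `T-` lines:

```bash
# Save the generation log, then split out hypotheses (H- lines: id, score,
# text) and references (T- lines: id, text) before scoring.
fairseq-generate data-bin/wmt16_en_de_bpe32k \
    --path checkpoints/checkpoint_best.pt \
    --beam 4 --lenpen 0.6 --remove-bpe > gen.out
grep ^H gen.out | cut -f3- > gen.out.sys
grep ^T gen.out | cut -f2- > gen.out.ref
fairseq-score --sys gen.out.sys --ref gen.out.ref
```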
Please cite as:

```bibtex
@inproceedings{ott2018scaling,
  title     = {Scaling Neural Machine Translation},
  author    = {Ott, Myle and Edunov, Sergey and Grangier, David and Auli, Michael},
  booktitle = {Proceedings of the Third Conference on Machine Translation (WMT)},
  year      = {2018},
}
```