Below are some pointers to checkpoints for experimental models we have trained after writing our paper. We have found that these models can produce better performance in some cases. These checkpoints are not officially supported - use at your own risk!
The t5.1.1 checkpoints are similar to the models described in our paper, with the following improvements:
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see https://arxiv.org/abs/2002.05202 (a minimal sketch follows this list).
- Dropout was turned off during pre-training (quality win). Dropout should be re-enabled during fine-tuning.
- Pre-trained on C4 only, without mixing in the downstream tasks.
- No parameter sharing between the embedding and classifier layers.
- "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different - larger d_model and smaller num_heads and d_ff.
The checkpoints are located here:
- t5.1.1.small (~77 million parameters): gs://t5-data/pretrained_models/t5.1.1.small
- t5.1.1.base (~250 million parameters): gs://t5-data/pretrained_models/t5.1.1.base
- t5.1.1.large (~800 million parameters): gs://t5-data/pretrained_models/t5.1.1.large
- t5.1.1.xl (~3 billion parameters): gs://t5-data/pretrained_models/t5.1.1.xl
- t5.1.1.xxl (~11 billion parameters): gs://t5-data/pretrained_models/t5.1.1.xxl
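For example, assuming you have TensorFlow installed and read access to the public bucket (and noting that the exact filenames inside each directory may vary), you can peek at a model directory and its checkpoint variables like this:

```python
import tensorflow as tf

model_dir = "gs://t5-data/pretrained_models/t5.1.1.small"

# List whatever files are stored in the model directory.
print(tf.io.gfile.listdir(model_dir))

# If the directory contains a standard TensorFlow checkpoint state file,
# this resolves the most recent checkpoint prefix.
ckpt = tf.train.latest_checkpoint(model_dir)
if ckpt:
  reader = tf.train.load_checkpoint(ckpt)
  # Print the first few variable names and shapes.
  for name, shape in sorted(reader.get_variable_to_shape_map().items())[:10]:
    print(name, shape)
```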
These "LM adapted" models are initialized from t5.1.1 (above) and train for an additional 100K steps on the LM objective discussed in the T5 paper. This adaptation improves the ability of the model to be used for prompt tuning.
- t5.1.1.lm100k.small (~77 million parameters): gs://t5-data/pretrained_models/t5.1.1.lm100k.small
- t5.1.1.lm100k.base (~250 million parameters): gs://t5-data/pretrained_models/t5.1.1.lm100k.base
- t5.1.1.lm100k.large (~800 million parameters): gs://t5-data/pretrained_models/t5.1.1.lm100k.large
- t5.1.1.lm100k.xl (~3 billion parameters): gs://t5-data/pretrained_models/t5.1.1.lm100k.xl
- t5.1.1.lm100k.xxl (~11 billion parameters): gs://t5-data/pretrained_models/t5.1.1.lm100k.xxl
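Purely to illustrate that data format (this is an assumption-level sketch, not the actual training pipeline), a span of unlabeled text is split into a prefix that the encoder consumes and a continuation that the decoder must predict:

```python
def lm_objective_example(tokens, split_point):
  """Illustrative only: split a span of natural text into inputs and targets."""
  return {"inputs": tokens[:split_point], "targets": tokens[split_point:]}

print(lm_objective_example(["Thank", "you", "for", "inviting", "me"], 2))
# {'inputs': ['Thank', 'you'], 'targets': ['for', 'inviting', 'me']}
```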
Variation on the t5.1.1 models using talking-heads attention (https://arxiv.org/abs/2003.02436); a sketch of the mechanism follows the list below.
- t5.1.th.base (~250 million parameters): gs://t5-data/pretrained_models/t5.1.th.base
- t5.1.th.large (~800 million parameters): gs://t5-data/pretrained_models/t5.1.th.large
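As a rough, self-contained sketch of the idea (not the implementation behind these checkpoints), talking-heads attention adds learned linear projections across the heads dimension, applied to the attention logits before the softmax and to the attention weights after it:

```python
import tensorflow as tf

def talking_heads_attention(q, k, v, pre_softmax_proj, post_softmax_proj):
  """Sketch of talking-heads attention (https://arxiv.org/abs/2003.02436).

  q, k, v: [batch, num_heads, length, depth_per_head]
  pre_softmax_proj, post_softmax_proj: [num_heads, num_heads] matrices that
  mix information across heads. The usual logit scaling is omitted for brevity.
  """
  logits = tf.einsum("bhqd,bhkd->bhqk", q, k)
  logits = tf.einsum("bhqk,hi->biqk", logits, pre_softmax_proj)     # "talk" before softmax
  weights = tf.nn.softmax(logits, axis=-1)
  weights = tf.einsum("bhqk,hi->biqk", weights, post_softmax_proj)  # "talk" after softmax
  return tf.einsum("bhqk,bhkd->bhqd", weights, v)
```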
Variation on the t5.1.1 models. The encoder and decoder each consist of 14 layer groups, with the last ten twice as "wide" as the first four (double d_ff and num_heads); a sketch of the shape pattern follows the list below. Parameter count and computation are kept similar to the corresponding t5.1.1 models. For the base model, this increases the number of layers, resulting in better quality; for the large and xl models, it decreases the number of layers from 24 to 14, reducing quality but also reducing the amount of communication necessary for model parallelism.
- t5.1.n4w10.base (~250 million parameters): gs://t5-data/pretrained_models/t5.1.n4w10.base
- t5.1.n4w10.large (~800 million parameters): gs://t5-data/pretrained_models/t5.1.n4w10.large
- t5.1.n4w10.xl (~3 billion parameters): gs://t5-data/pretrained_models/t5.1.n4w10.xl
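Purely as an illustration of that shape pattern (the base values here are hypothetical, not taken from the released configs), each stack can be thought of as 4 "narrow" layer groups followed by 10 "wide" groups with doubled d_ff and num_heads:

```python
def n4w10_layer_shapes(base_d_ff, base_num_heads):
  """Illustrative (d_ff, num_heads) per layer group for the n4w10 variants."""
  narrow = [(base_d_ff, base_num_heads)] * 4
  wide = [(2 * base_d_ff, 2 * base_num_heads)] * 10
  return narrow + wide

# 14 layer groups total: 4 narrow followed by 10 twice-as-wide.
print(len(n4w10_layer_shapes(base_d_ff=2048, base_num_heads=12)))  # 14
```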