Here are a few topics:
In Natural Language Processing:
- Transformer Utilization
- RoBERTa Base Implementation
- Masked language modeling with DeBERTaV3
- Training any NLP Transformer with native PyTorch
- RoBERTa distillation classification
- Knowledge distillation from a large language model
- RoBERTa distillation
- Initializing a model with a different embedding size and layer count than the large model through teacher-student pretraining
- RoBERTa (from Facebook)
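Several items above mention knowledge distillation from a large model into a smaller student. As a minimal sketch of the core idea, here is the standard soft-target distillation loss (temperature-scaled softmax plus KL divergence), written in plain Python for illustration; the function names and the temperature value are illustrative assumptions, not taken from the list above.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a higher temperature softens the
    # distribution, exposing the teacher's "dark knowledge".
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between teacher and student soft targets,
    # scaled by T^2 as in the standard distillation objective.
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# Identical logits give zero loss; diverging logits give a positive loss.
same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
diff = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

In practice this term is combined with the usual cross-entropy on the hard labels, and the logits come from a teacher such as a RoBERTa base model and a smaller student; in a real training loop the same formula is expressed with tensor operations rather than Python lists.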