Paper, Tags: #nlp, #architectures
This paper presents a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction together with a proposed n-stream self-attention mechanism.
Instead of optimizing one-step-ahead prediction as in the traditional seq2seq model, ProphetNet predicts the next n tokens simultaneously based on previous context tokens at each time step. This encourages the model to plan for future tokens and prevents overfitting on strong local correlations. The model is pre-trained on a base-scale dataset (16GB) and a large-scale dataset (160GB).
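A minimal PyTorch-style sketch of what such a future n-gram objective could look like, assuming the decoder exposes one logit tensor per predicting stream; the names `future_ngram_loss`, `stream_logits`, and `alphas` are illustrative, not the paper's actual API:

```python
import torch
import torch.nn.functional as F


def future_ngram_loss(stream_logits, targets, alphas):
    """Weighted sum of cross-entropy losses, one per future offset.

    stream_logits: list of n tensors (batch, seq_len, vocab); the i-th tensor
                   predicts the token i steps further ahead than the main stream.
    targets:       (batch, seq_len) gold token ids.
    alphas:        list of n weights balancing the predicting streams.
    """
    batch, seq_len = targets.shape
    loss = 0.0
    for i, (logits, alpha) in enumerate(zip(stream_logits, alphas)):
        # Shift targets i steps into the future; the last i positions have
        # no future token and are simply dropped.
        shifted = targets[:, i:]                # (batch, seq_len - i)
        preds = logits[:, : seq_len - i, :]     # align logits with shifted targets
        loss = loss + alpha * F.cross_entropy(
            preds.reshape(-1, preds.size(-1)), shifted.reshape(-1)
        )
    return loss


if __name__ == "__main__":
    # Toy example with n = 2 predicting streams.
    batch, seq_len, vocab, n = 2, 5, 100, 2
    logits = [torch.randn(batch, seq_len, vocab) for _ in range(n)]
    targets = torch.randint(0, vocab, (batch, seq_len))
    print(future_ngram_loss(logits, targets, alphas=[1.0, 0.5]))
```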
ProphetNet is based on the Transformer seq2seq architecture: a multi-layer Transformer encoder with the standard multi-head self-attention mechanism, and a multi-layer Transformer decoder with the proposed multi-head n-stream self-attention mechanism.
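A rough, simplified sketch of how the main and predicting streams might interact inside one decoder layer. It assumes the attention projections are shared across streams and that each predicting stream simply queries the main stream's hidden states; the class `NStreamSelfAttention` and its signature are hypothetical simplifications, not the paper's implementation:

```python
import torch
import torch.nn as nn


class NStreamSelfAttention(nn.Module):
    """Simplified n-stream self-attention for a single decoder layer."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        # One shared attention module for all streams (assumption of this sketch).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, main, predicting, causal_mask):
        # Main stream: standard masked (causal) self-attention.
        main_out, _ = self.attn(main, main, main, attn_mask=causal_mask)
        # Each predicting stream queries the main-stream states it is allowed
        # to see; its outputs are later used to predict a further-ahead token.
        pred_out = []
        for stream in predicting:
            out, _ = self.attn(stream, main, main, attn_mask=causal_mask)
            pred_out.append(out)
        return main_out, pred_out


if __name__ == "__main__":
    batch, seq_len, d_model, n_heads, n = 2, 5, 16, 4, 2
    layer = NStreamSelfAttention(d_model, n_heads)
    main = torch.randn(batch, seq_len, d_model)
    streams = [torch.randn(batch, seq_len, d_model) for _ in range(n)]
    # Boolean causal mask: True marks positions that may not be attended to.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    main_out, pred_out = layer(main, streams, mask)
```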
ProphetNet achieves the best performance on both abstractive summarization and question generation tasks, reaching new state-of-the-art results on CNN/DailyMail and Gigaword while using only about 1/3 of the training epochs of the previous model.