wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, Baevski et al., 2020

Paper, Tags: #audio

Because obtaining large amounts of labeled training data is often infeasible, NLP systems commonly pretrain with self-supervised learning on unlabeled data and then fine-tune on labeled data.

In this paper we present wav2vec 2.0, a framework for self-supervised learning of representations from raw audio data. We encode audio via a multi-layer CNN and then mask spans of the resulting latent speech representations, similar to masked language modeling. The masked latents are fed to a Transformer that builds contextualized representations, and the model is trained with a contrastive task: identifying the true quantized latent (a discrete speech unit) for each masked position. We then fine-tune on transcribed speech.
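To make the pipeline concrete, here is a minimal PyTorch sketch of the pretraining step. It is an illustration under simplifying assumptions, not the paper's implementation: the layer sizes, masking parameters (`mask_prob`, `mask_span`), and the nearest-neighbor codebook quantizer are stand-ins (the paper uses Gumbel-softmax product quantization and a sampled-distractor contrastive objective).

```python
# Illustrative sketch of wav2vec 2.0-style pretraining: CNN encoder ->
# span masking -> Transformer context network -> contrastive loss over
# quantized targets. All hyperparameters here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Wav2Vec2Sketch(nn.Module):
    def __init__(self, dim=256, vocab=64, mask_prob=0.065, mask_span=10):
        super().__init__()
        # Multi-layer CNN feature encoder: raw waveform -> latent speech reps
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )
        self.mask_emb = nn.Parameter(torch.randn(dim))  # learned mask vector
        # Transformer context network over the partially masked latents
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=4)
        # Toy codebook of discrete units used as contrastive targets
        self.codebook = nn.Parameter(torch.randn(vocab, dim))
        self.mask_prob, self.mask_span = mask_prob, mask_span

    def forward(self, wav):                               # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1)).transpose(1, 2)  # (B, T, dim)
        B, T, _ = z.shape
        # Sample span starts, then mask contiguous spans of latent frames
        mask = torch.zeros(B, T, dtype=torch.bool)
        starts = torch.rand(B, T) < self.mask_prob
        for offset in range(self.mask_span):
            shifted = starts.roll(offset, dims=1)
            shifted[:, :offset] = False                   # undo wrap-around
            mask |= shifted
        # Quantize latents to discrete target indices (nearest codebook entry;
        # a simplification of the paper's Gumbel-softmax quantizer)
        with torch.no_grad():
            dists = torch.cdist(z, self.codebook.unsqueeze(0).expand(B, -1, -1))
            q_idx = dists.argmin(-1)                      # (B, T)
        # Replace masked latents with the learned mask embedding
        z_masked = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(z), z)
        c = self.context(z_masked)                        # contextualized reps
        # Contrastive loss at masked positions: pick the true quantized unit
        # among all codebook entries (instead of sampled distractors)
        logits = F.normalize(c, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        return F.cross_entropy(logits[mask], q_idx[mask])

loss = Wav2Vec2Sketch()(torch.randn(2, 16000))  # one second of random "audio"
loss.backward()
```

After pretraining on unlabeled audio this way, the fine-tuning stage would add a linear projection over the contextualized representations and train with a CTC loss on transcribed speech.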

Our system outperforms the best semi-supervised methods while being conceptually simpler. The results demonstrate the feasibility of ultra-low-resource speech recognition: accurate models can be built from very small amounts of annotated data, which is promising for low-resource languages.