wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, Baevski et al., 2020
Paper, Tags: #audio
Obtaining large amounts of labelled speech for training is often infeasible. In NLP, the standard remedy is to pretrain with self-supervised learning on unlabelled data and then fine-tune on labelled data; this paper applies the same recipe to speech.
In this paper we present wav2vec 2.0, a framework for self-supervised learning of representations from raw audio. We encode the audio with a multi-layer CNN and then mask spans of the resulting latent speech representations, similar to masked language modeling. The masked latents are fed to a Transformer that builds contextualized representations, and training solves a contrastive task over a quantization of the latents into discrete speech units, learned jointly with the model. The pretrained model is then fine-tuned on transcribed speech.
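A minimal sketch of this pipeline (my own simplification in PyTorch, not the authors' code): a CNN feature encoder over the raw waveform, span masking of the latent frames, a Transformer context network, and a contrastive loss against quantized latent targets. The dimensions, layer counts, and the simple nearest-neighbour codebook are placeholder assumptions; the paper uses a Gumbel-softmax product quantizer plus a diversity loss.

```python
# Toy wav2vec 2.0-style pretraining step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyWav2Vec2(nn.Module):
    def __init__(self, dim=256, codebook_size=64):
        super().__init__()
        # Feature encoder: strided 1-D convolutions over the raw waveform.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )
        # Learned embedding that replaces masked latent frames.
        self.mask_emb = nn.Parameter(torch.randn(dim))
        # Context network: a small Transformer over the latent frames.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        # Toy codebook standing in for the paper's Gumbel-softmax quantizer.
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def forward(self, waveform, mask_prob=0.5):
        # waveform: (batch, samples) -> latent frames z: (batch, frames, dim)
        z = self.encoder(waveform.unsqueeze(1)).transpose(1, 2)
        # Quantize latents to get the contrastive targets q (nearest codeword).
        dists = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        q = self.codebook[dists.argmin(dim=-1)]                      # (b, t, dim)
        # Mask a subset of time steps before the Transformer, like masked LM.
        mask = torch.rand(z.shape[:2], device=z.device) < mask_prob
        z_masked = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(z), z)
        c = self.context(z_masked)                                   # contextualized reps
        # Contrastive loss: for each masked step, the true quantized latent must
        # score higher than quantized latents from other time steps (distractors).
        sim = F.cosine_similarity(c.unsqueeze(2), q.unsqueeze(1), dim=-1)  # (b, t, t)
        targets = torch.arange(z.size(1), device=z.device).expand(z.size(0), -1)
        return F.cross_entropy(sim[mask], targets[mask])


if __name__ == "__main__":
    model = TinyWav2Vec2()
    audio = torch.randn(2, 16000)  # two one-second clips at 16 kHz
    print(model(audio))            # scalar pretraining loss
```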
Our system outperforms the best semi-supervised methods while being conceptually simpler. The results demonstrate the feasibility of ultra-low-resource speech recognition: models reach good accuracy with very small amounts of annotated data, which could be valuable for low-resource languages.
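As a usage note (not from the paper): checkpoints pretrained and fine-tuned this way are publicly available, e.g. through the Hugging Face transformers library. The snippet below assumes the facebook/wav2vec2-base-960h checkpoint (wav2vec 2.0 base fine-tuned with CTC on 960h of Librispeech) and shows how such a fine-tuned model transcribes 16 kHz audio.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# One second of random noise as a stand-in; a real 16 kHz recording would go here.
audio = torch.randn(16000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, frames, vocab)

pred_ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
print(processor.batch_decode(pred_ids))
```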