
Improved Vocoder through end-to-end Training

@Flux9665 Flux9665 released this 22 Sep 14:41
· 46 commits to MassiveScaleToucan since this release

This is a minor release with few changes, but the impact on audio quality is large. Previously, we trained the vocoder on spectrograms extracted from ground-truth audio files. Despite our best efforts to make the synthetic spectrograms that the vocoder receives at inference time as natural as possible, a gap always remained between the synthetic spectrograms and the genuine ones. We previously thought that the impact of this gap was negligible, but experiments have now shown that it makes a big difference. The vocoder is the last piece of the pipeline: it is what actually connects the generated spectrograms to your ears. So using a bad vocoder is a bit like putting cheap tires on an expensive sports car.

We have finally fixed this by generating synthetic spectrograms and training the vocoder on them. We also increased the capacity of the vocoder by giving it more parameters. Inference is still just as fast, but the quality is greatly improved. This release is part of the cleanup towards finishing my PhD thesis and is not associated with its own paper or with a proper evaluation beyond internal pilot studies.
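
To make the change in training data concrete, here is a minimal sketch of how the vocoder's training pairs could be built in both setups. It is not the code from this repository: `build_vocoder_pairs`, `toy_extract` and `toy_synthesize` are hypothetical names, the toy functions only stand in for a real feature extractor and acoustic model, and passing the reference spectrogram to the synthesis step so that the output stays frame-aligned with the audio is an assumption, not something these notes specify.

```python
from typing import Callable, List, Sequence, Tuple

Spectrogram = List[List[float]]  # frames x frequency bins
Waveform = List[float]


def build_vocoder_pairs(
    dataset: Sequence[Tuple[str, Waveform]],
    extract_spectrogram: Callable[[Waveform], Spectrogram],
    synthesize_spectrogram: Callable[[str, Spectrogram], Spectrogram],
    use_synthetic: bool = True,
) -> List[Tuple[Spectrogram, Waveform]]:
    """Return (vocoder input, target waveform) pairs.

    The target is always the ground-truth waveform. Only the input changes:
    a spectrogram extracted from the audio (old setup) or one produced by
    the acoustic model (new setup).
    """
    pairs = []
    for text, wave in dataset:
        reference = extract_spectrogram(wave)
        if use_synthetic:
            # Assumption: the acoustic model is guided by the reference so
            # that its output has the same number of frames as the audio.
            spec = synthesize_spectrogram(text, reference)
        else:
            spec = reference
        pairs.append((spec, wave))
    return pairs


def toy_extract(wave: Waveform, hop: int = 2) -> Spectrogram:
    """Hypothetical stand-in: one mean-magnitude bin per frame."""
    return [
        [sum(abs(s) for s in wave[i:i + hop]) / hop]
        for i in range(0, len(wave) - hop + 1, hop)
    ]


def toy_synthesize(text: str, reference: Spectrogram) -> Spectrogram:
    """Hypothetical stand-in: an over-smoothed copy of the reference."""
    mean = sum(frame[0] for frame in reference) / len(reference)
    return [[mean] for _ in reference]


if __name__ == "__main__":
    data = [("hello", [0.0, 1.0, -1.0, 0.5])]
    old = build_vocoder_pairs(data, toy_extract, toy_synthesize, use_synthetic=False)
    new = build_vocoder_pairs(data, toy_extract, toy_synthesize, use_synthetic=True)
    print("old input:", old[0][0])
    print("new input:", new[0][0])
```

In both setups the vocoder learns to reproduce the same ground-truth audio. The difference is that in the new setup it learns to do so from the kind of imperfect input it will actually receive at inference time.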

Update your code and run the model downloader script for an immediate major quality boost.

Full Changelog: v3.1...v3.1.1