Releases: DigitalPhonetics/IMS-Toucan
GUI for precise control
This release includes a new GUI that allows you to control exactly how an utterance sounds.
You can generate a bunch of different realizations until you get one that you like. Then you can modify it further by dragging around the pitch values and the durations of individual phones. You can also exchange the voice for a different one while keeping your changes to the intonation and duration exactly as they are. And of course you can do so in over 7000 languages.
Just install the updated requirements and run the run_advanced_GUI_demo.py script. By default it will load the pretrained models from Hugging Face🤗, but you can also specify your own.
Improved Vocoder through end-to-end Training
This is a minor release that does not change much, but the impact on audio quality is substantial. Previously, we trained the vocoder on spectrograms extracted from ground-truth audio files. Despite our best efforts to make the synthetic spectrograms as natural as possible, a gap always remained between the synthetic spectrograms and the genuine ones. We previously thought that the impact of this gap was small and unimportant, but our experiments showed that it actually makes a big difference. The vocoder is the last piece of the pipeline: it is what actually connects the generated spectrograms to your ears. So using a bad vocoder is a bit like putting cheap tires on an expensive sports car.
We have finally fixed this by generating synthetic spectrograms and training the vocoder on those. We also increased the capacity of the vocoder by giving it more parameters. The inference speed is still just as fast, but the quality is greatly improved. This release is part of the cleanup towards finishing my PhD thesis and is not associated with its own paper or a proper evaluation beyond internal pilot studies.
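Conceptually, the change to the vocoder training looks something like the sketch below. All names are placeholders rather than the toolkit's actual code, and it assumes the TTS is run with ground-truth durations so that its output stays time-aligned with the recording:

```python
# Sketch only: fine-tune the vocoder on spectrograms produced by the (frozen) TTS model
# instead of spectrograms extracted from ground-truth audio, so the vocoder sees the same
# kind of input during training that it will see at inference time.
import torch

tts.eval()                                    # the acoustic model stays frozen
optimizer = torch.optim.AdamW(vocoder.parameters(), lr=1e-4)

for text, gt_waveform in dataloader:
    with torch.no_grad():
        synthetic_spec = tts(text)            # synthetic spectrogram, not one extracted from audio
    fake_waveform = vocoder(synthetic_spec)
    loss = reconstruction_loss(fake_waveform, gt_waveform)   # plus the usual adversarial terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```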
Update your code and run the model downloader script for an immediate major quality boost.
Full Changelog: v3.1...v3.1.1
Improved TTS in 7000 Languages
What's Changed
This release provides new checkpoints and adds improvements that did not make it into the previous release due to time constraints. For more information on the universal TTS model for 7000 languages, please refer to the previous release, v3.0.
- Prosody prediction of pitch, energy, and durations is now stochastic and samples from a distribution instead of assuming a one-to-one mapping (see the sketch after this list).
- Added support for more IPA modifiers to cover more languages
- Added more languages to the pretraining
- Overhauled language similarity prediction modules and visualization
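As a toy illustration of the stochastic prosody prediction mentioned above: the predictor now parameterizes a distribution per phone and samples from it, so synthesizing the same text twice yields two different realizations. The Gaussian sampler below is a simplified stand-in for the toolkit's actual stochastic predictors:

```python
import torch
import torch.nn as nn

class StochasticPitchPredictor(nn.Module):
    """Simplified stand-in: predicts a mean and standard deviation per phone and samples."""

    def __init__(self, hidden_dim=192):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, 1)
        self.log_std = nn.Linear(hidden_dim, 1)

    def forward(self, phone_encodings, temperature=1.0):
        mu = self.mean(phone_encodings)
        sigma = self.log_std(phone_encodings).exp() * temperature
        return mu + sigma * torch.randn_like(mu)   # sample instead of returning the mean

predictor = StochasticPitchPredictor()
phones = torch.randn(1, 20, 192)                   # 20 phones with 192-dimensional encodings
print(predictor(phones).shape)                     # two calls give two different pitch curves
```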
Full Changelog: v3.0...v3.1
TTS in 7000 Languages
This release extends the toolkit's functionality and provides new checkpoints.
- We improved the overall TTS quality, with further enhancements already on their way
- Watermarking was added to prevent misuse
- We extended support to almost all languages in the ISO 639-3 standard (that's over 7000 languages!)
- With a few clever designs, we were able to extrapolate from a pretrained checkpoint covering 462 languages to a checkpoint that can speak every language for which we now support a text frontend (see the sketch after this list)!
- Lots of simplifications and quality of life changes.
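The core idea behind the extrapolation is that a language unseen during training gets an approximate language embedding built from the embeddings of closely related languages that were seen. The snippet below is only a rough illustration with made-up values and placeholder names; the exact procedure is described in the paper:

```python
import torch

# language embeddings of languages that were seen during training (random stand-ins here)
lang_embs = {"deu": torch.randn(16), "nld": torch.randn(16), "swe": torch.randn(16)}

def approximate_embedding(neighbors, distances):
    """Distance-weighted average of the embeddings of the closest seen languages."""
    weights = torch.softmax(-torch.tensor(distances), dim=0)
    return sum(w * lang_embs[lang] for w, lang in zip(weights, neighbors))

# an unseen language is approximated from its nearest seen relatives
unseen_emb = approximate_embedding(neighbors=["deu", "nld"], distances=[0.2, 0.35])
print(unseen_emb.shape)   # a regular language embedding the model can be conditioned on
```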
This is the outcome of a collaboration with colleagues from the University of Groningen and the Fraunhofer IIS in Erlangen. Together with our group from the University of Stuttgart, we have built this model, which is the first of its kind.
We will present this work at Interspeech 2024. The full list of authors is Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël Habets, and Thang Vu.
Paper: https://arxiv.org/abs/2406.06403
Dataset: https://huggingface.co/datasets/Flux9665/BibleMMS
Interactive Demo: https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS
Static Demo: https://anondemos.github.io/MMDemo/
Prompting Controlled Emotional TTS
In this release you can condition your TTS model on emotional prompts during training and transfer the emotion in any prompt to synthesized speech during inference.
Demo samples are available at https://anondemos.github.io/Prompting/
A demo space is available at https://huggingface.co/spaces/Thommy96/promptingtoucan
Using pretrained models:
You can use the pretrained models for inference by simply providing an instance of the sentence embedding extractor, a speaker id and a prompt (see run_sent_emb_test_suite.py).
Training your own model:
You will need to extract a number of prompts and their sentence embeddings for all emotion categories which you want to include during training (see e.g. extract_yelp_sent_embs.py).
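For instance, the extraction step could look roughly like this; the embedding model below is a generic stand-in rather than necessarily the extractor used in the toolkit, and the prompts and file name are just examples (see extract_yelp_sent_embs.py for the actual script):

```python
import torch
from sentence_transformers import SentenceTransformer

prompts = {
    "joy":     ["What a wonderful surprise, I love it!"],
    "anger":   ["This is absolutely unacceptable."],
    "sadness": ["I really miss how things used to be."],
}

extractor = SentenceTransformer("all-mpnet-base-v2")   # stand-in sentence embedding extractor
sent_embs = {emotion: [torch.tensor(extractor.encode(prompt)) for prompt in prompt_list]
             for emotion, prompt_list in prompts.items()}

torch.save(sent_embs, "yelp_sent_embs.pt")             # loaded later by the finetuning pipeline
```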
Then, in your training pipeline, you need to load these sentence embeddings and pass them to the train loop. You should also provide the dimensionality of the embeddings when instantiating the TTS model and set static_speaker_embedding=True (see TrainingInterfaces\TrainingPipelines\ToucanTTS_Sent_Finetuning.py). Depending on how many speakers there are in the datasets you use for training, you need to adapt the dimensionality of the speaker embedding table in the TTS model. Finally, check whether the datasets you use are covered by the functions that extract the emotion and speaker id from the file path (Utility\utils.py).
ChallengeDataContribution
v2.asvspoof: fix popping noise and incorrect path in downloader
ToucanTTS
We pack a bunch of designs into a new architecture, which will be the basis for our multilingual and low-resource research going forward. We call it ToucanTTS and, as usual, provide pretrained models. The synthesis quality is very good, and training is very stable: it requires few datapoints when training from scratch and even fewer for finetuning. These properties are hard to quantify, so it's probably best to try it out yourself.
We also offer the option to use a BigVGAN vocoder, which sounds very nice but is a bit slow on CPU. If you run on GPU, the new vocoder is definitely recommended.
Blizzard Challenge 2023
Our submission to the Blizzard Challenge 2023
Improved Controllable Multilingual
This release extends the toolkit's functionality and provides new checkpoints.
- new sampling rate for the vocoder: Using 24kHz instead of 48kHz lowers the theoretical upper bound for quality, but produces fewer artifacts in practice.
- the flow-based PostNet from PortaSpeech is included in the new TTS model, which brings cleaner results at basically no extra cost
- new controllability options through artificial speaker generation in a lower-dimensional space with a better embedding function (see the sketch after this list)
- quality-of-life changes, such as an integrated finetuning example, an arbiter that selects which train loop to use, and vocoder finetuning (although that should really not be necessary)
- diverse bugfixes and speed increases
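As a conceptual sketch of the artificial speaker generation mentioned above (not the exact method used in the toolkit): new speaker embeddings can be sampled in a low-dimensional space learned from real speaker embeddings and then mapped back to full embeddings that the TTS can be conditioned on.

```python
import torch

real_speaker_embs = torch.randn(500, 192)                # stand-in for embeddings of real speakers

# learn a low-dimensional basis from the real embeddings (PCA via torch.pca_lowrank)
mean = real_speaker_embs.mean(dim=0)
_, s, v = torch.pca_lowrank(real_speaker_embs, q=16)
stds = s / (real_speaker_embs.shape[0] - 1) ** 0.5       # per-component standard deviations

# sample a point in the reduced space and project it back to a full speaker embedding
z = torch.randn(16) * stds
artificial_speaker = mean + v @ z                        # condition the TTS on this embedding
print(artificial_speaker.shape)
```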
This release breaks backwards compatibility; please download the new models or stick to a prior release if you rely on your old models.
Future releases will include one more change to the vocoder (a BigVGAN generator) and lots of changes to scale up the multilingual capabilities of a single model.
Controllable Speakers
This release extends the toolkit's functionality and provides new checkpoints.
- self-contained embeddings: we no longer use an external embedding model for TTS conditioning. Instead, we train one that is specifically tailored to this use.
- new vocoder: Avocodo replaces HiFi-GAN
- new controllability options through artificial speaker generation
- quality of life changes, such as weights&biases integration, a graphic demo script and automated model downloading
- diverse bugfixes and speed increases
This release breaks backwards compatibility; please download the new models or stick to a prior release if you rely on your old models.