VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Visit our demo for audio samples.

We also provide the pretrained models.

Update note: Thanks to Rishikesh (ऋषिकेश), our interactive TTS demo is now available as a Colab notebook.

[Figures: VITS at training; VITS at inference]

Pre-requisites

1. Python >= 3.6
2. Clone this repository.
3. Install the Python requirements; see requirements.txt.
   1. You may need to install espeak first: apt-get install espeak
4. Download the datasets.
   1. Download and extract the LJ Speech dataset, then rename the dataset folder or create a link to it: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
   2. For the multi-speaker setting, download and extract the VCTK dataset and downsample the wav files to 22050 Hz. Then rename the dataset folder or create a link to it: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
5. Build Monotonic Alignment Search, and run preprocessing if you use your own datasets.

# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK are already provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
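Before preprocessing your own data, it can help to confirm that every audio path in a filelist resolves through the DUMMY1 or DUMMY2 link. The sketch below is illustrative and not part of this repository. It assumes each filelist row is pipe-separated, with the audio path in the first column and the transcript in the column given by --text_index (1 for the LJ Speech lists, 2 for the VCTK lists, as in the commands above). The names parse_filelist_line and missing_audio are hypothetical.

```python
from pathlib import Path


def parse_filelist_line(line, text_index=1):
    """Split one pipe-separated row into (audio_path, text).

    Assumed layout: audio path in column 0, transcript in column text_index.
    """
    fields = line.rstrip("\n").split("|")
    return fields[0], fields[text_index]


def missing_audio(lines, root=".", text_index=1):
    """Return the audio paths from `lines` that do not exist under `root`."""
    missing = []
    for line in lines:
        if not line.strip():
            continue  # skip blank rows
        audio_path, _ = parse_filelist_line(line, text_index)
        if not (Path(root) / audio_path).exists():
            missing.append(audio_path)
    return missing


if __name__ == "__main__":
    # Run from the repository root; does nothing if the filelist is absent.
    filelist = Path("filelists/ljs_audio_text_val_filelist.txt")
    if filelist.exists():
        with filelist.open(encoding="utf-8") as f:
            print(missing_audio(f, root="."))
```

An empty list means every referenced file was found. A long list usually means the dataset link was created in a different directory from the one you are running in.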

Training Example

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base

Inference Example

See inference.ipynb

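Additional Notes

Project Features

Supports Windows and Linux; training and inference work on both platforms.
Compatible with the latest versions of its dependencies.
Instructions for the special environment setup and steps required on Windows.
Supports Chinese and English.
Adds a simple object-oriented inference script.
A simple Colab notebook showing the steps for training and inference with this project.
A simple Colab notebook showing how to use pretrained weights for transfer learning (fine-tuning).
Several preprocessed audio datasets, provided for learning and experimentation.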

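Environment Setup on Windows

Installing the GPU Version of PyTorch

On Windows, pip install -r requirements.txt installs the CPU version of PyTorch, so you need to pick and run the appropriate GPU-version install command from the PyTorch website. The command below is for reference only: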

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
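After installation, a quick check can confirm whether the GPU build is the one in use. This is a minimal sketch, not part of the repository. describe_torch is a hypothetical helper that relies only on torch.version.cuda, torch.cuda.is_available(), and torch.cuda.get_device_name().

```python
def describe_torch():
    """Return a one-line summary of the installed PyTorch build."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    cuda_build = torch.version.cuda  # None when the build has no CUDA support
    if cuda_build is None:
        return f"PyTorch {torch.__version__} was built without CUDA"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} (CUDA {cuda_build}) found no usable GPU"
    device = torch.cuda.get_device_name(0)
    return f"PyTorch {torch.__version__} (CUDA {cuda_build}) on {device}"


print(describe_torch())
```

If the summary reports a build without CUDA, the pip-installed CPU wheel is probably still the active one, and the GPU install command needs to be run in the same environment.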

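Configuring eSpeak

To train or run inference in English on Windows, you need to install the eSpeak NG library. Get it from the eSpeak NG download page; the .msi installer is recommended.
After installing eSpeak NG, add the environment variable PHONEMIZER_ESPEAK_LIBRARY and set its value to {INSTALLDIR}\libespeak-ng.dll.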
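If you would rather not change the system-wide settings, the same variable can be set for a single Python process before the phonemizer backend is first used. The sketch below is an assumption-laden illustration: set_espeak_library is a hypothetical helper, and the install directory shown is only an example, so substitute the directory the installer actually used.

```python
import ntpath
import os


def set_espeak_library(install_dir):
    """Point PHONEMIZER_ESPEAK_LIBRARY at libespeak-ng.dll in install_dir."""
    # ntpath joins with Windows separators regardless of the host platform.
    dll_path = ntpath.join(install_dir, "libespeak-ng.dll")
    os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = dll_path
    return dll_path


# Hypothetical install location; replace with your own {INSTALLDIR}.
print(set_espeak_library(r"C:\Program Files\eSpeak NG"))
```

A value set this way lasts only for the current process, so it has to run at the top of the training or inference script.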

Building the Monotonic Alignment Search Extension Module

Download and install Visual Studio first, from its download page.

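Datasets

Data Baker (标贝) Chinese Standard Female Voice Corpus (processed), 16-bit PCM WAV, 22050 Hz	Link: https://pan.baidu.com/s/1oihti9-aoJ447l54kdjChQ Extraction code: vits
LJSpeech dataset, 16-bit PCM WAV, 22050 Hz	Link: https://pan.baidu.com/s/1q2A38znFmxn3zCn587ZKkw Extraction code: vits
Data Baker Chinese Standard Female Voice Corpus official site	https://www.data-baker.com/data/index/TNtts/
LJSpeech dataset official site	https://keithito.com/LJ-Speech-Dataset/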
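When preparing your own recordings, it is worth checking that they match the 16-bit PCM, 22050 Hz format listed above. The following sketch uses only Python's standard wave module. The helper names wav_format and is_expected_format are hypothetical and not part of this repository.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, bits_per_sample, channels) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def is_expected_format(path, sample_rate=22050, bits=16):
    """True if the file has the given sample rate and bit depth."""
    rate, width, _channels = wav_format(path)
    return rate == sample_rate and width == bits
```

Files that fail the check need resampling or conversion before they are added to a filelist. Note that the wave module reads uncompressed PCM only, so other encodings raise wave.Error rather than returning False.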

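Pretrained Weights

Pretrained weights for the Data Baker (标贝) Chinese Standard Female Voice Corpus

Link: https://pan.baidu.com/s/1pN-wL_5wB9gYMAr2Mh7Jvg
Extraction code: vits

Note: each set of pretrained weights includes the generator weights (file names starting with G), the discriminator weights (file names starting with D), and the cleaners and symbols used during training, for compatibility with code or tools from other VITS repositories.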
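To resume from or fine-tune on downloaded weights, you typically need the most recent generator and discriminator files. The sketch below assumes checkpoints are named like G_<step>.pth and D_<step>.pth. The note above only states the G and D prefixes, so the step suffix is an assumption, and latest_checkpoint is a hypothetical helper.

```python
import re
from pathlib import Path


def latest_checkpoint(directory, prefix="G"):
    """Return the '<prefix>_<step>.pth' file with the highest step, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)\.pth$")
    best_step, best_path = -1, None
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Comparing step numbers as integers avoids the trap where a plain string sort places G_9000.pth after G_10000.pth.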

Showcase

Gallery

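References and Acknowledgments

VITS speech synthesis GitHub repositories by other developers

Reference links on Bilibili

[Putting voice actors out of work] Near-perfect Chinese speech synthesis for Paimon based on the VITS neural network model:
https://www.bilibili.com/video/BV1rB4y157fd
[Genshin Impact] Paimon's VTuber debut plan: Paimon speech synthesis and avatar puppeteering based on AI deep learning with VITS and VSeeFace:
https://www.bilibili.com/video/BV16G4y1B7Ey
[Deep learning] Speech synthesis based on VITS:
https://www.bilibili.com/video/BV1Fe4y1r737
Model training for complete beginners: VITS supplement:
https://www.bilibili.com/read/cv18357171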

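Further Notes from Sieroy

Updates

At runtime I ran into some of the changes in librosa 0.10 (ugh) and had to modify a few things to adapt.
Prepared a config file and filelists for our little angel Femirins (菲米莉丝), and obtained some voice data for fine-tuning by recording in-game audio, splitting it manually in Adobe Audition (Au), and formatting it.

Other Notes

This repo was forked from rotten-work/vits-mandarin-windows. Thanks to 喵喵抽风是大摆锤 for the pretrained model and the project. You are welcome to visit their repo to pay your respects and leave a tip. I will not repost their tip code here, lol.
If you like the Japanese version of Femirins, I recommend trying this repo: Plachtaa/VITS-fast-fine-tuning. Its author used a fairly anime-style dataset, and the fine-tuned models handle Japanese pronunciation quite well.
To avoid copyright disputes and the like, I will not release the model or the audio data, but you can use 喵喵抽风是大摆锤's project to obtain a model by recording Femirins's voice yourself and training on it.

Finally, "Fei-men" 🙏 (a pun on "Amen" using Femirins's name).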

