Training (and for other languages)
You need a lot of memory (>512 GB).
Ideally, the source datasets are all stored in the same <datasets_root> directory (by default, the "d" folder). By default, all preprocessing scripts output the cleaned data to a new directory <datasets_root>/SV2TTS, inside which a subdirectory is created for each model: the encoder, synthesizer, and vocoder.
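For illustration, a minimal sketch of where the cleaned data ends up, assuming the default "d" root described above:

```python
# Illustrative only: the output layout of the preprocessing scripts,
# assuming the default "d" datasets root.
from pathlib import Path

datasets_root = Path("d")
for model in ("encoder", "synthesizer", "vocoder"):
    print(datasets_root / "SV2TTS" / model)
# d/SV2TTS/encoder
# d/SV2TTS/synthesizer
# d/SV2TTS/vocoder
```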
Name | Language | Link | Comments | My link | My comments |
---|---|---|---|---|---|
Phoneme dictionary | En, Ru | En, Ru | Phoneme dictionary | link | Combined the Russian and English phoneme dictionaries |
LibriSpeech | En | link | 300 speakers, 360 h of clean speech | ||
VoxCeleb | En | link | 7,000 speakers, many hours of low-quality speech | ||
M-AILABS | Ru | link | 3 speakers, 46 h of clean speech | ||
open_tts, open_stt | Ru | open_tts, open_stt | Many speakers, many hours of low-quality speech | link | Cleaned up 4 hours of speech from one speaker; fixed the annotation and split it into segments of up to 7 seconds |
Voxforge+audiobook | Ru | link | Many speakers, 25 h of varying quality | link | Selected the good files, split them into segments, and added audiobooks from the internet; the result is 200 speakers with a couple of minutes each |
RUSLAN | Ru | link | One speaker, 40 h of good speech | link | Re-encoded to 16 kHz |
Mozilla | Ru | link | 50 speakers, 30 h of good speech | link | Re-encoded to 16 kHz and sorted the different users into separate folders |
Russian Single | Ru | link | One speaker, 9 h of good speech | link | Re-encoded to 16 kHz |
For the g2p model you need a phoneme dictionary for your language, where each entry is a string such as "stockham с т А к h a м" (a word followed by its space-separated phonemes).
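To make the expected format concrete, here is a minimal sketch (not part of the repo; load_phoneme_dict is a hypothetical helper) for parsing such a dictionary:

```python
# Hypothetical helper (not repo code): parse a phoneme dictionary in which
# each line is a word followed by its space-separated phonemes.
def load_phoneme_dict(path):
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            entries[parts[0]] = parts[1:]  # word -> list of phonemes
    return entries

# Example: load_phoneme_dict("dict.txt")["stockham"]
# -> ['с', 'т', 'А', 'к', 'h', 'a', 'м']
```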
For the encoder you need a LOT of audio, with each speaker placed in a separate folder. Fortunately, you can use unlabeled data with noise. If you do not have enough data for your language, you can use English, for example; the language is not that important to the encoder.
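As a sketch of the expected layout (the corpus and folder names here are made up), each speaker gets one folder of audio files, which you can sanity-check like this:

```python
# Sketch only: verify a one-folder-per-speaker layout such as
#   d/my_corpus/speaker_0001/*.wav
#   d/my_corpus/speaker_0002/*.wav
# "my_corpus" and the speaker folder names are hypothetical.
from pathlib import Path

corpus = Path("d") / "my_corpus"
for speaker_dir in sorted(p for p in corpus.iterdir() if p.is_dir()):
    n_wavs = len(list(speaker_dir.glob("**/*.wav")))
    print(f"{speaker_dir.name}: {n_wavs} wav files")
```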
The synthesizer requires a lot of clean, well-annotated audio from different speakers.
The vocoder is trained on the mels produced by the synthesizer, so it should preferably also get clean, well-prepared data.
If you want to build a model for several languages at once, think about the number of phonemes. The more of them there are, the harder it is for the model to learn; but if there are too few, the model will speak with an accent. Think about how the phonemes in your languages actually sound. And do not forget to give stressed vowels their own individual characters. For English, secondary stress plays a small role, and I would not single it out.
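As an illustration of giving stressed vowels their own symbols (the inventory below is hypothetical and abbreviated, not the one used by the pre-trained models), uppercase variants can serve as the stressed counterparts, matching the "А" in the dictionary example above:

```python
# Hypothetical, abbreviated phoneme inventory for a combined En+Ru model.
# Stressed vowels get separate symbols (here: the uppercase variants, as in
# the dictionary entry "stockham с т А к h a м" above).
_ru_vowels = ["а", "о", "у", "э", "ы", "и"]
_en_vowels = ["a", "e", "i", "o", "u"]
_stressed = [v.upper() for v in _ru_vowels + _en_vowels]
_symbols = _ru_vowels + _en_vowels + _stressed  # plus consonants, pad, EOS, etc.
```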
- Open g2p/train.py and edit the Hparams class (for other languages).
- Copy your dictionary into the g2p folder.
- Run
python g2p
For training, the encoder uses visdom. You can disable it with --no_visdom, but it's nice to have.
It is not necessary to train from scratch (even for other languages); you can start from the pre-trained model.
- Run
python encoder_preprocess.py <datasets_root>
to preprocess the data.
- Run "visdom" in a separate CLI/process to start your visdom server.
- Run
python encoder_train.py my_run <datasets_root>
to train the encoder.
- Open "synthesizer/hparams.py and edit by itself(Especially if you have a sound frequency at 16 kHz or error OOM)
- Open "synthesizer/utils/symbols.py and edit _characters for yourself(for other languages)
- Run
python synthesizer_preprocess_audio.py <datasets_root>
to create the processed audio and spectrograms.
- Run
python synthesizer_preprocess_embeds.py <datasets_root>
to encode the audio (obtain the voice characteristics, i.e. the speaker embeddings).
- Run
python synthesizer_train.py my_run <datasets_root>
to train the synthesizer.
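For the hparams edit mentioned above, this is the kind of change involved; a sketch only, since the actual parameter names depend on your checkout of synthesizer/hparams.py:

```python
# Hypothetical excerpt of synthesizer/hparams.py adjustments (names assumed):
sample_rate = 16000  # match the sample rate of your datasets (e.g. 16 kHz)
batch_size = 16      # lower this if training stops with out-of-memory errors
```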
- Run
python vocoder_preprocess.py <datasets_root>
to synthesize the mel spectrograms.
- Run
python vocoder_train.py <datasets_root>
to train the vocoder.