Training (and for other languages)
You need a lot of memory (>512 GB).
Ideally, the source datasets are all stored in the same <datasets_root> directory (by default, the "d" folder). By default, all preprocessing scripts output the cleaned data to a new directory <datasets_root>/SV2TTS, inside which a subdirectory is created for each model: the encoder, synthesizer, and vocoder.
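For illustration, a minimal sketch of where the cleaned data ends up, assuming the default "d" root described above:

```python
# Illustrative only: the output layout of the preprocessing scripts,
# assuming the default "d" datasets root.
from pathlib import Path

datasets_root = Path("d")
for model in ("encoder", "synthesizer", "vocoder"):
    print(datasets_root / "SV2TTS" / model)
# d/SV2TTS/encoder
# d/SV2TTS/synthesizer
# d/SV2TTS/vocoder
```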
Name | Language | Link | Comments | My link | My comments |
---|---|---|---|---|---|
Phoneme dictionary | En, Ru | En, Ru | Phoneme dictionary | link | Combined the Russian and English phoneme dictionaries |
LibriSpeech | En | link | 300 speakers, 360 h of clean speech | ||
VoxCeleb | En | link | 7,000 speakers, many hours of low-quality speech | ||
M-AILABS | Ru | link | 3 speakers, 46 h of clean speech | ||
open_tts, open_stt | Ru | open_tts, open_stt | Many speakers, many hours of low-quality speech | link | Cleaned up 4 hours of speech from one speaker; fixed the annotation and split it into segments of up to 7 seconds |
Voxforge+audiobook | Ru | link | Many speakers, 25 h of varying quality | link | Selected the good files, split them into segments, and added audiobooks from the internet; the result is 200 speakers with a couple of minutes each |
RUSLAN | Ru | link | One speaker, 40 h of good speech | link | Re-encoded to 16 kHz |
Mozilla | Ru | link | 50 speakers, 30 h of good speech | link | Re-encoded to 16 kHz and sorted the different users into separate folders |
Russian Single | Ru | link | One speaker, 9 h of good speech | link | Re-encoded to 16 kHz |
For the g2p model you need a phoneme dictionary for your language, where each entry is a string such as "stockham с т А к h a м" (a word followed by its space-separated phonemes).
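To make the expected format concrete, here is a minimal sketch (not part of the repo; load_phoneme_dict is a hypothetical helper) for parsing such a dictionary:

```python
# Hypothetical helper (not repo code): parse a phoneme dictionary in which
# each line is a word followed by its space-separated phonemes.
def load_phoneme_dict(path):
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            entries[parts[0]] = parts[1:]  # word -> list of phonemes
    return entries

# Example: load_phoneme_dict("dict.txt")["stockham"]
# -> ['с', 'т', 'А', 'к', 'h', 'a', 'м']
```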
For the encoder you need a LOT of audio, with each speaker placed in a separate folder. Fortunately, you can use unlabeled data with noise. If you do not have enough data for your language, you can use English, for example; the language is not that important to the encoder.
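As a sketch of the expected layout (the corpus and folder names here are made up), each speaker gets one folder of audio files, which you can sanity-check like this:

```python
# Sketch only: verify a one-folder-per-speaker layout such as
#   d/my_corpus/speaker_0001/*.wav
#   d/my_corpus/speaker_0002/*.wav
# "my_corpus" and the speaker folder names are hypothetical.
from pathlib import Path

corpus = Path("d") / "my_corpus"
for speaker_dir in sorted(p for p in corpus.iterdir() if p.is_dir()):
    n_wavs = len(list(speaker_dir.glob("**/*.wav")))
    print(f"{speaker_dir.name}: {n_wavs} wav files")
```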
The synthesizer requires a lot of clean, well-annotated audio from different speakers.
The vocoder is trained on the mels produced by the synthesizer, so it should preferably also get clean, well-prepared data.
If you want to build a model for several languages at once, think about the number of phonemes. The more of them there are, the harder it is for the model to learn; but if there are too few, the model will speak with an accent. Think about how the phonemes in your languages actually sound. And do not forget to give stressed vowels their own individual characters. For English, secondary stress plays a small role, and I would not single it out.
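As an illustration of giving stressed vowels their own symbols (the inventory below is hypothetical and abbreviated, not the one used by the pre-trained models), uppercase variants can serve as the stressed counterparts, matching the "А" in the dictionary example above:

```python
# Hypothetical, abbreviated phoneme inventory for a combined En+Ru model.
# Stressed vowels get separate symbols (here: the uppercase variants, as in
# the dictionary entry "stockham с т А к h a м" above).
_ru_vowels = ["а", "о", "у", "э", "ы", "и"]
_en_vowels = ["a", "e", "i", "o", "u"]
_stressed = [v.upper() for v in _ru_vowels + _en_vowels]
_symbols = _ru_vowels + _en_vowels + _stressed  # plus consonants, pad, EOS, etc.
```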
- Open g2p/train.py and edit the Hparams class (for other languages).
- Copy your dictionary into the g2p folder.
- Run
python g2p
For training, the encoder uses visdom. You can disable it with --no_visdom, but it's nice to have.
It is not necessary to train from scratch (even for other languages); you can start from the pre-trained model.
- Run
python encoder_preprocess.py <datasets_root>
to preprocess the data.
- Run "visdom" in a separate CLI/process to start your visdom server.
- Run
python encoder_train.py my_run <datasets_root>
to train the encoder.
- Open "synthesizer/hparams.py and edit by itself(Especially if you have a sound frequency at 16 kHz or error OOM)
- Open "synthesizer/utils/symbols.py and edit _characters for yourself(for other languages)
- Run
python synthesizer_preprocess_audio.py <datasets_root>
to create the processed audio and spectrograms.
- Run
python synthesizer_preprocess_embeds.py <datasets_root>
to encode the audio (obtain the voice characteristics, i.e. the speaker embeddings).
- Run
python synthesizer_train.py my_run <datasets_root>
to train the synthesizer.
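For the hparams edit mentioned above, this is the kind of change involved; a sketch only, since the actual parameter names depend on your checkout of synthesizer/hparams.py:

```python
# Hypothetical excerpt of synthesizer/hparams.py adjustments (names assumed):
sample_rate = 16000  # match the sample rate of your datasets (e.g. 16 kHz)
batch_size = 16      # lower this if training stops with out-of-memory errors
```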
- Run
python vocoder_preprocess.py <datasets_root>
to synthesize the mel spectrograms.
- Run
python vocoder_train.py <datasets_root>
to train the vocoder.