A PyTorch implementation of "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention".
Requirements:
- pytorch >= 1.3
- librosa
- scipy
- numpy
- matplotlib
- unidecode
- tqdm
Optional:
- simpleaudio and num2words, if you want to run `realtime.py`
- nltk for better text processing
For audio preprocessing I mainly used Kyubyong's DCTTS code. I trained the model on the LJSpeech Dataset and the German samples from the CSS10 Dataset. You can find pretrained models below.
If you want to train a model, you need to prepare your dataset:

- Create a directory `data` for your dataset and a sub-directory `data/wav` containing all your audio clips.
- Run `audio_processing.py -w data/wav -m data/mel -l data/lin`.
- Create a text file `data/lines.txt` containing the transcriptions of the audio clips in the following format:

      my-wav-file-000|Transcription of file my-wav-file-000.wav
      my-wav-file-001|Transcription of file my-wav-file-001.wav
      ...

Note that you don't need to remove umlauts or accents like ä, é, î, etc.; this is done automatically. If your transcript contains abbreviations or numbers, on the other hand, you will need to spell them out. For spelling out numbers you can install `num2words` and use `spell_out_numbers` from the script `text_processing.py`; a minimal sketch of the idea is shown below.
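For illustration, here is a minimal sketch of what spelling out numbers amounts to, built directly on `num2words`. The helper name `spell_out_numbers_demo` and the regex-based replacement are assumptions made for this example; check `text_processing.py` for the actual behaviour of `spell_out_numbers`.

```python
import re
from num2words import num2words

def spell_out_numbers_demo(text, lang="en"):
    """Replace each digit group with its spoken form, e.g. "12" -> "twelve".
    Illustrative helper only; spell_out_numbers in text_processing.py may
    handle more cases (ordinals, decimals, years, ...)."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group(0)), lang=lang), text)

print(spell_out_numbers_demo("Chapter 12 begins on page 3."))
# Chapter twelve begins on page three.
```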
After preparing the dataset you can start training the Text2Mel and SSRN networks. Run

    train_text2mel.py -d path/to/dataset
    train_ssrn.py -d path/to/dataset

By default, checkpoints will be saved every 10,000 steps, but you can set `-save_iter` to a custom value. If you want to continue training from a checkpoint, use `-r save/checkpoint-xxxxx.pth`. For other options run `train_text2mel.py -h` or `train_ssrn.py -h`, and have a look at `config.py`.
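As a side note, the checkpoints are ordinary PyTorch `.pth` files. The snippet below is only a generic sketch of how such a file can be inspected; the concrete file name and the keys stored by this repository's training scripts are assumptions.

```python
import torch

# Hypothetical checkpoint path following the save/checkpoint-xxxxx.pth pattern.
ckpt = torch.load("save/checkpoint-010000.pth", map_location="cpu")

# Checkpoints are typically a dict holding model weights, optimizer state and
# the global step, but the exact keys used by this repo may differ.
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```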
There are two scripts for generating audio:

- With `realtime.py` you can type sentences in the terminal and the computer will read them out aloud. Run `realtime.py --t2m text2mel-checkpoint.pth --ssrn ssrn-checkpoint.pth --lang en`.
- With `synthesize.py text.txt` you can generate a wav file from a given text file. Run it with the following arguments:
  - `--t2m`, `--ssrn`, `-o`: paths to the saved networks and the output file (optional).
  - `--max_N`: the text file will be split into chunks not longer than this length (optional). If not given, the value used for training in `config.py` is picked. Reducing this value might improve audio quality, but it increases generation time for longer texts and introduces breaks in sentences. A sketch of the chunking idea follows this list.
  - `--max_T`: number of mel frames to generate for each chunk (optional). If the endings of sentences are cut off, increase this value.
  - `--lang`: language of the text (optional). Defaults to `en` and is used to spell out numbers occurring in the text.
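To make the `--max_N` option more concrete, the sketch below shows one way text can be packed greedily into chunks no longer than `max_N` characters. It is illustrative only; the actual splitting logic in `synthesize.py` may differ, for example by preferring sentence boundaries.

```python
def chunk_text(text, max_n):
    """Greedily pack words into chunks no longer than max_n characters.
    Illustrative sketch of what --max_N implies; synthesize.py may split
    the text differently."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_n or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("The quick brown fox jumps over the lazy dog.", max_n=20))
# ['The quick brown fox', 'jumps over the lazy', 'dog.']
```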
Audio samples are available here. All samples were generated with the models below.
Language | Dataset | Text2Mel | SSRN |
---|---|---|---|
English | LJ Speech | 350k steps | 350k steps |
German | CSS10 | 150k steps | 100k steps |
- I use layer norm, dropout and learning rate decay during training (a sketch of such a block follows this list).
- The audio quality seems to deteriorate at the end of generated audio samples. A workaround is to set a low value for `--max_N` to reduce the length of each sample.
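For reference, the first note above roughly corresponds to convolution blocks of the following shape. This is only a sketch under the assumption of 1-D convolutions with layer normalization over channels followed by dropout; the real layer definitions live in the repository's network code and may differ.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Illustrative 1-D convolution block with layer norm and dropout.
    The actual Text2Mel/SSRN layers in this repo may be structured differently."""
    def __init__(self, channels, kernel_size=3, dilation=1, p_drop=0.05):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation,
                              padding=(kernel_size - 1) // 2 * dilation)
        self.norm = nn.LayerNorm(channels)  # normalizes over the channel axis
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):  # x: (batch, channels, time)
        y = self.conv(x)
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        return self.drop(torch.relu(y))
```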
- The audio preprocessing uses Kyubyong's DCTTS code. This repo also helped me with some difficulties I had during the implementation.
- Also see this other PyTorch implementation.