How to process the Chinese text data #313

Open: wants to merge 43 commits into base: master.

Commits (43):
- `f46b19b` Chinese mandarin (begeekmyfriend, Apr 13, 2018)
- `62ecae8` Sync (begeekmyfriend, Apr 18, 2018)
- `990269f` Set reduction factor as 2 to reduce loss value (begeekmyfriend, Apr 20, 2018)
- `302d819` Sync (begeekmyfriend, Apr 23, 2018)
- `267215c` Enable postnet for predicting linear spectrogram (begeekmyfriend, Apr 23, 2018)
- `6fde5fb` Predict linear spectrograms (begeekmyfriend, May 2, 2018)
- `08a0ef9` Limit mel frame number in case of OOM (begeekmyfriend, Jul 16, 2018)
- `96b7ab8` Use bias in convolution (begeekmyfriend, Jul 20, 2018)
- `5fdb0d5` Use frame shift as 12.5ms (begeekmyfriend, Jul 23, 2018)
- `a0343dd` Update for griffin lim sythesis (begeekmyfriend, Dec 10, 2018)
- `2134a5a` Remove wavenet module (begeekmyfriend, Dec 11, 2018)
- `211152a` Support online synthesis (begeekmyfriend, Dec 27, 2018)
- `86d262a` Biaobei corpus (begeekmyfriend, Dec 28, 2018)
- `a3cadc3` Substitute L1 loss for L2 loss for quick alignment (begeekmyfriend, Feb 22, 2019)
- `408802b` Update (begeekmyfriend, Feb 22, 2019)
- `f3bdae8` Batch synthesis (begeekmyfriend, Mar 1, 2019)
- `8dc26d6` Decoding until all finish tokens stop (begeekmyfriend, Mar 4, 2019)
- `0800ed5` Update (begeekmyfriend, Mar 7, 2019)
- `7da0f09` Add guided attention loss (begeekmyfriend, Mar 7, 2019)
- `6f1e0d9` Add guided attention loss (begeekmyfriend, Mar 7, 2019)
- `55d2879` Add guided attention loss (begeekmyfriend, Mar 7, 2019)
- `35dcc33` Fix (begeekmyfriend, Mar 7, 2019)
- `ca250a5` Fix (begeekmyfriend, Mar 8, 2019)
- `f5b82ad` Update (begeekmyfriend, Mar 11, 2019)
- `4083fdf` Fix alignment size (begeekmyfriend, Mar 27, 2019)
- `4c59f36` Update (begeekmyfriend, Mar 27, 2019)
- `2dd3c2f` Update (begeekmyfriend, Mar 29, 2019)
- `2cfc9a9` MAE would lose attention alignment (begeekmyfriend, Mar 30, 2019)
- `eb6e446` Update (begeekmyfriend, Apr 2, 2019)
- `a050b19` Remove dropout in conv1d to decrease loss (begeekmyfriend, Apr 15, 2019)
- `ad72806` Update (begeekmyfriend, Apr 15, 2019)
- `3b77ae4` Update test sentences (begeekmyfriend, Apr 22, 2019)
- `6f5fe3f` Adapt ratio of N:T for guided attention (begeekmyfriend, Apr 26, 2019)
- `1ec3de8` GTA synthesis (begeekmyfriend, Apr 28, 2019)
- `d264583` GTA (begeekmyfriend, May 5, 2019)
- `ca1c343` Merge branch 'mandarin-griffin-lim' of https://github.com/begeekmyfri… (begeekmyfriend, May 5, 2019)
- `9401b41` Fix stop token prediction failure (begeekmyfriend, May 13, 2019)
- `0d52045` Fix (begeekmyfriend, May 14, 2019)
- `fa7f5e5` Trim each input text in case of stop token prediction (begeekmyfriend, May 16, 2019)
- `f647ac6` Update (begeekmyfriend, Jun 20, 2019)
- `9143de4` Text list (begeekmyfriend, Jul 2, 2019)
- `556e3e4` Add silence interval for synthesized wav (begeekmyfriend, Jul 2, 2019)
- `e58b029` 80 dim mel spectrograms (begeekmyfriend, Sep 24, 2019)
README.md (111 changes: 58 additions & 53 deletions)
@@ -1,5 +1,5 @@
# Tacotron-2:
Tensorflow implementation of DeepMind's Tacotron-2. A deep neural network architecture described in this paper: [Natural TTS synthesis by conditioning Wavenet on MEL spectrogram predictions](https://arxiv.org/pdf/1712.05884.pdf)


# Repository Structure:
@@ -15,10 +15,20 @@
├── LJSpeech-1.1 (0)
│   └── wavs
├── logs-Tacotron (2)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── logs-Wavenet (4)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── pretrained
│   └── wavs
├── papers
├── tacotron
│   ├── models
@@ -30,26 +40,34 @@
│   │   ├── plots
│   │   └── wavs
│   └── natural
├── wavenet_output (5)
│   ├── plots
│   └── wavs
├── training_data (1)
│   ├── audio
│   ├── linear
│   └── mels
└── wavenet_vocoder
    └── models




The previous tree shows the current state of the repository (separate training, one step at a time).

- Step **(0)**: Get your dataset; here I use **Ljspeech**, **en_US** and **en_UK** (from **M-AILABS**) as examples.
- Step **(1)**: Preprocess your data. This will give you the **training_data** folder.
- Step **(2)**: Train your Tacotron model. Yields the **logs-Tacotron** folder.
- Step **(3)**: Synthesize/Evaluate the Tacotron model. Gives the **tacotron_output** folder.
- Step **(4)**: Train your Wavenet model. Yields the **logs-Wavenet** folder.
- Step **(5)**: Synthesize audio using the Wavenet model. Gives the **wavenet_output** folder.


Note:
- **Our preprocessing only supports Ljspeech and Ljspeech-like datasets (M-AILABS speech data)!** If running on datasets stored differently, you will probably need to make your own preprocessing script.
- In the previous tree, files **were not represented** and **max depth was set to 3** for simplicity.
- If you run training of both **models at the same time**, the repository structure will be different.

# Pretrained model and Samples:
Pre-trained models and audio samples will be added at a later date. You can, however, check some preliminary insights into the model's performance (at early stages of training) [here](https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-378741465). THIS IS VERY OUTDATED, I WILL UPDATE THIS SOON

# Model Architecture:
@@ -69,23 +87,24 @@
Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during the wavenet training.

# How to start
First, you need to have Python 3 installed along with [Tensorflow](https://www.tensorflow.org/install/).

Next, you can install the requirements. If you are an Anaconda user: (else replace **pip** with **pip3** and **python** with **python3**)

> pip install -r requirements.txt

# Dataset:
We tested the code above on the [ljspeech dataset](https://keithito.com/LJ-Speech-Dataset/), which has almost 24 hours of labeled recordings of a single female speaker. (Further information on the dataset is available in the README file that comes with the download.)

We are also currently running tests on the [new M-AILABS speech dataset](http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/), which contains more than 700 hours of speech (more than 80 GB of data) in more than 10 languages.

After **downloading** the dataset, **extract** the compressed file, and **place the folder inside the cloned repository.**

# Hparams setting:
Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all in the **hparams.py** file directly.
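
For example, a command-line override might look like the following (the **--hparams** flag and the exact parameter names, such as **outputs_per_step** and **sample_rate**, are shown only as an illustration; check **hparams.py** and **train.py** in this repository for the names actually supported):

> python train.py --model='Tacotron' --hparams='outputs_per_step=2,sample_rate=22050'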

To pick optimal FFT parameters, I have made a **griffin_lim_synthesis_tool** notebook that you can use to invert real extracted mel/linear spectrograms and check how good your preprocessing is. All other options are well explained in **hparams.py** and have meaningful names so that you can try multiple things with them.
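
If you prefer to experiment outside the notebook, the sketch below shows the same idea with plain `librosa`: compute a linear magnitude spectrogram with candidate FFT settings, invert it with Griffin-Lim, and listen to the result. This is a minimal sketch, not the repository's notebook; the file name and the parameter values are placeholders, so substitute the ones from **hparams.py**.

```python
import librosa
import numpy as np
import soundfile as sf

# Candidate preprocessing parameters (illustrative values, not the project's defaults).
sample_rate = 22050
n_fft = 2048
hop_length = 275   # ~12.5 ms frame shift at 22.05 kHz
win_length = 1100  # ~50 ms analysis window

# Load a reference utterance and compute its linear magnitude spectrogram.
wav, _ = librosa.load('reference_utterance.wav', sr=sample_rate)
linear = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length, win_length=win_length))

# Reconstruct a waveform from the magnitude only; Griffin-Lim estimates the phase.
recovered = librosa.griffinlim(linear, n_iter=60, hop_length=hop_length, win_length=win_length)
sf.write('reconstructed.wav', recovered, sample_rate)
```

If the reconstruction sounds muffled or metallic, the FFT/hop/window settings are a likely culprit.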

# Preprocessing
Before running the following steps, please make sure you are inside the **Tacotron-2** folder.

@@ -95,90 +114,76 @@
Preprocessing can then be started using:

> python preprocess.py

The dataset can be chosen using the **--dataset** argument. If using the M-AILABS dataset, you need to provide the **language, voice, reader, merge_books and book** arguments for your custom needs. Default is **Ljspeech**.

Example M-AILABS:

> python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'

or if you want to use all books for a single speaker:

> python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True

This should take no longer than a **few minutes.**

# Training:
To **train both models** sequentially (one after the other):

> python train.py --model='Tacotron-2'

The feature prediction model can be **trained separately** using:

> python train.py --model='Tacotron'

Checkpoints will be made every **5000 steps** and stored under the **logs-Tacotron** folder.

Naturally, **training the wavenet separately** is done by:

> python train.py --model='WaveNet'

Logs will be stored inside **logs-Wavenet**.

**Note:**
- If model argument is not provided, training will default to Tacotron-2 model training. (both models)
- Please refer to train arguments under [train.py](https://github.com/begeekmyfriend/Tacotron-2/blob/master/train.py) for a set of options you can use.
- It is now possible to run wavenet preprocessing alone using **wavenet_preprocess.py** (see the example command after this list).
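
For instance, assuming the script keeps its default arguments, standalone wavenet preprocessing would be run from the repository root as:

> python wavenet_preprocess.py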

# Synthesis
To **synthesize audio** in an **End-to-End** (text to audio) manner (both models at work):

> python synthesize.py --model='Tacotron-2'

For the spectrogram prediction network (separately), there are **three types** of mel spectrogram synthesis:

- **Evaluation** (synthesis on custom sentences). This is what we'll usually use after having a full end-to-end model.

> python synthesize.py --model='Tacotron' --mode='eval'

- **Natural synthesis** (let the model make predictions alone by feeding its last decoder output into the next time step).

> python synthesize.py --model='Tacotron' --GTA=False

- **Ground Truth Aligned synthesis** (DEFAULT: the model is assisted by true labels in a teacher forcing manner). This synthesis method is used to predict the mel spectrograms on which the wavenet vocoder is then trained (it yields better results, as stated in the paper). A conceptual sketch of the difference between GTA and natural synthesis is shown right after this list.

> python synthesize.py --model='Tacotron' --GTA=True
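
The sketch below illustrates the GTA vs. natural distinction. It is a conceptual toy, not the repository's decoder: `decoder_step` is a hypothetical function standing in for one decoder time step, and stop-token handling is omitted.

```python
import numpy as np

def synthesize_mels(decoder_step, ground_truth_mels, gta=True):
    """Run an autoregressive decoder for as many steps as the reference has frames."""
    prev_frame = np.zeros_like(ground_truth_mels[0])  # the initial <GO> frame
    outputs = []
    for t in range(len(ground_truth_mels)):
        predicted = decoder_step(prev_frame, t)
        outputs.append(predicted)
        # GTA (teacher forcing): feed the ground-truth frame back into the decoder,
        # so predicted mels stay aligned with the real audio (good vocoder targets).
        # Natural synthesis: feed the model's own prediction back in, which is what
        # happens at inference time when no ground truth exists.
        prev_frame = ground_truth_mels[t] if gta else predicted
    return np.stack(outputs)
```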

Synthesizing the **waveforms** conditioned on previously synthesized Mel-spectrograms (separately) can be done with:

> python synthesize.py --model='WaveNet'

**Note:**
- If model argument is not provided, synthesis will default to Tacotron-2 model synthesis. (End-to-End TTS)
- Please refer to synthesis arguments under [synthesize.py](https://github.com/begeekmyfriend/Tacotron-2/blob/master/synthesize.py) for a set of options you can use.


# References and Resources:
- [Natural TTS synthesis by conditioning Wavenet on MEL spectrogram predictions](https://arxiv.org/pdf/1712.05884.pdf)
- [Original tacotron paper](https://arxiv.org/pdf/1703.10135.pdf)
- [Attention-Based Models for Speech Recognition](https://arxiv.org/pdf/1506.07503.pdf)
- [Wavenet: A generative model for raw audio](https://arxiv.org/pdf/1609.03499.pdf)
- [Fast Wavenet](https://arxiv.org/pdf/1611.09482.pdf)
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
- [keithito/tacotron](https://github.com/keithito/tacotron)

**Work in progress**