# Pheme Model
This repo contains recipes and models used for training Pheme TTS models. It is the official implementation for the paper: [Pheme: Efficient and Conversational Speech Generation](https://arxiv.org/pdf/2401.02839.pdf). A demo is available [here](https://huggingface.co/spaces/PolyAI/pheme), while a selection of audio samples can be found [here](https://polyai-ldn.github.io/pheme/).

Our Pheme TTS framework validates several hypotheses:

1. We can train Transformer-based conversational TTS models with far less training data than models such as VALL-E or SoundStorm (e.g., 10x less data).
2. Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
3. Efficiency is paramount: parameter efficiency (compact models), data efficiency (less training data), and inference efficiency (reduced latency).
4. One fundamental ingredient is the separation of semantic and acoustic tokens and an adequate speech tokenizer.
5. Inference can be run in parallel through MaskGit-style inference, with 15x speed-ups compared to similarly sized autoregressive models (see the sketch after this list).
6. Single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
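
Hypothesis 5 in a nutshell: instead of emitting one token per forward pass, a MaskGit-style decoder starts from a fully masked sequence and commits only the most confident predictions over a fixed number of passes. The sketch below is purely illustrative (a NumPy stand-in model with a hypothetical `predict_logits`); it is not the repo's inference code.

```python
import numpy as np

MASK = -1  # hypothetical mask token id

def predict_logits(tokens: np.ndarray, vocab: int, rng) -> np.ndarray:
    """Stand-in for a bidirectional Transformer over acoustic tokens."""
    return rng.standard_normal((tokens.shape[0], vocab))

def maskgit_decode(length: int = 64, vocab: int = 1024, steps: int = 16, seed: int = 0):
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK)
    for step in range(1, steps + 1):
        logits = predict_logits(tokens, vocab, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        conf[tokens != MASK] = np.inf                # committed tokens stay committed
        filled = np.where(tokens == MASK, pred, tokens)
        # Cosine schedule: how many positions remain masked after this step.
        n_masked = int(length * np.cos(np.pi / 2 * step / steps))
        filled[np.argsort(conf)[:n_masked]] = MASK   # re-mask the least confident
        tokens = filled
    return tokens  # `steps` forward passes regardless of sequence length
```

With a fixed number of steps (16 here, matching the tables below), decoding cost no longer grows with sequence length, which is where the quoted 15x speed-up over autoregressive decoding comes from.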

# Set Up the Environment

Set up the conda environment:

```
conda create --name pheme3 python=3.10
conda activate pheme3
pip3 install -r requirements.txt --no-deps
```

Download pre-trained SpeechTokenizer and unique token list models:

``` bash
st_dir="ckpt/speechtokenizer/"
mkdir -p ${st_dir}
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/SpeechTokenizer.pt" -P ${st_dir}
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/config.json" -P ${st_dir}
wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_tokens.k2symbols" -P ${st_dir}
```

You need to create a Hugging Face access token to use the speaker embedding of pyannote.

```
export HUGGING_FACE_HUB_TOKEN=YOUR_PRIVATE_TOKEN
```
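
A minimal sketch of how such an embedding can be extracted (assumed pyannote.audio usage with a hypothetical input path; check the repo for how Pheme actually consumes the embeddings):

```python
import os
from pyannote.audio import Model, Inference

model = Model.from_pretrained(
    "pyannote/embedding",
    use_auth_token=os.environ["HUGGING_FACE_HUB_TOKEN"],
)
inference = Inference(model, window="whole")  # one embedding per file
embedding = inference("datasets/example/audios/LJ001-0051.wav")  # hypothetical path
print(embedding.shape)
```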

Download pre-trained T2S and S2A models (the 100M Pheme variant):

``` bash
git clone https://huggingface.co/PolyAI/pheme_small ckpt/pheme
mkdir -p "ckpt/t2s"
mkdir -p "ckpt/s2a"
mv ckpt/pheme/config_t2s.json ckpt/t2s/config.json
mv ckpt/pheme/t2s.bin ckpt/t2s/pytorch_model.bin
mv ckpt/pheme/config_s2a.json ckpt/s2a/config.json
mv ckpt/pheme/s2a.ckpt ckpt/s2a/s2a.ckpt
```

or use the larger version (300M) at `https://huggingface.co/PolyAI/pheme`.
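
Assuming the 300M repository mirrors the small variant's file layout, only the clone target changes:

``` bash
git clone https://huggingface.co/PolyAI/pheme ckpt/pheme
```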

# Prompt-based Generation

Generation can be invoked with:

```
python transformer_infer.py
```

# Training

## Data Preparation
The package expects data in the format `datasets/example/train.json`, with the wav files stored under `datasets/example/audios/`. The manifest should follow this format:

```
{
    "LJ001-0051.wav": {
        ...
    }
}
```
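
As a quick sanity check before tokenization — assuming the manifest keys are wav filenames that must exist in the audio folder — the pairing can be verified in a few lines:

```python
import json
from pathlib import Path

manifest = json.loads(Path("datasets/example/train.json").read_text())
audio_dir = Path("datasets/example/audios")

missing = [name for name in manifest if not (audio_dir / name).exists()]
print(f"{len(manifest)} entries, {len(missing)} missing wav files")
```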

The following command will create semantic and acoustic tokens based on the `audios` folder.

```
python utils/get_tokens_speech_tokenizer.py \
--config_path ckpt/speechtokenizer/config.json \
--ckpt_path ckpt/speechtokenizer/SpeechTokenizer.pt \
--encoding_input datasets/example/audios \
--encoding_output datasets/example/audios-speech-tokenizer
```
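
The output layout is defined by `utils/get_tokens_speech_tokenizer.py`; assuming it writes one array of RVQ codes per utterance (an assumption — inspect the output folder to confirm the actual format), a quick look could be:

```python
import numpy as np
from pathlib import Path

out_dir = Path("datasets/example/audios-speech-tokenizer")
for path in sorted(out_dir.rglob("*.npy"))[:3]:  # assumed .npy output
    codes = np.load(path)
    print(path.name, codes.shape)  # e.g. (n_quantizer_levels, n_frames)
```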

## T2S

```
python train_t2s.py --metapath datasets/example/train.json \
    --val_metapath datasets/example/train.json \
    --nworkers 12 --warmup_steps 10000 \
    --save_steps 500 --n_epochs 10
```

## S2A

```
python train_s2a.py --saving_path exp/a2s --sampledir exp/a2s --vocoder_type SPEECHTOKENIZER \
    --n_codes 1024 --n_cluster_groups 7 --metapath datasets/example/train.json \
    --val_metapath datasets/example/train.json
```

## Speed Test with TensorRT-LLM

### A100 GPU / 100M Pheme Variant
| Model                  | Batch Size | Steps | RTF   |
|------------------------|------------|-------|-------|
| T2S-S2A Short sentence | 1          | 16    | 0.133 |
| T2S-S2A Long sentence  | 1          | 16    | 0.133 |

### A100 GPU / 300M Pheme Variant
| Model                  | Batch Size | Steps | RTF   |
|------------------------|------------|-------|-------|
| T2S-S2A Short sentence | 1          | 16    | 0.143 |
| T2S-S2A Long sentence  | 1          | 16    | 0.143 |
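
RTF here is the real-time factor: synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A hedged timing harness (not the repo's benchmark code; `synthesize` is a placeholder):

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 16_000) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    audio = synthesize(text)  # placeholder: returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)
```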

## Acknowledgements

[MQTTS](https://github.com/b04901014/MQTTS)\
[SpeechTokenizer](https://github.com/ZhangXInFD/soundstorm-speechtokenizer)\
[maskgit](https://github.com/google-research/maskgit)\
[SoundStorm](https://github.com/lifeiteng/SoundStorm)

## TODO

1. Add TensorRT-LLM image

## Citation

If you use this code or components of the model in your own work, please cite our work as:

```Tex
@misc{budzianowski2024pheme,
      title={Pheme: Efficient and Conversational Speech Generation},
      author={Paweł Budzianowski and Taras Sereda and Tomasz Cichy and Ivan Vulić},
      year={2024},
      eprint={2401.02839},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```
