From b0e2eba5ede15915bdd421a1f4de1e55a0b165ae Mon Sep 17 00:00:00 2001
From: Tomasz Cichy
Date: Thu, 11 Jan 2024 15:53:17 +0000
Subject: [PATCH] Reformat README.md

---
 README.md | 62 +++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index d8176d5..57d7d3b 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,27 @@
 # Pheme Model
-This repo contains recipes and models used for training Pheme TTS models. It is the official implementation for the paper: [Pheme: Efficient and Conversational Speech Generation](https://arxiv.org/pdf/2401.02839.pdf). Demo is available [here](https://huggingface.co/spaces/PolyAI/pheme) while a selection of audio samples can be found [here](https://polyai-ldn.github.io/pheme/)
+
+This repo contains the recipes and models used for training Pheme TTS models. It is the official implementation of the
+paper [Pheme: Efficient and Conversational Speech Generation](https://arxiv.org/pdf/2401.02839.pdf). A demo is
+available [here](https://huggingface.co/spaces/PolyAI/pheme), while a selection of audio samples can be
+found [here](https://polyai-ldn.github.io/pheme/).
 
 Our Pheme TTS framework validates several hypotheses:
-1. We can train Transformer-based conversational TTS models with much fewer training data than e.g., VALL-E or SoundStorm (e.g., 10x fewer data).
+
+1. We can train Transformer-based conversational TTS models with far less training data than models such as VALL-E
+   or SoundStorm (roughly 10x less data).
 2. Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
-3. Efficiency is paramount, which includes parameter efficiency (compact models), data efficiency (fewer training data) and inference efficiency (reduced latency).
+3. Efficiency is paramount; this covers parameter efficiency (compact models), data efficiency (less training data)
+   and inference efficiency (reduced latency).
 4. One fundamental ingredient is the separation of semantic and acoustic tokens and an adequate speech tokenizer.
-5. Inference can be run parallelly through MaskGit-style inference with 15x speed-ups compared to similarly sized autoregressive models.
-6. The single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
-
+5. Inference can be run in parallel through MaskGit-style decoding, with 15x speed-ups over similarly sized
+   autoregressive models.
+6. The single-speaker quality can be improved through student-teacher training with (synthetic) data generated by
+   third-party providers.
 
 # Set Up the Environment
+
 Set up the conda environment:
+
 ```
 conda create --name pheme3 python=3.10
 conda activate pheme3
@@ -21,6 +31,7 @@ pip3 install -r requirements.txt --no-deps
 ```
 
 Download pre-trained SpeechTokenizer and unique token list models:
+
 ``` bash
 st_dir="ckpt/speechtokenizer/"
 mkdir -p ${st_dir}
@@ -32,11 +43,13 @@ wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_to
 ```
 
 You need to create an access token to use the speaker embedding of pyannote.
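+
+As a quick sanity check, the minimal sketch below (an illustration, not a script
+shipped with this repo) loads pyannote's gated speaker-embedding weights with the
+token; it assumes `pyannote.audio` from the requirements, the model ID
+`pyannote/embedding`, and that the token has been exported as the
+`HUGGING_FACE_HUB_TOKEN` variable shown in the next snippet:
+
+``` python
+import os
+
+from pyannote.audio import Model
+
+# Read the token exported as HUGGING_FACE_HUB_TOKEN (see the next snippet).
+token = os.environ["HUGGING_FACE_HUB_TOKEN"]
+
+# Downloading the gated checkpoint succeeds only if the token is valid and the
+# model's user conditions have been accepted on the Hugging Face Hub.
+embedding_model = Model.from_pretrained("pyannote/embedding", use_auth_token=token)
+print(type(embedding_model).__name__)
+```
+
+If this raises an authorization error, accept the model's user conditions on its
+Hugging Face page and regenerate the token.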
+
 ```
 export HUGGING_FACE_HUB_TOKEN=YOUR_PRIVATE_TOKEN
 ```
 
 Download pre-trained T2S and S2A models (the 100M Pheme variant):
+
 ``` bash
 git clone https://huggingface.co/PolyAI/pheme_small ckpt/pheme
 mkdir -p "ckpt/t2s"
@@ -47,18 +60,25 @@ mv ckpt/pheme/t2s.bin ckpt/t2s/pytorch_model.bin
 mv ckpt/pheme/config_s2a.json ckpt/s2a/config.json
 mv ckpt/pheme/s2a.ckpt ckpt/s2a/s2a.ckpt
 ```
+
 or the larger version (300M) at `https://huggingface.co/PolyAI/pheme`.
 # Prompt-based Generation
+
 The generation can be invoked by:
+
 ```
 python transformer_infer.py
 ```
+
 # Training
 
 ## Data Preparation
-The package requires data of the format: `datasets/example/train.json` with `datasets/audios/` where you store wav files.
+
+The package requires data in the format `datasets/example/train.json`, with the wav files stored under
+`datasets/audios/`.
 
 The manifest should follow the format:
+
 ```
 {
   "LJ001-0051.wav": {
@@ -74,7 +94,9 @@ The manifest should follow the format:
 }
 }
 ```
+
 The following command will create semantic and acoustic tokens based on the `audios` folder.
+
 ```
 python utils/get_tokens_speech_tokenizer.py \
 --config_path ckpt/speechtokenizer/config.json \
@@ -82,7 +104,9 @@ python utils/get_tokens_speech_tokenizer.py \
 --encoding_input datasets/example/audios \
 --encoding_output datasets/example/audios-speech-tokenizer
 ```
+
 ## T2S
+
 ```
 python train_t2s.py --metapath datasets/example/train.json \
 --val_metapath datasets/example/train.json \
@@ -91,7 +115,9 @@ python train_t2s.py --metapath datasets/example/train.json \
 --nworkers 12 --warmup_steps 10000 \
 --save_steps 500 --n_epochs 10
 ```
+
 ## S2A
+
 ```
 python train_s2a.py --saving_path exp/a2s --sampledir exp/a2s --vocoder_type SPEECHTOKENIZER \
 --n_codes 1024 --n_cluster_groups 7 --metapath datasets/example/train.json \
@@ -105,30 +131,36 @@ python train_s2a.py --saving_path exp/a2s --sampledir exp/a2s --vocoder_type SPE
 ```
 
 ## Speed test with TensorRT-LLM:
+
 ### A100 GPU / 100M Pheme Variant
-| Model | Batch Size | Steps | RTF (ms) |
-| --------------------------- | --------- | ----------- | ----------- |
-| T2S-S2A Short sentence | 1 | 16 | 0.133 |
-| T2S-S2A Long sentence | 1 | 16 | 0.133 |
+
+| Model                  | Batch Size | Steps | RTF   |
+|------------------------|------------|-------|-------|
+| T2S-S2A Short sentence | 1          | 16    | 0.133 |
+| T2S-S2A Long sentence  | 1          | 16    | 0.133 |
 
 ### A100 GPU / 300M Pheme Variant
-| Model | Batch Size | Steps | RTF (ms) |
-| --------------------------- | --------- | ----------- | ----------- |
-| T2S-S2A Short sentence | 1 | 16 | 0.143 |
-| T2S-S2A Long sentence | 1 | 16 | 0.143 |
+| Model                  | Batch Size | Steps | RTF   |
+|------------------------|------------|-------|-------|
+| T2S-S2A Short sentence | 1          | 16    | 0.143 |
+| T2S-S2A Long sentence  | 1          | 16    | 0.143 |
 
 ## Acknowledgements
+
 [MQTTS](https://github.com/b04901014/MQTTS)\
 [SpeechTokenizer](https://github.com/ZhangXInFD/soundstorm-speechtokenizer)\
 [maskgit](https://github.com/google-research/maskgit)\
 [SoundStorm](https://github.com/lifeiteng/SoundStorm)
 
 ## TODO
+
 1. Add TensorRT-LLM image
 
 ## Citation
+
 If you use this code or components of the model in your own work, please cite our work as:
+
 ```Tex
 @misc{budzianowski2024pheme,
       title={Pheme: Efficient and Conversational Speech Generation},