# Pheme Model
This repo contains recipes and models used for training Pheme TTS models. It is the official implementation for the paper: [Pheme: Efficient and Conversational Speech Generation](https://arxiv.org/pdf/2401.02839.pdf). A demo is available [here](https://huggingface.co/spaces/PolyAI/pheme), while a selection of audio samples can be found [here](https://polyai-ldn.github.io/pheme/).

Our Pheme TTS framework validates several hypotheses:

1. We can train Transformer-based conversational TTS models with far less training data than models such as VALL-E or SoundStorm (e.g., 10x less data).
2. Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
3. Efficiency is paramount: parameter efficiency (compact models), data efficiency (less training data), and inference efficiency (reduced latency).
4. One fundamental ingredient is the separation of semantic and acoustic tokens and an adequate speech tokenizer.
5. Inference can be run in parallel through MaskGit-style inference, with 15x speed-ups compared to similarly sized autoregressive models (see the sketch after this list).
6. Single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
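
Hypothesis 5 in a nutshell: instead of emitting one token per forward pass, a MaskGit-style decoder starts from a fully masked sequence and commits only the most confident predictions over a fixed number of passes. The sketch below is purely illustrative (a NumPy stand-in model with a hypothetical `predict_logits`); it is not the repo's inference code.

```python
import numpy as np

MASK = -1  # hypothetical mask token id

def predict_logits(tokens: np.ndarray, vocab: int, rng) -> np.ndarray:
    """Stand-in for a bidirectional Transformer over acoustic tokens."""
    return rng.standard_normal((tokens.shape[0], vocab))

def maskgit_decode(length: int = 64, vocab: int = 1024, steps: int = 16, seed: int = 0):
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK)
    for step in range(1, steps + 1):
        logits = predict_logits(tokens, vocab, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        conf[tokens != MASK] = np.inf                # committed tokens stay committed
        filled = np.where(tokens == MASK, pred, tokens)
        # Cosine schedule: how many positions remain masked after this step.
        n_masked = int(length * np.cos(np.pi / 2 * step / steps))
        filled[np.argsort(conf)[:n_masked]] = MASK   # re-mask the least confident
        tokens = filled
    return tokens  # `steps` forward passes regardless of sequence length
```

With a fixed number of steps (16 here, matching the tables below), decoding cost no longer grows with sequence length, which is where the quoted 15x speed-up over autoregressive decoding comes from.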

# Set Up the Environment

Set up the conda environment:

```
conda create --name pheme3 python=3.10
conda activate pheme3
pip3 install -r requirements.txt --no-deps
```

Download pre-trained SpeechTokenizer and unique token list models:

``` bash
st_dir="ckpt/speechtokenizer/"
mkdir -p ${st_dir}
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/SpeechTokenizer.pt" -P ${st_dir}
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/config.json" -P ${st_dir}
wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_tokens.k2symbols" -P ${st_dir}
```

You need to create a Hugging Face access token to use the speaker embedding of pyannote.

```
export HUGGING_FACE_HUB_TOKEN=YOUR_PRIVATE_TOKEN
```
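
A minimal sketch of how such an embedding can be extracted (assumed pyannote.audio usage with a hypothetical input path; check the repo for how Pheme actually consumes the embeddings):

```python
import os
from pyannote.audio import Model, Inference

model = Model.from_pretrained(
    "pyannote/embedding",
    use_auth_token=os.environ["HUGGING_FACE_HUB_TOKEN"],
)
inference = Inference(model, window="whole")  # one embedding per file
embedding = inference("datasets/example/audios/LJ001-0051.wav")  # hypothetical path
print(embedding.shape)
```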

Download pre-trained T2S and S2A models (the 100M Pheme variant):

``` bash
git clone https://huggingface.co/PolyAI/pheme_small ckpt/pheme
mkdir -p "ckpt/t2s"
mkdir -p "ckpt/s2a"
mv ckpt/pheme/config_t2s.json ckpt/t2s/config.json
mv ckpt/pheme/t2s.bin ckpt/t2s/pytorch_model.bin
mv ckpt/pheme/config_s2a.json ckpt/s2a/config.json
mv ckpt/pheme/s2a.ckpt ckpt/s2a/s2a.ckpt
```

or use the larger version (300M) at `https://huggingface.co/PolyAI/pheme`.
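
Assuming the 300M repository mirrors the small variant's file layout, only the clone target changes:

``` bash
git clone https://huggingface.co/PolyAI/pheme ckpt/pheme
```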

# Prompt-based Generation

Generation can be invoked with:

```
python transformer_infer.py
```

# Training

## Data Preparation
The package expects data in the format `datasets/example/train.json`, with the wav files stored under `datasets/example/audios/`. The manifest should follow this format:

```
{
    "LJ001-0051.wav": {
        ...
    }
}
```
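
As a quick sanity check before tokenization — assuming the manifest keys are wav filenames that must exist in the audio folder — the pairing can be verified in a few lines:

```python
import json
from pathlib import Path

manifest = json.loads(Path("datasets/example/train.json").read_text())
audio_dir = Path("datasets/example/audios")

missing = [name for name in manifest if not (audio_dir / name).exists()]
print(f"{len(manifest)} entries, {len(missing)} missing wav files")
```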

The following command will create semantic and acoustic tokens based on the `audios` folder.

```
python utils/get_tokens_speech_tokenizer.py \
--config_path ckpt/speechtokenizer/config.json \
--ckpt_path ckpt/speechtokenizer/SpeechTokenizer.pt \
--encoding_input datasets/example/audios \
--encoding_output datasets/example/audios-speech-tokenizer
```
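
The output layout is defined by `utils/get_tokens_speech_tokenizer.py`; assuming it writes one array of RVQ codes per utterance (an assumption — inspect the output folder to confirm the actual format), a quick look could be:

```python
import numpy as np
from pathlib import Path

out_dir = Path("datasets/example/audios-speech-tokenizer")
for path in sorted(out_dir.rglob("*.npy"))[:3]:  # assumed .npy output
    codes = np.load(path)
    print(path.name, codes.shape)  # e.g. (n_quantizer_levels, n_frames)
```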

## T2S

```
python train_t2s.py --metapath datasets/example/train.json \
    --val_metapath datasets/example/train.json \
    --nworkers 12 --warmup_steps 10000 \
    --save_steps 500 --n_epochs 10
```

## S2A

```
python train_s2a.py --saving_path exp/a2s --sampledir exp/a2s --vocoder_type SPEECHTOKENIZER \
    --n_codes 1024 --n_cluster_groups 7 --metapath datasets/example/train.json \
    --val_metapath datasets/example/train.json
```

## Speed Test with TensorRT-LLM

### A100 GPU / 100M Pheme Variant
| Model                  | Batch Size | Steps | RTF   |
|------------------------|------------|-------|-------|
| T2S-S2A Short sentence | 1          | 16    | 0.133 |
| T2S-S2A Long sentence  | 1          | 16    | 0.133 |

### A100 GPU / 300M Pheme Variant
| Model                  | Batch Size | Steps | RTF   |
|------------------------|------------|-------|-------|
| T2S-S2A Short sentence | 1          | 16    | 0.143 |
| T2S-S2A Long sentence  | 1          | 16    | 0.143 |
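
RTF here is the real-time factor: synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A hedged timing harness (not the repo's benchmark code; `synthesize` is a placeholder):

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 16_000) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    audio = synthesize(text)  # placeholder: returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)
```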

## Acknowledgements

[MQTTS](https://github.com/b04901014/MQTTS)\
[SpeechTokenizer](https://github.com/ZhangXInFD/soundstorm-speechtokenizer)\
[maskgit](https://github.com/google-research/maskgit)\
[SoundStorm](https://github.com/lifeiteng/SoundStorm)

## TODO

1. Add TensorRT-LLM image

## Citation

If you use this code or components of the model in your own work, please cite our work as:

```Tex
@misc{budzianowski2024pheme,
      title={Pheme: Efficient and Conversational Speech Generation},
      author={Paweł Budzianowski and Taras Sereda and Tomasz Cichy and Ivan Vulić},
      year={2024},
      eprint={2401.02839},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```
