Replies: 25 comments 38 replies
-
In the new update, there is an 'Experiment' button in the auto train settings. I'm still working on this part, so please check the settings before starting training to make sure everything is OK for the dataset you use. If you encounter memory issues, try using your own settings.
About multi-GPU: check that everything is OK, including how often checkpoints and the last checkpoint are saved. Note: every checkpoint needs 5 GB of disk space! About epochs: the default is now 10. You may need more or less; see what works for you. I'll run more tests and share some good values soon.
-
What does the Check Vocab step do in practice? I'm sorry, but I can't figure it out; what happens in this step if I want to fine-tune for French or Spanish, etc.?
-
What do you think, would it be a good idea to add a new tab with system and GPU info?
-
Thanks for this great work. I've managed to fine-tune my first model, but a noob question: how do I test the model, whether in the CLI or the GUI?
-
Awesome work! Are you planning to keep improving it and fixing errors?
-
@lpscr |
-
You should make it inform the user when ffmpeg is not available, instead of silently failing on the transcribe step; this was giving me problems on a RunPod instance.
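A minimal pre-flight check along these lines (a sketch, not the project's actual code) could surface the problem early instead of letting transcription fail silently:

```python
import shutil


def ffmpeg_status() -> str:
    """Report whether ffmpeg is on PATH, so the user sees a clear message up front."""
    path = shutil.which("ffmpeg")
    if path is None:
        return ("ffmpeg not found on PATH: install it "
                "(e.g. 'apt install ffmpeg') before transcribing.")
    return "ffmpeg found: " + path


print(ffmpeg_status())
```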
-
Can someone explain to me how to train with languages other than English?
-
How much data do you usually use (how many hours or minutes) to fine-tune and produce great results? I tried it in Colab on a T4 and I guess I also needed to lower the batch size per GPU. Anyway, thanks for this easy-to-use pathway for fine-tuning F5.
-
Hello @lpscr, thank you for sharing this project! I have been following your instructions: after placing WAV files in the newly created project folder and checking user, I clicked the Transcribe button and waited for the process to complete. However, the info field displays "You need to load an audio file." Could you please help me resolve this issue? Thank you!
-
`usage: finetune_cli.py [-h] [--exp_name {F5TTS_Base,E2TTS_Base}] [--dataset_name DATASET_NAME]`
Hello @lpscr, it appears that the update for the parameter `--file_checkpoint_train` has not yet been committed and merged into finetune_cli.py, although it has been merged in finetune_gradio. By the way, is this parameter intended to allow resuming fine-tuning from a previous checkpoint? Could you please provide an example of the exact path to the checkpoint? I assume it would be something like `ckpts/project_name/checkpoint.pt`, correct?
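Until that's confirmed, a small helper like this (hypothetical; it assumes checkpoints are saved as `model_<step>.pt` inside `ckpts/<project_name>/`) can pick the newest checkpoint to pass to `--file_checkpoint_train`:

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(project_dir: str) -> Optional[str]:
    """Return the path of the checkpoint with the highest step number, or None.

    Assumes files are named model_<step>.pt, e.g. ckpts/my_speak/model_120000.pt.
    """
    best = None
    for p in Path(project_dir).glob("model_*.pt"):
        m = re.fullmatch(r"model_(\d+)\.pt", p.name)
        if m and (best is None or int(m.group(1)) > best[0]):
            best = (int(m.group(1)), str(p))
    return best[1] if best else None
```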
-
To create a public link, set `share=True` in `launch()`.
-
@osmania101 Can you check that you have the latest update of the repo? Did you follow the correct installation steps at the beginning?
-
How can I prepare a multi-speaker dataset? If I have one, should I skip the speaker ID and only keep the text and audio pairs?
-
@HuuHuy227 You don't need a speaker ID; just the audio and text will work, for both single and multiple speakers.
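So a multi-speaker `metadata.csv` can simply interleave clips from different speakers, for example (file names invented):

```
alice_001.wav|Hello there.
bob_001.wav|How are you doing today?
alice_002.wav|Fine, thanks for asking.
```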
-
I encountered an error while running finetune_gradio.py
-
Hi @lpscr, I'm currently fine-tuning F5-TTS with the Gradio interface you provided for us; it's really amazing. But while trying to fine-tune from a previous checkpoint, I'm facing the issue below:
Can you please help me with this? Thank you.
-
@lpscr thank you for your contributions to this project. Although I have very limited compute resources (RTX 4070 laptop, 8 GB VRAM, 32 GB RAM), I did some fine-tuning experiments using your Gradio interface (120k steps with ~16 h of German speaker data ... it took 2 days, yet the results seem to be worth the effort). Given the aforementioned resource limitations, I have two questions:
Thanks again for your efforts. 3 cheers, MaMe82
-
It's amazing; I have never seen a model this easy to use that delivers such great results. Thank you so much for your contribution to the community. I hope you will continue to update it with even more interesting features. Do you have any recommendations on settings or the amount of data needed when training a new language? In my case, it's Vietnamese.
-
Hi @lpscr, it's a fantastic job! Thanks a lot.
-
Hey @lpscr, the training works fantastically, but it's quite a long training process. How can I resume it?
-
Hi, when I tried to fine-tune I got the following error:
When I tried removing the arg 'e', the training was stuck for a long time with no logs.
-
I encountered a problem: after successfully training the model and pressing "Test Model", the output audio was empty, silent, no sound at all. I tried again many times. Please help me.
-
Hi @lpscr, I've created a script and successfully built a metadata.csv and a folder with my wav files. I used the 9-hour split, so I'm actually working with just 9 hours of multi-speaker Italian; is that a fair amount for fine-tuning? I tried 10 epochs, but the result isn't good. I don't know how EMA works here, but wouldn't it make sense to add an option on the "Train Data" page to specify how to generate the sample? That way I could better judge the overall result by comparing samples without using EMA (I think it uses EMA by default?). Feel free to give me any tips to improve results, and thanks for the UI :) P.S.
-
Hello everyone!
I have created a Gradio application for easy fine-tuning and training of models. You can find it here:
https://github.com/lpscr/F5-TTS
EDIT: this has been merged into the main repo.
NEW: with the new version, everything is now automatic. You can easily fine-tune any language with just a few clicks. Here's a new, complete step-by-step video, with sound!
Please make sure the video is not muted (click the speaker icon in the video!). Enjoy ;)
amazing.tutorial.mp4
BTW: the male voice in the video was created using f5tts! You can get it from here: https://github.com/SWivid/F5-TTS/tree/main/src/f5_tts/infer/examples/basic
note for the new language
Before you start: if you are going to fine-tune a new language, you will need a substantial number of dataset hours! From what I've seen here, you can fine-tune a single voice with just 10 to 15 hours, but for multiple speakers you'll need more, about 50 hours to start. If you want a good model, aim for at least 100 hours; for something close to perfect, aim for at least 300 hours or more. See what works for you; it might also be possible to achieve good results with fewer hours in your case.
Others have also reported success with just 10 to 15 hours of fine-tuning for one or two voices, in the following languages: Spanish, Indian (Malayalam) with extended tokens, Hungarian.
note for English or Chinese
Regarding English or Chinese: if you want to fine-tune a speaker, first check whether it already works, because the base model is good enough that you may not need to fine-tune at all. You can test with 2 to 5 hours or more and see what works.
Please share any experiments or results about what works and what doesn't, so that others can know as well.
quick start
First create a new project, then see which of the steps below you need.
1 . Transcribe Data Option: Skip this if you already have a `metadata.csv` and a `wavs` folder.
You can simply click the audio button to open Explorer and select one or multiple audio files.
If you check `audio from path`, you need to place all audio files in `data/my_speak/dataset`.
You can click the random sample button to see text and audio.
2 . Vocab Check Option: Use this only when you want to train a new language.
If you need to extend the vocab, simply click "Check Vocab" to see all missing symbols, or write your own symbols like `a,b,c,d` etc.
If you click "Extend", this creates a new `model_1200000.pt` and `vocab.txt` file.
3 . Prepare Data Option: Skip this if you already have `raw.arrow`, `duration.json`, and `vocab.txt`. You can click the random sample button to see tokens and audio.
If you have the files `raw.arrow`, `duration.json`, and `vocab.txt`, make sure they are in the correct path.
In case you skip Transcribe, place your dataset (`wavs` folder and `metadata.csv` file) yourself.
Supported audio formats: `"wav", "mp3", "aac", "flac", "m4a", "alac", "ogg", "aiff", "wma", "amr"`.
This is how `metadata.csv` should look:
line 1. `audio1|text1` or `audio1.wav|text1` or `your_path/audio1.wav|text1`
line 2. `audio2|text2` or `audio2.mp3|text2` or `your_path/audio2.mp3|text2`
...
Click "Prepare" to create `raw.arrow`, `duration.json`, and `vocab.txt`.
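Since a malformed `metadata.csv` or uncovered characters are common failure points before "Prepare", here is a small stdlib sketch (the helper itself is hypothetical; it assumes one token per line in `vocab.txt`) that checks the `audio|text` format and lists characters missing from the vocab:

```python
from pathlib import Path


def check_dataset(metadata_path: str, vocab_path: str):
    """Validate 'audio|text' lines and report characters not covered by the vocab."""
    vocab = set(Path(vocab_path).read_text(encoding="utf-8").splitlines())
    bad_lines, missing = [], set()
    for i, line in enumerate(
        Path(metadata_path).read_text(encoding="utf-8").splitlines(), 1
    ):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("|")
        if len(parts) != 2 or not parts[0] or not parts[1]:
            bad_lines.append(i)  # wrong number of fields or empty field
            continue
        # collect text characters the vocab does not cover (whitespace is fine)
        missing |= {ch for ch in parts[1] if ch not in vocab and not ch.isspace()}
    return bad_lines, sorted(missing)
```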
4 . Train Data:
The auto settings button gives you good defaults, but you need to check that everything is OK. If you encounter memory issues, try your own settings: lower `batch size per gpu`.
For how often checkpoints and the last checkpoint are saved: set `save per update` to something that works for you, and use a smaller value for `last per step` to save `model_last.pt` more often, so that if training crashes or stops you can easily continue where you left off. Note: every checkpoint needs 5 GB of disk space!
About `epochs`: the default is now 10. You may need more or less; see what works for you.
While the model trains, you get sample audio every few steps so you can hear how well the model is doing. Click the refresh button or check the `ckpts/my_speak/sample` folder.
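Given the 5 GB-per-checkpoint note, a quick back-of-the-envelope helper (hypothetical; it assumes one checkpoint every `save per update` steps and that all of them are kept) can estimate disk usage before training starts:

```python
def checkpoint_disk_gb(total_steps: int, save_per_update: int,
                       gb_per_checkpoint: float = 5.0) -> float:
    """Rough disk usage if every saved checkpoint is kept (~5 GB each)."""
    return (total_steps // save_per_update) * gb_per_checkpoint


# e.g. 50k training steps, saving every 10k steps -> 5 checkpoints -> 25 GB
print(checkpoint_disk_gb(50_000, 10_000))
```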
folder.5 . Test Model: Testing your model is simple and easy. Check use_ema to be True or False to see what works best for you.
when you run the train the test model working in
cpu
mode ! you need stop the train to run ingpu
Click the 'Random Sample' button to view get a text and audio. for dataset
You can compare reference (ref) and generated (gen) audio, enter text in 'gen_text,' or load a new reference in 'ref_text.'
To load your audio reference, click the 'X' button. If the ref text is empty, it will automatically transcribe.
6 . Reduce Model Size: You can reduce the model size from 5 GB to 1 GB.
Select the checkpoint you want to reduce; it's all automatic now ;)
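The size reduction essentially comes from dropping optimizer state and keeping only the model (or EMA) weights. Real checkpoints are torch `.pt` files handled with `torch.load`/`torch.save`; this plain-dict sketch, with assumed key names, only illustrates the idea:

```python
def prune_checkpoint(ckpt: dict, use_ema: bool = True) -> dict:
    """Drop optimizer/scheduler state; keep only the weights needed for inference.

    Key names ('ema_model_state_dict', etc.) are assumptions, not the exact
    F5-TTS checkpoint layout.
    """
    weights_key = "ema_model_state_dict" if use_ema else "model_state_dict"
    return {weights_key: ckpt[weights_key]}


full = {
    "model_state_dict": {"w": [1, 2]},
    "ema_model_state_dict": {"w": [1, 2]},
    "optimizer_state_dict": {"momenta": [0] * 1000},  # the bulk of the file size
    "step": 120000,
}
small = prune_checkpoint(full)
print(list(small))  # only the EMA weights remain
```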