Add VITS 2 model #123

Marioando · 2024-11-01T08:08:59Z

Hi,
I'm working on adding vits2 model to coqui framework, while testing the implementation, I found out that the model does train well on single gpu, but as soon as the second step in multigpu training all loss are normal , i.e loss0 and loss1, exept loss2 which is the loss of the duration discriminator layer( it became nan). So here is my question , do you think I need to modify the trainer or modify the batch sampler in the model. Also I have made some change to the trainer to filter out null gradient in multigpu but doesnt work . Here is what I have already tryed : decrease learning rate for the duration discriminator, add gradient clipping, decrease batch size to 1 for testing none work on multigpu. The model seems to learn well on single gpu setup.
Thanks

Marioando · 2024-11-01T08:13:40Z

--> TIME: 2024-11-01 07:59:12 -- STEP: 199/406 -- GLOBAL_STEP: 100200
| > loss_disc: 2.293909788131714 (2.353372510354123)
| > loss_disc_real_0: 0.050190214067697525 (0.09111052809573301)
| > loss_disc_real_1: 0.22900593280792236 (0.20297033208698484)
| > loss_disc_real_2: 0.2125558704137802 (0.220549658090625)
| > loss_disc_real_3: 0.2014939934015274 (0.22777624622960785)
| > loss_disc_real_4: 0.2580271363258362 (0.22694031534782008)
| > loss_disc_real_5: 0.23579958081245422 (0.23088655137836034)
| > loss_0: 2.293909788131714 (2.353372510354123)
| > grad_norm_0: tensor(38.8719, device='cuda:0') (tensor(168.8013, device='cuda:0'))
| > loss_gen: 2.4380598068237305 (2.5592831391185973)
| > loss_kl: 3.0022356510162354 (5.0805860691933145)
| > loss_feat: 5.34114408493042 (5.2965420885900745)
| > loss_mel: 20.770143508911133 (21.53628662722793)
| > loss_duration: 1.849429965019226 (1.862948954404898)
| > loss_1: 33.4010124206543 (36.335646701218515)
| > grad_norm_1: tensor(815.8643, device='cuda:0') (tensor(1645.5496, device='cuda:0'))
| > loss_dur_disc: nan
| > loss_dur_disc_real_0: nan
| > amp_scaler: 64.0 (227.05527638190944)
| > loss_2: nan
| > grad_norm_2: tensor(0) (tensor(0))
| > current_lr_0: 0.0002
| > current_lr_1: 0.0002
| > current_lr_2: 0.0002
| > step_time: 1.8149 (1.429261895280387)
| > loader_time: 0.0206 (0.015218985140623162)

Marioando · 2024-11-01T08:16:13Z

if optimizer_idx == 2:

            output_prob_for_real, output_probs_for_pred = self.dur_disc(
                self.model_outputs_cache['hidden_encoded_text'],
                self.model_outputs_cache['hidden_encoded_text_mask'],
                self.model_outputs_cache['real_durations'],  # logscaled
                self.model_outputs_cache['predicted_durations'] # logscaled
            )

            outputs = {
                "real_durations": self.model_outputs_cache['real_durations'],  # logscaled
                "predicted_durations": self.model_outputs_cache['predicted_durations']  # logscaled
            }

            with autocast(enabled=False):
                loss_dict = criterion[optimizer_idx](
                    output_prob_for_real,
                    output_probs_for_pred,
                )

            return outputs, loss_dict

eginhard · 2024-11-01T21:19:34Z

Cool, would be happy to add Vits 2! Are you basing it on the initial work from @p0p4k in coqui-ai#3355?

Impossible to say why something isn't working without seeing any code. But I'm fine with merging something that just works with one GPU for now. It could be improved later. The original Vits had some issues with multi-GPU as well (#103), are these the same issues?

Marioando · 2024-11-02T04:53:17Z

I'll try to fix it before a PR. I can say that vits2 is a massive improvement on vits, at least to my ears, the model seems to be way more robust than vits, In my implementation, vits model trained with coqui can be trained as vits 2 by reiniting dp and text encoder at the beginning of the training, which allows me do compare the models.
I didnt use the prototype form p0p4k, it was way easier to start from the original vits in coqui.
I'm currently busy trying to add @p0p4k pflow implementation and this is my priority but I will try to work on the model as soon as possible.
Thank you for your time! I think coqui framework does make experimenting with tts way faster ! We appreciate your work maintaining this repo! Thank you.

p0p4k · 2024-11-02T05:04:44Z

Thanks for doing this work guys. If you need any other paper implementation or need assistance with porting to coqui lmk.

Marioando · 2024-11-03T05:18:06Z

@p0p4k how do we know when to freeze the duration discriminator in vits2 and also when to remove the noise from mas.

p0p4k · 2024-11-04T11:00:07Z

@Marioando For duration discriminator, do you mean freeze before we start it to train or freeze after the MAS is trained for sometime and gives accurate results?
The number of steps to remove noise from MAS might be experimental, id say 10k steps should be fine.

Marioando · 2024-11-04T14:48:10Z

@p0p4k I thought that the vits 2 paper said they trained the duration disc for 30k step, I reread the paper again and it was duration predictor so we dont need to freeze the duration disc, just freeze the duration predictor after we got good result. Right!?

p0p4k · 2024-11-04T23:14:18Z

Right, I was thinking that initially the mas is still waiting for text embeddings to get to a reasonable place to give the right gt durations and so we can wait for it to stabilize first and then begin the duration disc to train.

Marioando · 2024-11-07T06:44:36Z

@eginhard I have made a PR for vits2 here is some audio from the model :
vits2_audio_samples.zip.tar.gz
It's still not perfect but I think we can improve it. Concerning the multigpu training issues will computing loss1 with loss2 help i.e : using only two optimizer.

Marioando · 2024-11-07T06:49:42Z

I trained the model using d-vector.

eginhard added the enhancement New feature or request label Nov 8, 2024

eginhard changed the title ~~Vits 2 doesnt work on multigpu training.~~ Add VITS 2 model Nov 8, 2024

eginhard linked a pull request Nov 8, 2024 that will close this issue

A prototype for Vits 2 / Yourtts 2 #137

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add VITS 2 model #123

Add VITS 2 model #123

Marioando commented Nov 1, 2024

Marioando commented Nov 1, 2024

Marioando commented Nov 1, 2024

eginhard commented Nov 1, 2024

Marioando commented Nov 2, 2024

p0p4k commented Nov 2, 2024

Marioando commented Nov 3, 2024

p0p4k commented Nov 4, 2024

Marioando commented Nov 4, 2024 •

edited

Loading

p0p4k commented Nov 4, 2024

Marioando commented Nov 7, 2024 •

edited

Loading

Marioando commented Nov 7, 2024

Add VITS 2 model #123

Add VITS 2 model #123

Comments

Marioando commented Nov 1, 2024

Marioando commented Nov 1, 2024

Marioando commented Nov 1, 2024

eginhard commented Nov 1, 2024

Marioando commented Nov 2, 2024

p0p4k commented Nov 2, 2024

Marioando commented Nov 3, 2024

p0p4k commented Nov 4, 2024

Marioando commented Nov 4, 2024 • edited Loading

p0p4k commented Nov 4, 2024

Marioando commented Nov 7, 2024 • edited Loading

Marioando commented Nov 7, 2024

Marioando commented Nov 4, 2024 •

edited

Loading

Marioando commented Nov 7, 2024 •

edited

Loading