
About MAS (monotonic alignment search) #2

Open
hcy71o opened this issue Nov 7, 2023 · 8 comments

Comments


hcy71o commented Nov 7, 2023

I've also implemented an E2E system using a CFM prior (a different flow-matching architecture instead of the 1D U-Net). Despite using the prior loss from Grad-TTS (the prior loss between the text-encoder output and the latent variable z_0), the alignment framework fails to converge. Has anyone managed to solve this problem without using an external aligner (MFA)?
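For context, MAS itself is a small dynamic program: it finds the monotonic, surjective text-to-frame alignment that maximizes the summed log-likelihood. A minimal numpy sketch of the Viterbi-style search used in Glow-TTS/Grad-TTS (not the repo's actual implementation, which is typically written in Cython):

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Hard monotonic alignment maximizing the summed log-likelihood.

    log_p: [T_text, T_mel] array, log-likelihood of mel frame j under the
    Gaussian of text token i. Requires T_mel >= T_text.
    Returns a 0/1 alignment matrix of the same shape.
    """
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)  # best path score ending at (i, j)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                             # keep the same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf   # advance one token
            Q[i, j] = log_p[i, j] + max(stay, move)
    A = np.zeros_like(log_p, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):  # backtrack the best path
        A[i, j] = 1
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return A

log_p = np.log(np.array([[0.9, 0.1, 0.1],
                         [0.1, 0.9, 0.9]]))
print(monotonic_alignment_search(log_p).tolist())  # → [[1, 0, 0], [0, 1, 1]]
```

Since MAS only consumes per-frame log-likelihoods, the search itself is agnostic to whether the decoder is a diffusion U-Net or a CFM; convergence failures usually come from the likelihood term it is fed, not the search.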


p0p4k commented Nov 7, 2023

Hi, can you help me understand a few things about CFM, if it is not a problem? Add me on Discord: p0p4k.


p0p4k commented Nov 10, 2023

At the current state of the repo, it seems to converge on my private dataset.
(image attached)


p0p4k commented Nov 10, 2023

I removed the prior loss, as I think it was restricting the model's capacity. This run is without prior guidance. It may or may not be good; time will tell.
(image attached)
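For reference, the Grad-TTS-style prior loss being removed here is just the negative Gaussian log-likelihood of the target frames under the aligned text-encoder output. A minimal numpy sketch, with assumed shapes `[n_feats, T]` for the features and `[1, T]` for the mask (names are illustrative, not the repo's):

```python
import numpy as np

def prior_loss(z0, mu_y, mask):
    """Negative log-likelihood of z0 under N(mu_y, I), averaged over
    valid frames. z0, mu_y: [n_feats, T]; mask: [1, T], 1 = valid frame."""
    nll = 0.5 * ((z0 - mu_y) ** 2 + np.log(2 * np.pi))
    return float((nll * mask).sum() / (mask.sum() * z0.shape[0]))
```

Because this term pulls the latent frames toward the text-conditioned means, dropping it trades that regularization for extra latent-space freedom, which is exactly the capacity question discussed in this thread.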


hcy71o commented Nov 10, 2023

Thanks for sharing. I have one question after looking at the code: is generation performance still good even when detaching the posterior output? I thought it would cause overfitting in the latent domain and CFM learning would not work well, so I'm surprised it works. I'm currently using the prior loss and a WaveNet-based CFM, but I don't know the performance yet.

diff_loss, _ = self.decoder.compute_loss(x1=z_spec.detach(), mask=y_mask, mu=mu_y, spks=spks, cond=cond)

(Since KST shows up in your TensorBoard, I also asked the same question in Korean; the text above is the DeepL translation.) I've added you on Discord!


p0p4k commented Nov 10, 2023

> diff_loss, _ = self.decoder.compute_loss(x1=z_spec.detach(), mask=y_mask, mu=mu_y, spks=spks, cond=cond)

Earlier, I was not using detach, so this loss back-propagated to the spec_encoder and was causing issues.

> I thought it would cause overfitting in the latent domain and CFM learning would not work well, but I'm surprised it works.

Can you be more specific about this? I don't quite understand why it wouldn't work well. The performance is not very clear in my audio files yet, but I will check again after 1-2 days of training. (Thanks for adding me on Discord; what was your ID? Or just send me a "hi" there.)
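The detach point can be shown with a toy example (hypothetical stand-in modules, not the repo's actual classes): gradients from a loss computed on `z_spec.detach()` never reach the encoder, while parameters that enter the loss elsewhere still train.

```python
import torch

enc = torch.nn.Linear(4, 4)                  # stand-in for the spec/posterior encoder
mu = torch.nn.Parameter(torch.zeros(2, 4))   # stand-in for the text/prior branch

z_spec = enc(torch.randn(2, 4))
# Detach the target so the CFM loss cannot shape the encoder's latent space.
diff_loss = ((z_spec.detach() - mu) ** 2).mean()
diff_loss.backward()

assert enc.weight.grad is None  # encoder untouched by this loss
assert mu.grad is not None      # the other branch still receives gradients
```

Without the `.detach()`, the same `backward()` call would also populate `enc.weight.grad`, which is the back-propagation into the spec_encoder described above.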


hcy71o commented Nov 10, 2023

If my understanding is correct, z_spec is solely trained by the reconstruction loss (HiFi-GAN). Is that correct?

So I thought detaching z_spec can be viewed as training an autoencoder without any restriction (such as removing the prior in VITS or the VQ in NaturalSpeech 2), resulting in a high-variance latent space that is hard for the prior to estimate.


p0p4k commented Nov 10, 2023

Yes, you are correct. I think I will try both ways, with and without the prior loss, and check how it goes.


p0p4k commented Nov 20, 2023

If WaveGrad (https://github.com/lmnt-com/wavegrad) can do it, we can do it too: CFM directly in the waveform dimension? (Am I mistaken?)
