
About MAS (monotonic alignment search) #2

Open
hcy71o opened this issue Nov 7, 2023 · 8 comments

Comments


hcy71o commented Nov 7, 2023

I've also implemented an E2E system using a CFM prior (a different flow-matching architecture instead of the 1D U-Net). Despite using the prior loss from Grad-TTS (the prior loss between the text-encoder output and the latent variable z_0), the alignment framework fails to converge. Has anyone managed to solve this problem without using an external aligner (MFA)?
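For context, MAS itself is a small dynamic program: it finds the monotonic, surjective text-to-frame alignment that maximizes the summed log-likelihood. A minimal numpy sketch of the Viterbi-style search used in Glow-TTS/Grad-TTS (not the repo's actual implementation, which is typically written in Cython):

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Hard monotonic alignment maximizing the summed log-likelihood.

    log_p: [T_text, T_mel] array, log-likelihood of mel frame j under the
    Gaussian of text token i. Requires T_mel >= T_text.
    Returns a 0/1 alignment matrix of the same shape.
    """
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)  # best path score ending at (i, j)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                             # keep the same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf   # advance one token
            Q[i, j] = log_p[i, j] + max(stay, move)
    A = np.zeros_like(log_p, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):  # backtrack the best path
        A[i, j] = 1
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return A

log_p = np.log(np.array([[0.9, 0.1, 0.1],
                         [0.1, 0.9, 0.9]]))
print(monotonic_alignment_search(log_p).tolist())  # → [[1, 0, 0], [0, 1, 1]]
```

Since MAS only consumes per-frame log-likelihoods, the search itself is agnostic to whether the decoder is a diffusion U-Net or a CFM; convergence failures usually come from the likelihood term it is fed, not the search.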


p0p4k commented Nov 7, 2023

Hi, can you help me understand a few things about CFM, if it is not a problem? Add me on Discord: p0p4k.


p0p4k commented Nov 10, 2023

At the current state of the repo, it seems to converge on my private dataset.
(image attached)


p0p4k commented Nov 10, 2023

I removed the prior loss, as I think it was restricting the model's capacity. This run is without prior guidance. It may or may not be good; time will tell.
(image attached)
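For reference, the Grad-TTS-style prior loss being removed here is just the negative Gaussian log-likelihood of the target frames under the aligned text-encoder output. A minimal numpy sketch, with assumed shapes `[n_feats, T]` for the features and `[1, T]` for the mask (names are illustrative, not the repo's):

```python
import numpy as np

def prior_loss(z0, mu_y, mask):
    """Negative log-likelihood of z0 under N(mu_y, I), averaged over
    valid frames. z0, mu_y: [n_feats, T]; mask: [1, T], 1 = valid frame."""
    nll = 0.5 * ((z0 - mu_y) ** 2 + np.log(2 * np.pi))
    return float((nll * mask).sum() / (mask.sum() * z0.shape[0]))
```

Because this term pulls the latent frames toward the text-conditioned means, dropping it trades that regularization for extra latent-space freedom, which is exactly the capacity question discussed in this thread.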


hcy71o commented Nov 10, 2023

Thanks for sharing. I have one question after looking at the code: is generation performance still good even when detaching the posterior output? I thought it would cause overfitting in the latent domain and CFM learning would not work well, so I'm surprised it works. I'm currently using the prior loss and a WaveNet-based CFM, but I don't know the performance yet.

diff_loss, _ = self.decoder.compute_loss(x1=z_spec.detach(), mask=y_mask, mu=mu_y, spks=spks, cond=cond)

(Since KST shows up in your TensorBoard, I also asked the same question in Korean; the text above is the DeepL translation.) I've added you on Discord!


p0p4k commented Nov 10, 2023

> diff_loss, _ = self.decoder.compute_loss(x1=z_spec.detach(), mask=y_mask, mu=mu_y, spks=spks, cond=cond)

Earlier, I was not using detach, so this loss back-propagated to the spec_encoder and was causing issues.

> I thought it would cause overfitting in the latent domain and CFM learning would not work well, but I'm surprised it works.

Can you be more specific about this? I don't quite understand why it wouldn't work well. The performance is not very clear in my audio files yet, but I will check again after 1-2 days of training. (Thanks for adding me on Discord; what was your ID? Or just send me a "hi" there.)
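The detach point can be shown with a toy example (hypothetical stand-in modules, not the repo's actual classes): gradients from a loss computed on `z_spec.detach()` never reach the encoder, while parameters that enter the loss elsewhere still train.

```python
import torch

enc = torch.nn.Linear(4, 4)                  # stand-in for the spec/posterior encoder
mu = torch.nn.Parameter(torch.zeros(2, 4))   # stand-in for the text/prior branch

z_spec = enc(torch.randn(2, 4))
# Detach the target so the CFM loss cannot shape the encoder's latent space.
diff_loss = ((z_spec.detach() - mu) ** 2).mean()
diff_loss.backward()

assert enc.weight.grad is None  # encoder untouched by this loss
assert mu.grad is not None      # the other branch still receives gradients
```

Without the `.detach()`, the same `backward()` call would also populate `enc.weight.grad`, which is the back-propagation into the spec_encoder described above.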


hcy71o commented Nov 10, 2023

If my understanding is correct, z_spec is solely trained by the reconstruction loss (HiFi-GAN). Is that correct?

So I thought detaching z_spec can be viewed as training an autoencoder without any restriction (such as removing the prior in VITS or the VQ in NaturalSpeech 2), resulting in a high-variance latent space that is hard for the prior to estimate.


p0p4k commented Nov 10, 2023

Yes, you are correct. I think I will try both ways, with and without the prior loss, and check how it goes.


p0p4k commented Nov 20, 2023

If WaveGrad (https://github.com/lmnt-com/wavegrad) can do it, we can do it too: CFM directly in the waveform dimension? (Am I mistaken?)
