
Why is the Mochi diffusers video output worse than the official Mochi code? #10144

Open
foreverpiano opened this issue Dec 7, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@foreverpiano

Describe the bug

The quality of the generated video is worse than with the official Mochi code.

Reproduction

Run the diffusers Mochi code with the official prompt (a sketch follows below).
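
For reference, a minimal sketch of such a run, based on the diffusers Mochi example (the model id and settings follow the documented usage; num_frames is illustrative, and the official prompt is not reproduced here):

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # lower peak VRAM usage
pipe.enable_vae_tiling()         # decode latents in tiles to save memory

prompt = "<official Mochi prompt here>"  # placeholder; use the prompt from the official repo
frames = pipe(prompt, num_frames=84).frames[0]  # frame count is illustrative
export_to_video(frames, "mochi.mp4", fps=30)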

Logs

No response

System Info

diffusers@main

Who can help?

@a-r-r-o-w @yiyixuxu

@foreverpiano foreverpiano added the bug Something isn't working label Dec 7, 2024
@foreverpiano
Author

@DN6 #10033
Has this been fixed now?

@SahilCarterr
Contributor

Official Diffusers code:

mochi.mp4

PR #10033 (close to being merged):

mochi_new.mp4

The quality has improved a lot; this is now fixed @foreverpiano

@DN6
Collaborator

DN6 commented Dec 11, 2024

Hi @foreverpiano, yes, that branch should address the quality issues. We're currently looking into whether we can remove the dependency on torch autocast and still match the original repo.

@jmahajan117

Related issue, but I don't see MochiPipeline in release 0.31. How are you generating these videos?

@hlky
Collaborator

hlky commented Dec 14, 2024

@jmahajan117 Install from main. We are preparing the 0.32 release in the coming weeks.
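
For anyone landing here, a minimal sketch of installing from source and checking that MochiPipeline is importable (the dev version string is indicative, not guaranteed):

# Install the development version first, e.g.:
#   pip install git+https://github.com/huggingface/diffusers
import diffusers
from diffusers import MochiPipeline  # raises ImportError on releases before 0.32

print(diffusers.__version__)  # e.g. a 0.32 dev version when installed from main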

@foreverpiano
Author

foreverpiano commented Dec 14, 2024

I think the issue is still not fully fixed. The Mochi attention does not apply an attention mask (which is needed because the text tokens are padded), so the result differs from the original Mochi.

The encoder sequence should carry an attention mask because it is padded to 256 tokens. Padding with zeros is not equivalent to applying an attention mask: under the softmax, each zeroed key still contributes an exp(0) = 1 term to the denominator, shrinking the weights on the real tokens.

hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)

Although this difference might be small in practice, since attention has some robustness, the output still deviates from variable-length attention computed with cu_seq_len, due to this gap in the attention implementation.
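
To make the distinction concrete, a small self-contained sketch (hypothetical shapes, not the Mochi code itself) showing that zero-padding the key/value tokens changes the softmax output unless a boolean mask is supplied:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 3, 8)  # (batch, heads, queries, head_dim)
k = torch.randn(1, 1, 4, 8)  # 4 real text tokens
v = torch.randn(1, 1, 4, 8)

# Zero-pad the keys/values to a fixed length of 6 (2 padding tokens).
pad = torch.zeros(1, 1, 2, 8)
k_pad = torch.cat([k, pad], dim=2)
v_pad = torch.cat([v, pad], dim=2)

# Reference: attend over the real tokens only.
ref = F.scaled_dot_product_attention(q, k, v)

# Zero padding without a mask: each padded position still contributes
# exp(0) = 1 to the softmax denominator, shrinking the real weights.
no_mask = F.scaled_dot_product_attention(q, k_pad, v_pad)

# Zero padding with a boolean mask: padded logits become -inf, their
# weights become 0, and the result matches the unpadded reference.
mask = torch.tensor([True] * 4 + [False] * 2).view(1, 1, 1, 6)
masked = F.scaled_dot_product_attention(q, k_pad, v_pad, attn_mask=mask)

print(torch.allclose(ref, masked, atol=1e-6))   # True
print(torch.allclose(ref, no_mask, atol=1e-6))  # False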

cc @hlky @DN6 @a-r-r-o-w

@a-r-r-o-w
Member

@foreverpiano
Copy link
Author

@a-r-r-o-w The implementation appears better since it takes the mask into consideration. I believe this addresses my previous concern. Let me look into the video output from this version.
