
Why is the Mochi diffusers video output worse than the official Mochi code? #10144

Open
foreverpiano opened this issue Dec 7, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@foreverpiano

Describe the bug

The quality of the generated video is worse than with the official Mochi code.

Reproduction

Run the diffusers Mochi code with the official prompt (a sketch follows below).
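
For reference, a minimal sketch of such a run, based on the diffusers Mochi example (the model id and settings follow the documented usage; num_frames is illustrative, and the official prompt is not reproduced here):

import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # lower peak VRAM usage
pipe.enable_vae_tiling()         # decode latents in tiles to save memory

prompt = "<official Mochi prompt here>"  # placeholder; use the prompt from the official repo
frames = pipe(prompt, num_frames=84).frames[0]  # frame count is illustrative
export_to_video(frames, "mochi.mp4", fps=30)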

Logs

No response

System Info

diffusers@main

Who can help?

@a-r-r-o-w @yiyixuxu

@foreverpiano foreverpiano added the bug Something isn't working label Dec 7, 2024
@foreverpiano
Author

@DN6 #10033
Has this been fixed now?

@SahilCarterr
Contributor

Official Diffusers code:

mochi.mp4

PR #10033 (close to being merged):

mochi_new.mp4

The quality has improved a lot; this is now fixed @foreverpiano

@DN6
Collaborator

DN6 commented Dec 11, 2024

Hi @foreverpiano, yes, that branch should address the quality issues. We're currently looking into whether we can remove the dependency on torch autocast and still match the original repo.

@jmahajan117

Related issue, but I don't see MochiPipeline in release 0.31. How are you generating these videos?

@hlky
Collaborator

hlky commented Dec 14, 2024

@jmahajan117 Install from main. We are preparing the 0.32 release in the coming weeks.
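
For anyone landing here, a minimal sketch of installing from source and checking that MochiPipeline is importable (the dev version string is indicative, not guaranteed):

# Install the development version first, e.g.:
#   pip install git+https://github.com/huggingface/diffusers
import diffusers
from diffusers import MochiPipeline  # raises ImportError on releases before 0.32

print(diffusers.__version__)  # e.g. a 0.32 dev version when installed from main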

@foreverpiano
Author

foreverpiano commented Dec 14, 2024

I think the issue is still not fully fixed. The Mochi attention does not apply an attention mask (which is needed because the text tokens are padded), so the result differs from the original Mochi.

The encoder sequence should carry an attention mask because it is padded to 256 tokens. Padding with zeros is not equivalent to applying an attention mask: under the softmax, each zeroed key still contributes an exp(0) = 1 term to the denominator, shrinking the weights on the real tokens.

hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)

Although this difference might be small in practice, since attention has some robustness, the output still deviates from variable-length attention computed with cu_seq_len, due to this gap in the attention implementation.
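
To make the distinction concrete, a small self-contained sketch (hypothetical shapes, not the Mochi code itself) showing that zero-padding the key/value tokens changes the softmax output unless a boolean mask is supplied:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 3, 8)  # (batch, heads, queries, head_dim)
k = torch.randn(1, 1, 4, 8)  # 4 real text tokens
v = torch.randn(1, 1, 4, 8)

# Zero-pad the keys/values to a fixed length of 6 (2 padding tokens).
pad = torch.zeros(1, 1, 2, 8)
k_pad = torch.cat([k, pad], dim=2)
v_pad = torch.cat([v, pad], dim=2)

# Reference: attend over the real tokens only.
ref = F.scaled_dot_product_attention(q, k, v)

# Zero padding without a mask: each padded position still contributes
# exp(0) = 1 to the softmax denominator, shrinking the real weights.
no_mask = F.scaled_dot_product_attention(q, k_pad, v_pad)

# Zero padding with a boolean mask: padded logits become -inf, their
# weights become 0, and the result matches the unpadded reference.
mask = torch.tensor([True] * 4 + [False] * 2).view(1, 1, 1, 6)
masked = F.scaled_dot_product_attention(q, k_pad, v_pad, attn_mask=mask)

print(torch.allclose(ref, masked, atol=1e-6))   # True
print(torch.allclose(ref, no_mask, atol=1e-6))  # False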

cc @hlky @DN6 @a-r-r-o-w

@a-r-r-o-w
Member

@foreverpiano
Copy link
Author

@a-r-r-o-w The implementation appears better since it takes the mask into consideration. I believe this addresses my previous concern. Let me look into the video output from this version.
