Thanks for your interest. Our current code supports both image and text inputs simultaneously. When only an image is input, it functions as an image-to-video model, but the quality of the generated video may degrade.
I have a question: when only an image is input, how should the input text be set? Can it be set to null? And how does the text cross-attention work in that case?
Thanks a lot.
The input text is set to "" (an empty string). In that case, all T5 tokens are padding tokens, which means they do not function as text. Of course, this drastically reduces the quality of the generated video.
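A minimal sketch of what that looks like in practice, assuming a Hugging Face T5 text encoder and an illustrative fixed token budget of 77 (these names and numbers are assumptions, not the repo's actual API): with an empty prompt, the tokenized sequence is essentially all padding, so the embeddings fed into cross-attention carry no semantic signal.

```python
# Hedged sketch: encoding an empty prompt for image-only conditioning.
# Assumes a Hugging Face T5 encoder; max_length=77 is an illustrative choice.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
text_encoder = T5EncoderModel.from_pretrained("t5-base")

prompt = ""  # image-to-video mode: no text conditioning
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,   # assumed fixed token budget
    truncation=True,
    return_tensors="pt",
)

# For an empty prompt, the sequence is just the EOS token followed by padding,
# so the resulting embeddings provide no useful text signal to cross-attention.
with torch.no_grad():
    text_embeddings = text_encoder(
        input_ids=tokens.input_ids,
        attention_mask=tokens.attention_mask,
    ).last_hidden_state  # shape: (1, 77, hidden_dim)

print(tokens.input_ids)       # mostly pad_token_id
print(tokens.attention_mask)  # mostly zeros
```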
Hi, thanks for your amazing work! May I ask whether there is any plan to open-source the image-to-video model and related code?