Hello @SeungyounShin, thanks for testing zero-shot image-to-image translation.
As you mention, an autoregressive text-to-image generation model can perform unseen tasks in a zero-shot manner, even though the training dataset does not include exactly the same types of text-image pairs! However, zero-shot capability improves as the model size and dataset size grow together. Please note that the released minDALL-E (1.3B params, 14M text-image pairs) is still a smaller-scale model than OpenAI's original implementation (12B params, 250M text-image pairs).
This limitation should be resolved when a larger model is trained on a larger number of training samples, and we also plan to release larger-scale models.
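For anyone who wants to probe the zero-shot behavior of the released checkpoint themselves, a minimal sampling sketch is shown below. It follows the `Dalle.from_pretrained` / `model.sampling` interface from the minDALL-E README; the prompt and argument values here are illustrative assumptions, so check them against the repo version you use.

```python
# Minimal text-to-image sampling sketch for the released 1.3B checkpoint.
# Interface per the minDALL-E README; exact argument names may differ
# across repo versions, so treat this as a sketch rather than a recipe.
import numpy as np
from dalle.models import Dalle

device = 'cuda:0'
model = Dalle.from_pretrained('minDALL-E/1.3B')  # downloads the pretrained weights
model.to(device=device)

# Use a prompt unlikely to appear verbatim in the 14M training pairs,
# to see how well the model composes unseen concept combinations.
images = model.sampling(prompt='a painting of a monkey with sunglasses in the frame',
                        top_k=256,               # README recommends top_k <= 256
                        softmax_temperature=1.0,
                        num_candidates=16,
                        device=device).cpu().numpy()
images = np.transpose(images, (0, 2, 3, 1))  # (N, H, W, C) for viewing/saving
```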
This looks great!
Could you share some information on the setup you used for training the transformer model?
It would be helpful to have this information to better understand the cost of training DALL-E models.