
Questions about changing ViT to 378 input resolution, but got poor results #46

Open
OpenJarvisAI opened this issue Apr 14, 2024 · 5 comments

Comments

@OpenJarvisAI

Hi, I have already tried using ViT-336 and ConvNeXt + Qwen LLM, which works great and gives really good performance.

But when I tried another CLIP ViT model with an input size of 378, keeping everything else the same (including the training data), the results were extremely poor.

To be precise:

  1. The loss is lower: normally I get 0.9-1.0, but with the 378-input CLIP it reaches 0.7-0.8. The inference results, however, are very poor.
  2. The CLIP model I used was Apple's DNFS_vit_G_378 model.
  3. I have changed the ConvNeXt input resolution accordingly (a quick alignment check is sketched below).

Any idea why? It's really weird that a better, larger ViT gives worse results.
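
For reference, here is the alignment arithmetic I'm assuming between the two branches. This is just a sketch: the 14px patch size, the 32x total ConvNeXt downsampling, and the helper names are my own assumptions, not MGM's actual config code.

```python
# The ConvNeXt branch supplies high-resolution keys/values for the ViT query
# tokens, so its final feature map should tile the ViT patch grid exactly.

def vit_grid(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens per side for a square ViT input."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return image_size // patch_size

def convnext_input_for_grid(grid: int, total_stride: int = 32) -> int:
    """ConvNeXt input size that yields a grid x grid final feature map."""
    return grid * total_stride

for vit_size in (336, 378):
    g = vit_grid(vit_size)  # 24 for 336px, 27 for 378px
    print(f"ViT {vit_size}px -> {g}x{g} tokens ({g * g} total), "
          f"ConvNeXt input should be {convnext_input_for_grid(g)}px")
```

Under that assumption, moving the ViT from 336 to 378 means the auxiliary ConvNeXt input should move from 768 to 864 for the grids to line up.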

@hhaAndroid

same #43

@yanwei-li
Member

Hi, thanks for your report. If there are no other bugs, I guess you can try the following steps to locate the problem:

  1. If the performance is quite low (more than a 10% drop), there may be bugs in the implementation.
  2. Only apply DNFS_vit_G_378 without patch info mining to see whether the performance is satisfactory (a standalone sanity check of the tower is sketched after this list).
  3. If the previous steps all look good, try a larger ConvNeXt, such as CLIP-convnext_xxlarge, because a better ViT requires a stronger ConvNeXt to provide candidate keys and values for reference.
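
Before step 2, it may also help to sanity-check the 378px tower completely outside MGM, to rule out a stale 336px preprocessing pipeline or a mis-loaded checkpoint. Below is a minimal sketch with open_clip; the hub id is an assumption that the checkpoint is Apple's DFN CLIP at 378px, so substitute whatever checkpoint you actually used.

```python
import torch
import open_clip

# Assumption: the 378px tower is Apple's DFN CLIP on the Hugging Face hub.
# Replace HUB_ID with the checkpoint you actually trained with.
HUB_ID = "hf-hub:apple/DFN5B-CLIP-ViT-H-14-378"

model, _, preprocess = open_clip.create_model_and_transforms(HUB_ID)
model.eval()

# The eval transform should resize/crop to 378, not a 336 value carried over
# from the previous vision tower's image processor config.
print(preprocess)

# Smoke test: the tower should accept a 378x378 batch without shape errors
# and produce a pooled image embedding.
with torch.no_grad():
    feats = model.encode_image(torch.randn(1, 3, 378, 378))
print("pooled image feature shape:", tuple(feats.shape))
```

If the transform still targets 336, or the forward pass complains about positional-embedding shapes, the problem is more likely in the integration than in the released weights.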

@OpenJarvisAI
Author

Hi, I'm starting to doubt whether Apple's ViT weights are correct; it seems they may have posted the wrong weights...

Meanwhile, do you have any plans to use ViT-H or ViT-bigG?

@yanwei-li
Member

Hi, we do not plan to retrain the model with a larger ViT, because it would exceed our current resources.

@OpenJarvisAI
Author

@yanwei-li Hi, which directions are you currently working on to further improve the performance of MGM?
