Currently, there is an option to use a two-image grid to double the input, but this adds a lot of compute.
I just want to make the input resolution slightly larger, say from 336 to 448, while keeping the ConvNeXt input resolution the same (although I think it should currently be larger if the base vision tower is larger).
Is that possible? Can you give me some advice on how to adapt it?
Hi, of course it is possible. The main question is how to split the input image. A simple solution is to divide the 448x448 image into four overlapping 336x336 crops. However, the computational cost is then actually the same as that of 672x672.
In this case, I guess you can try using four 224x224 tiles to represent 448, which costs less while still giving a slightly larger resolution.
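To make the two splitting options concrete, here is a minimal sketch of the crop boxes and a rough token count for each. The function names are hypothetical and not from the repository, and the patch size of 14 is an assumption (it matches a CLIP ViT-L/14 style tower); adjust it if your vision tower differs.

```python
def tile_offsets(image_size, tile_size, tiles_per_side):
    """Evenly spaced top-left offsets so the tiles cover the whole side."""
    if tiles_per_side == 1:
        return [0]
    stride = (image_size - tile_size) / (tiles_per_side - 1)
    return [round(i * stride) for i in range(tiles_per_side)]


def crop_boxes(image_size, tile_size, tiles_per_side=2):
    """Return (left, top, right, bottom) boxes for a square grid of tiles."""
    offsets = tile_offsets(image_size, tile_size, tiles_per_side)
    return [
        (x, y, x + tile_size, y + tile_size)
        for y in offsets
        for x in offsets
    ]


def num_tokens(tile_size, num_tiles, patch_size=14):
    """Visual tokens produced by a ViT over all tiles (patch_size is assumed)."""
    per_side = tile_size // patch_size
    return num_tiles * per_side * per_side


# Option A: four overlapping 336x336 crops of a 448x448 image.
overlapping = crop_boxes(448, 336)
# Option B: four non-overlapping 224x224 tiles of the same image.
disjoint = crop_boxes(448, 224)

print(overlapping)          # neighbouring crops share a 224-pixel band
print(disjoint)
print(num_tokens(336, 4))   # same count as one 672x672 image
print(num_tokens(672, 1))
print(num_tokens(224, 4))   # cheaper option
```

Under the patch-size assumption, the four 336 crops yield as many tokens as a single 672x672 input, which is why the overlapping split saves nothing, while the four 224 tiles yield fewer than half as many.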
Hi, how about directly making the input 448 and then interpolating for the CLIP ViT? That would add more visual tokens. Is that possible, and would it improve performance?
Hi, if we directly enlarge the input to 448, we need to fine-tune the ViT model. Without a large amount of data, this could lead to inferior performance.
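For reference, the interpolation step being discussed amounts to resizing the ViT's grid of position embeddings to the new patch grid. The sketch below is a plain-Python illustration of that idea using bilinear interpolation; `resize_pos_grid` is a hypothetical helper, not code from this repository, and the 24 to 32 grid sizes assume a patch size of 14 (336/14 and 448/14). In practice this is usually done with a tensor library's interpolation routine, and, as noted above, the resized model would still need fine-tuning.

```python
def resize_pos_grid(grid, new_size):
    """Bilinearly resize a square grid of embedding vectors.

    grid is a list of rows, each row a list of equal-length float vectors.
    Corner positions are mapped onto corner positions (align-corners style).
    """
    old_size = len(grid)
    dim = len(grid[0][0])
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0
    out = []
    for i in range(new_size):
        y = i * scale
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            # Blend the four surrounding embeddings channel by channel.
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][c] + wx * grid[y0][x1][c])
                + wy * ((1 - wx) * grid[y1][x0][c] + wx * grid[y1][x1][c])
                for c in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out


# Toy 24x24 grid with 1-dimensional "embeddings" equal to the row index.
old_grid = [[[float(r)] for _ in range(24)] for r in range(24)]
new_grid = resize_pos_grid(old_grid, 32)

# 32 * 32 = 1024 positions instead of 24 * 24 = 576.
print(len(new_grid) * len(new_grid[0]))
```

The larger grid is where the extra visual tokens come from: every new position gets an embedding blended from its neighbours, which the pretrained ViT has never seen, hence the need for fine-tuning.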