
Questions about how to enlarge the base vision tower input resolution #48

Open
lucasjinreal opened this issue Apr 15, 2024 · 3 comments

Comments

@lucasjinreal
Currently, there is an option to use a two-image grid to double the input, but this introduces fairly heavy compute.

I want to make the input resolution only slightly larger, say from 336 to 448, while keeping the ConvNeXt input resolution the same (although I think it should currently be larger if the base vision tower is larger).

Is that possible? Could you give me some advice on how to adopt it?

@yanwei-li
Member

Hi, of course it is possible. The main question is how to split the input image. A simple solution is to divide the 448x448 input into 4x336x336 images with overlap. However, the computational cost is then the same as that of a 672x672 input.
In this case, I guess you can try using 4x224x224 crops to represent 448, which brings low cost with a slightly larger resolution.
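For illustration, the 4x224x224 scheme above can be sketched as a simple 2x2 grid split (a hypothetical helper, not code from this repo; a real overlapping split for the 336 case would offset the crop origins instead):

```python
import numpy as np

def split_into_grid(image, crop=224):
    """Split an HxWxC image into a 2x2 grid of crop x crop patches.

    Assumes H == W == 2 * crop, e.g. a 448x448 input split into
    four 224x224 crops as suggested above.
    """
    h, w = image.shape[:2]
    assert h == w == 2 * crop, "expected a 2*crop x 2*crop input"
    return [
        image[r:r + crop, c:c + crop]  # top-left, top-right, bottom-left, bottom-right
        for r in (0, crop)
        for c in (0, crop)
    ]
```

Each crop is then encoded independently by the 224-input vision tower, so the per-crop cost matches the tower's native resolution.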

@lucasjinreal
Author

Hi, how about directly making the input 448 and then interpolating the position embeddings for the CLIP ViT? That would add more visual tokens. Is that possible, and would it enhance performance?

@yanwei-li
Member

Hi, if we directly enlarge the input to 448, we need to fine-tune the ViT model. It could bring inferior performance without a large amount of data.
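For reference, the position-embedding interpolation mentioned above could look roughly like the sketch below (hypothetical, not from this repo; it uses nearest-neighbor resizing for simplicity, where real code would typically use bicubic interpolation, e.g. `torch.nn.functional.interpolate`). With a patch size of 14, going from 336 to 448 grows the grid from 24x24 to 32x32 tokens:

```python
import numpy as np

def interpolate_pos_embed(pos_embed, new_grid):
    """Resize ViT position embeddings to a new square grid size.

    pos_embed: (1 + g*g, D) array with the CLS token first,
    as in CLIP ViT. Returns (1 + new_grid*new_grid, D).
    """
    cls_tok, grid_tok = pos_embed[:1], pos_embed[1:]
    g = int(np.sqrt(grid_tok.shape[0]))
    d = grid_tok.shape[1]
    grid = grid_tok.reshape(g, g, d)
    # nearest-neighbor index map from the new grid back to the old grid
    idx = (np.arange(new_grid) * g / new_grid).astype(int)
    resized = grid[idx][:, idx]  # (new_grid, new_grid, D)
    return np.concatenate([cls_tok, resized.reshape(-1, d)], axis=0)
```

Even with the embeddings resized this way, the comment above still applies: the ViT weights were trained at 336, so matching quality at 448 would require fine-tuning on a large amount of data.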
