
Questions about how to enlarge the base vision tower input resolution #48

Open
lucasjinreal opened this issue Apr 15, 2024 · 3 comments

Comments

@lucasjinreal
Currently, there is an option to use a two-image grid to double the input, but this introduces fairly heavy compute.

I want to make the input resolution only slightly larger, say from 336 to 448, while keeping the ConvNeXt input resolution the same (although I think it should currently be larger if the base vision tower is larger).

Is that possible? Could you give me some advice on how to adopt it?

@yanwei-li
Member

Hi, of course it is possible. The main question is how to split the input image. A simple solution is to divide the 448x448 input into 4x336x336 images with overlap. However, the computational cost is then the same as that of a 672x672 input.
In this case, I guess you can try using 4x224x224 crops to represent 448, which brings low cost with a slightly larger resolution.
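For illustration, the 4x224x224 scheme above can be sketched as a simple 2x2 grid split (a hypothetical helper, not code from this repo; a real overlapping split for the 336 case would offset the crop origins instead):

```python
import numpy as np

def split_into_grid(image, crop=224):
    """Split an HxWxC image into a 2x2 grid of crop x crop patches.

    Assumes H == W == 2 * crop, e.g. a 448x448 input split into
    four 224x224 crops as suggested above.
    """
    h, w = image.shape[:2]
    assert h == w == 2 * crop, "expected a 2*crop x 2*crop input"
    return [
        image[r:r + crop, c:c + crop]  # top-left, top-right, bottom-left, bottom-right
        for r in (0, crop)
        for c in (0, crop)
    ]
```

Each crop is then encoded independently by the 224-input vision tower, so the per-crop cost matches the tower's native resolution.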

@lucasjinreal
Author

Hi, how about directly making the input 448 and then interpolating the position embeddings for the CLIP ViT? That would add more visual tokens. Is that possible, and would it enhance performance?

@yanwei-li
Member

Hi, if we directly enlarge the input to 448, we need to fine-tune the ViT model. It could bring inferior performance without a large amount of data.
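For reference, the position-embedding interpolation mentioned above could look roughly like the sketch below (hypothetical, not from this repo; it uses nearest-neighbor resizing for simplicity, where real code would typically use bicubic interpolation, e.g. `torch.nn.functional.interpolate`). With a patch size of 14, going from 336 to 448 grows the grid from 24x24 to 32x32 tokens:

```python
import numpy as np

def interpolate_pos_embed(pos_embed, new_grid):
    """Resize ViT position embeddings to a new square grid size.

    pos_embed: (1 + g*g, D) array with the CLS token first,
    as in CLIP ViT. Returns (1 + new_grid*new_grid, D).
    """
    cls_tok, grid_tok = pos_embed[:1], pos_embed[1:]
    g = int(np.sqrt(grid_tok.shape[0]))
    d = grid_tok.shape[1]
    grid = grid_tok.reshape(g, g, d)
    # nearest-neighbor index map from the new grid back to the old grid
    idx = (np.arange(new_grid) * g / new_grid).astype(int)
    resized = grid[idx][:, idx]  # (new_grid, new_grid, D)
    return np.concatenate([cls_tok, resized.reshape(-1, d)], axis=0)
```

Even with the embeddings resized this way, the comment above still applies: the ViT weights were trained at 336, so matching quality at 448 would require fine-tuning on a large amount of data.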
