
About the text encoder? #53

Open
liuheng92 opened this issue Jan 16, 2024 · 1 comment

Comments

@liuheng92

Hi, thanks for the great work, but there is a discrepancy with the paper. The paper says "We adopt BERT [26] as the text encoder and its parameters are trained in the first and second training stages while being frozen in the last training stage.", yet the repo instructs us to download a pretrained ViT model. So I am a little confused: should I use the original ViT model or the finetuned one? And where is the finetuned text encoder model?

@MasterBin-IIAU
Owner

Hi, the ViT weights need to be downloaded only when you want to use ViT as the visual encoder (visual backbone). As mentioned in the paper, we always use BERT-base as the text encoder. The code of the text encoder is here.

About the finetuning question: we always initialize the visual encoder with ImageNet-pretrained weights and the text encoder with HuggingFace-pretrained weights. During training, both sets of weights are then finetuned on instance perception data.
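The staged schedule described in the paper (text encoder trained in stages 1–2, frozen in stage 3) can be sketched in PyTorch as below. This is a minimal illustration, not the repo's actual code: `TinyTextEncoder` is a hypothetical stand-in for BERT-base, and `set_trainable` is a helper name I made up; freezing here just means disabling gradient tracking on the module's parameters.

```python
import torch.nn as nn


class TinyTextEncoder(nn.Module):
    """Hypothetical stand-in for BERT-base: one embedding + one linear layer."""

    def __init__(self, vocab_size=100, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids))


def set_trainable(module, trainable):
    # Freezing a module = turning off requires_grad on all its parameters,
    # so the optimizer no longer updates them.
    for p in module.parameters():
        p.requires_grad = trainable


encoder = TinyTextEncoder()

# Stages 1 and 2: the text encoder is finetuned along with the rest.
set_trainable(encoder, True)

# Stage 3: the text encoder is frozen.
set_trainable(encoder, False)
```

In a real setup you would also pass only the parameters with `requires_grad=True` to the optimizer (e.g. via `filter(lambda p: p.requires_grad, model.parameters())`) when entering the final stage.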
