Hi, thanks for the great work, but there is a discrepancy between the repo and the paper. The paper says "We adopt BERT [26] as the text encoder and its parameters are trained in the first and second training stages while being frozen in the last training stage.", yet the repo says we should download the pretrained ViT model. So I am a little confused: should I use the original ViT model or the finetuned one? And where is the finetuned text encoder model?
Hi, ViT only needs to be downloaded when you want to use it as the visual encoder (visual backbone). As mentioned in the paper, we always use BERT-base as the text encoder. The code of the text encoder is here.
Regarding the finetuning question: we always initialize the visual encoder with ImageNet pretrained weights and the text encoder with HuggingFace pretrained weights. During training, both encoders are then finetuned on instance perception data.
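
For illustration only, here is a minimal sketch of that initialization scheme (not the repo's actual code): the text encoder is loaded from HuggingFace BERT-base weights and the visual backbone from ImageNet-pretrained ViT weights, with both left trainable so they get finetuned during training. The `transformers`/`timm` usage and the specific model names below are assumptions for the sketch.

```python
# Sketch: initialize both encoders from public pretrained weights, keep them trainable.
import torch
from transformers import BertModel, BertTokenizer  # HuggingFace BERT-base as the text encoder
import timm                                         # timm ViT with ImageNet pretrained weights

# Text encoder: BERT-base initialized from HuggingFace pretrained weights.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Visual encoder (only if ViT is chosen as the backbone): ImageNet pretrained weights.
visual_encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

# Both encoders remain trainable, so their weights are finetuned during training
# (per the paper, the text encoder is frozen only in the last training stage).
tokens = tokenizer("a person riding a bike", return_tensors="pt")
text_feats = text_encoder(**tokens).last_hidden_state        # (1, seq_len, 768)
visual_feats = visual_encoder(torch.randn(1, 3, 224, 224))   # (1, 768)
```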