Hi, thanks for the great work, but there is a discrepancy between the repo and the paper. The paper says "We adopt BERT [26] as the text encoder and its parameters are trained in the first and second training stages while being frozen in the last training stage.", yet the repo says we should download the pretrained ViT model. So I am a little confused: should I use the original ViT model or the finetuned one? And where is the finetuned text encoder model?
Hi, ViT only needs to be downloaded when you want to use it as the visual encoder (visual backbone). As mentioned in the paper, we always use BERT-base as the text encoder. The code of the text encoder is here.
Regarding the finetuning question: we always initialize the visual encoder with ImageNet pretrained weights and the text encoder with HuggingFace pretrained weights. During training, both encoders are then finetuned on instance perception data.
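
For illustration only, here is a minimal sketch of that initialization scheme (not the repo's actual code): the text encoder is loaded from HuggingFace BERT-base weights and the visual backbone from ImageNet-pretrained ViT weights, with both left trainable so they get finetuned during training. The `transformers`/`timm` usage and the specific model names below are assumptions for the sketch.

```python
# Sketch: initialize both encoders from public pretrained weights, keep them trainable.
import torch
from transformers import BertModel, BertTokenizer  # HuggingFace BERT-base as the text encoder
import timm                                         # timm ViT with ImageNet pretrained weights

# Text encoder: BERT-base initialized from HuggingFace pretrained weights.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Visual encoder (only if ViT is chosen as the backbone): ImageNet pretrained weights.
visual_encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

# Both encoders remain trainable, so their weights are finetuned during training
# (per the paper, the text encoder is frozen only in the last training stage).
tokens = tokenizer("a person riding a bike", return_tensors="pt")
text_feats = text_encoder(**tokens).last_hidden_state        # (1, seq_len, 768)
visual_feats = visual_encoder(torch.randn(1, 3, 224, 224))   # (1, 768)
```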