
Questions about position encoding for a larger sequence length #5

pp00704831 opened this issue Jun 2, 2021 · 2 comments

@pp00704831
Hello,
In your paper, you crop the image into 48x48 patches with 3 channels, which pass through the heads. Before the features enter the transformer, each feature map is split into patches with kernel size 3, each treated as a word, to generate a 16x16 sequence of tokens.
How do you handle the position encoding if we input a larger sequence, such as 32x32 tokens?

Looking forward to your reply, thank you!

@HantingChen (Collaborator)

What do you mean by inputting a larger sequence for the position encoding?

As the position encoding is added to the input patches, its size should be exactly the same as that of the patches (16x16).
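For illustration, the shapes have to line up like this (a minimal sketch, assuming a learned position encoding; the embedding dim and variable names are made up, not from the paper):

```python
import torch

dim = 512              # embedding dimension (hypothetical value)
num_tokens = 16 * 16   # a 48x48 input split with kernel size 3 gives a 16x16 token grid

# One learned position vector per token position.
pos_embed = torch.nn.Parameter(torch.zeros(1, num_tokens, dim))

tokens = torch.randn(1, num_tokens, dim)  # output of the patch-embedding step
x = tokens + pos_embed                    # only works if both are [1, 256, dim]
```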

@pp00704831 (Author) commented Jul 5, 2021

Hello,

For the image deblurring task, you use a patch size of 256x256 with patch dim 8, so the number of tokens is 32x32.
But your pre-trained model is trained on size 48x48 with patch dim 3, so the number of tokens is 16x16.
There may be a mismatch in the position encoding with the pre-trained model.
Do you interpolate it from 16x16 to 32x32?
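If so, I guess the resize would look something like this (a minimal sketch of 2D interpolation of the pre-trained position encoding; the embedding dim, interpolation mode, and variable names are my assumptions, not from the paper):

```python
import torch
import torch.nn.functional as F

dim = 512  # embedding dimension (hypothetical value)

# Pre-trained position encoding for 16x16 = 256 token positions.
pos_embed = torch.randn(1, 16 * 16, dim)

# Reshape to a 2D grid, interpolate to 32x32, then flatten back.
grid = pos_embed.reshape(1, 16, 16, dim).permute(0, 3, 1, 2)   # [1, dim, 16, 16]
grid = F.interpolate(grid, size=(32, 32), mode="bicubic", align_corners=False)
pos_embed_32 = grid.permute(0, 2, 3, 1).reshape(1, 32 * 32, dim)  # [1, 1024, dim]
```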

Thank you!
