Hello,
In your paper, you crop an image into 48x48 patches with 3 channels, which are then passed through the heads. Before the features enter the transformer, each feature map is split into patches with kernel size 3, each patch acting as a word, producing a 16x16 sequence of tokens.
How do you handle the positional encoding if we feed in a larger sequence, such as 32x32 tokens?
Looking forward to your reply, thank you!
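For reference, here is a minimal sketch of how a 48x48 feature map with a 3x3 patching step yields a 16x16 token grid (the channel count 64 is a hypothetical placeholder, not taken from the paper):

```python
import torch

# Hypothetical feature map after the head: batch 1, 64 channels, 48x48 spatial size
feat = torch.randn(1, 64, 48, 48)

# Split into non-overlapping 3x3 patches ("words"):
# 48 / 3 = 16 patches per side, i.e. a 16x16 = 256 token sequence.
unfold = torch.nn.Unfold(kernel_size=3, stride=3)
tokens = unfold(feat)            # (1, 64 * 3 * 3, 256)
tokens = tokens.transpose(1, 2)  # (1, 256, 576): 256 tokens of dim 576
print(tokens.shape)
```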
For the image deblurring task, you use a patch size of 256x256 with patch dim 8, so the number of tokens is 32x32.
But your pre-trained model is trained on size 48x48 with patch dim 3, so the number of tokens is 16x16.
There seems to be a mismatch in the positional encoding with the pre-trained model.
Do you interpolate it from 16x16 to 32x32?
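If the positional embeddings are learned per token position, one common way to adapt them to a longer sequence (the ViT/DeiT-style approach, not necessarily what this paper does) is to 2D-interpolate the pre-trained 16x16 grid up to 32x32. A minimal sketch, assuming a learned embedding of shape (1, 256, dim):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=(16, 16), new_grid=(32, 32)):
    """Bicubically interpolate a learned 2D positional embedding grid."""
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pos_embed.shape[-1]
    # (1, old_h * old_w, dim) -> (1, dim, old_h, old_w) for image-style interpolation
    pe = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # back to (1, new_h * new_w, dim)
    return pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)

# Example: adapt a 16x16 pre-trained embedding to a 32x32 token grid
# (576 is a hypothetical embedding dim, not taken from the paper)
pretrained_pe = torch.randn(1, 16 * 16, 576)
resized_pe = resize_pos_embed(pretrained_pe, (16, 16), (32, 32))
print(resized_pe.shape)  # torch.Size([1, 1024, 576])
```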