
Questions about position encoding for a larger sequence length #5

pp00704831 opened this issue Jun 2, 2021 · 2 comments

@pp00704831
Hello,
In your paper, you crop the image into 48x48 patches with 3 channels, which pass through the heads. Before the features enter the transformer, each feature map is split into patches with kernel size 3, each treated as a word, to generate a 16x16 sequence of tokens.
How do you handle the position encoding if we input a larger sequence, such as 32x32 tokens?

Looking forward to your reply, thank you!

@HantingChen (Collaborator)

What do you mean by inputting a larger sequence for the position encoding?

As the position encoding is added to the input patches, its size should be exactly the same as that of the patches (16x16).
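For illustration, the shapes have to line up like this (a minimal sketch, assuming a learned position encoding; the embedding dim and variable names are made up, not from the paper):

```python
import torch

dim = 512              # embedding dimension (hypothetical value)
num_tokens = 16 * 16   # a 48x48 input split with kernel size 3 gives a 16x16 token grid

# One learned position vector per token position.
pos_embed = torch.nn.Parameter(torch.zeros(1, num_tokens, dim))

tokens = torch.randn(1, num_tokens, dim)  # output of the patch-embedding step
x = tokens + pos_embed                    # only works if both are [1, 256, dim]
```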

@pp00704831 (Author) commented Jul 5, 2021

Hello,

For the image deblurring task, you use a patch size of 256x256 with patch dim 8, so the number of tokens is 32x32.
But your pre-trained model is trained on size 48x48 with patch dim 3, so the number of tokens is 16x16.
There may be a mismatch in the position encoding with the pre-trained model.
Do you interpolate it from 16x16 to 32x32?
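If so, I guess the resize would look something like this (a minimal sketch of 2D interpolation of the pre-trained position encoding; the embedding dim, interpolation mode, and variable names are my assumptions, not from the paper):

```python
import torch
import torch.nn.functional as F

dim = 512  # embedding dimension (hypothetical value)

# Pre-trained position encoding for 16x16 = 256 token positions.
pos_embed = torch.randn(1, 16 * 16, dim)

# Reshape to a 2D grid, interpolate to 32x32, then flatten back.
grid = pos_embed.reshape(1, 16, 16, dim).permute(0, 3, 1, 2)   # [1, dim, 16, 16]
grid = F.interpolate(grid, size=(32, 32), mode="bicubic", align_corners=False)
pos_embed_32 = grid.permute(0, 2, 3, 1).reshape(1, 32 * 32, dim)  # [1, 1024, dim]
```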

Thank you!
