Fix positional embedding resampling for non-square inputs in ViT #2317
I have been running some experiments with ViTs for object detection. When I call `set_input_size` with a larger, non-square size like `[800, 1008]` and then expect the `dynamic_img_size` option to do its job when passing an image of a different size as input, it breaks.

When using the `dynamic_img_size` option, we dynamically resample (interpolate) the positional embeddings. The `resample_abs_pos_embed` function by default assumes the previous grid is square, which I guess is true in most cases. But why assume when we can always pass the old size explicitly?

pytorch-image-models/timm/layers/pos_embed.py
Lines 17 to 57 in d4dde48
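
To make the failure concrete, here is a minimal stdlib-only sketch of the arithmetic, not the actual timm code. It assumes a patch size of 16 (not stated above) and assumes the square default is derived from the square root of the token count; the helpers `grid_size` and `infer_square_grid` are hypothetical names used only for illustration.

```python
import math

def grid_size(img_size, patch_size):
    """Grid (rows, cols) for an image size that divides evenly by the patch size."""
    h, w = img_size
    return h // patch_size, w // patch_size

def infer_square_grid(num_tokens):
    """Guess the old grid from the token count alone, assuming it is square."""
    side = int(math.sqrt(num_tokens))
    return side, side

PATCH = 16  # assumption: the patch size is not given in this description

old_grid = grid_size((800, 1008), PATCH)   # true grid set via set_input_size
num_tokens = old_grid[0] * old_grid[1]     # number of grid position tokens
guessed = infer_square_grid(num_tokens)    # what a square default would infer

# The guessed square grid does not account for every token, so the
# embedding cannot be reshaped to it before interpolation.
mismatch = guessed[0] * guessed[1] != num_tokens

print(old_grid, num_tokens, guessed, mismatch)
```

Under these assumptions the true grid is 50 x 63 (3150 tokens), while the square guess is 56 x 56 (3136 tokens), so the two disagree. Passing the known old grid size explicitly avoids the guess altogether.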