Question about TnT pixel embed implementation #662
Unanswered · alexander-soare asked this question in Q&A
Hi community! I'd love to get your thoughts on a discrepancy between the paper and the code that I've noticed.
First of all, let's talk about ViT. There we do a patch embed by taking 16x16 non-overlapping windows, flattening them, and projecting to a target dimension (to match the transformer model dimension). We could actually do these steps one by one, or be clever like in the code and use a single `Conv2d`. Because `kernel_size == stride == patch_size`, we are effectively doing the same thing as the steps I mentioned; a sketch of the equivalence is below.
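Here's a minimal sketch of that equivalence (toy shapes, and the variable names are mine, not the exact timm code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: 16x16 patches of a 3-channel 224x224 image, projected to 768 dims.
in_chans, embed_dim, patch_size = 3, 768, 16
x = torch.randn(2, in_chans, 224, 224)

# The "clever" version: one conv where kernel_size == stride == patch_size,
# so every output location sees exactly one non-overlapping 16x16 window.
proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
out_conv = proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

# The explicit version: unfold into patches, flatten each, linearly project.
patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (B, C*16*16, num_patches)
weight = proj.weight.reshape(embed_dim, -1)                       # conv kernel viewed as a linear map
out_explicit = patches.transpose(1, 2) @ weight.T + proj.bias     # (B, num_patches, embed_dim)

print(torch.allclose(out_conv, out_explicit, atol=1e-5))  # True
```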
Now, as I understand from the TnT paper, "each patch is further transformed into the target size (p', p') with pixel unfold, and with a linear projection". The code looks like it utilises the same trick with `Conv2d`, but if you look carefully, you realise that `kernel_size != stride`, and now we use `padding`. So now our windows are overlapping. I couldn't find where in the paper they refer to this. Don't get me wrong, I like it because it adds more inductive bias into the mix, but I do want to understand the discrepancy.