Question about TnT pixel embed implementation #662
Unanswered · alexander-soare asked this question in Q&A
Hi community! I'd love to get your thoughts on a discrepancy between the paper and the code that I've noticed.
First of all, let's talk about ViT. There we do a patch embed by taking 16x16 non-overlapping windows, flattening them, and projecting to a target dimension (to match the transformer model dimension). We could actually do these steps one by one, or be clever like in the code and use a single `Conv2d`. Because `kernel_size == stride == patch_size`, we are effectively doing the same thing as the steps I mentioned; a sketch of the equivalence is below.
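Here's a minimal sketch of that equivalence (toy shapes, and the variable names are mine, not the exact timm code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: 16x16 patches of a 3-channel 224x224 image, projected to 768 dims.
in_chans, embed_dim, patch_size = 3, 768, 16
x = torch.randn(2, in_chans, 224, 224)

# The "clever" version: one conv where kernel_size == stride == patch_size,
# so every output location sees exactly one non-overlapping 16x16 window.
proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
out_conv = proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

# The explicit version: unfold into patches, flatten each, linearly project.
patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (B, C*16*16, num_patches)
weight = proj.weight.reshape(embed_dim, -1)                       # conv kernel viewed as a linear map
out_explicit = patches.transpose(1, 2) @ weight.T + proj.bias     # (B, num_patches, embed_dim)

print(torch.allclose(out_conv, out_explicit, atol=1e-5))  # True
```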
Now, as I understand from the TnT paper, "each patch is further transformed into the target size (p', p') with pixel unfold, and with a linear projection". The code looks like it utilises the same trick with `Conv2d`, but if you look carefully, you realise that `kernel_size != stride`, and now we use `padding`. So now our windows are overlapping. I couldn't find where in the paper they refer to this. Don't get me wrong, I like it because it adds more inductive bias into the mix, but I do want to understand the discrepancy.