-
@rwightman I was hoping to pick your brains on MLP Mixer. I recently saw your tweet about being able to replace all MLP layers with just convs - https://twitter.com/wightmanr/status/1391850999694848001. There are a few discussions on Twitter - Yann LeCun saying the MLP Mixer module is pretty much 1x1 convs (see here), but also the contrary opinion that calling MLP Mixer a conv is a bit of a stretch (see here). Keen to hear your opinion - could I replace all MLP Mixer blocks with just convs?
-
@amaarora I think it's clear from the Twitter convo by folk who've been at this much longer than myself that it's a touchy subject which boils down to semantics.

You can replace fully connected/dense layers with 1x1 convs pretty much anywhere, but you cannot replace a 1x1 conv with a dense layer if you expect/need the output spatial dim to vary with the input. Some CNN impls have used a 1x1 conv instead of an FC/linear layer for a long time. Convolution is clearly a more flexible operation than a dense/FC layer (a matmul). If you reduced every operation in a net to its simplest form (assuming you don't want spatial output dims that vary with input, but fixed singleton spatial dims), then all the 1x1s would end up as FC/dense layers.

Similarly for the patch embedding layer (patchification + projection): it can be done using a convolution with kernel_size=stride, since convolution is such a flexible operation, but you can also break an image into non-overlapping patches and project with reshape/view + matmul. So again you can get into a semantics argument that it's a convolution because it uses (or can use) a conv, but the most simple form of that operation, as used in these networks, cannot cover the functionality of a convolution.

So, in my view these networks are not 'convolutional neural networks'. In my books that label should be reserved for networks built from sequences of convolutions whose spatial dims can, up until the last fully connected layer(s) or other 'rigid' layers like many forms of self attention, vary with the input size. The (spatial) output size of a CNN (up to global pooling + non-conv FC) varies with the input. That is not the case for MLP Mixer; the input size must be fixed even if you represent the dense layers in the MLPs as 1x1 convs.
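For anyone following along, here is a minimal PyTorch sketch (my own illustration, not from the thread; the tensor shapes and names are arbitrary) of the two equivalences being discussed: a 1x1 conv with shared weights reproduces a dense layer exactly while keeping the spatial dims tied to the input, and the patchify + project step can be written either as a strided conv or as unfold + matmul.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Dense layer over channels vs 1x1 conv: same matmul, different layout ---
x = torch.randn(2, 64, 14, 14)            # (batch, channels, H, W)
fc = nn.Linear(64, 128)
conv1x1 = nn.Conv2d(64, 128, kernel_size=1)

# Copy the FC weights into the conv so both modules compute identical outputs.
with torch.no_grad():
    conv1x1.weight.copy_(fc.weight.view(128, 64, 1, 1))
    conv1x1.bias.copy_(fc.bias)

y_conv = conv1x1(x)                                    # (2, 128, 14, 14), H/W follow the input
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # same values via a channel-wise matmul
print(torch.allclose(y_conv, y_fc, atol=1e-6))         # True

# --- Patch embedding: strided conv vs unfold (non-overlapping patches) + matmul ---
img = torch.randn(2, 3, 224, 224)
patch, dim = 16, 768
proj_conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

patches = F.unfold(img, kernel_size=patch, stride=patch)          # (2, 3*16*16, 196)
proj_mat = proj_conv.weight.view(dim, -1)                         # (768, 3*16*16)
emb_matmul = proj_mat @ patches + proj_conv.bias.view(1, dim, 1)  # (2, 768, 196)
emb_conv = proj_conv(img).flatten(2)                              # (2, 768, 196)
print(torch.allclose(emb_conv, emb_matmul, atol=1e-5))            # True
```

Note that in the first half the conv version works for any H and W, while the Mixer-style dense formulation over tokens bakes the spatial size into the weight shapes, which is the fixed-input-size point above.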