Add TF swiftformer (#23342)
* Duplicate swiftformer

* Convert SwiftFormerPatchEmbedding

* Convert SwiftFormerEmbeddings

* Convert TFSwiftFormerMlp

* Convert TFSwiftFormerConvEncoder

* Convert TFSwiftFormerLocalRepresentation

* Convert TFSwiftFormerEncoderBlock

* Convert SwiftFormerStage

* Convert SwiftFormerEncoder

* Add TFSwiftFormerPreTrainedModel

* Convert SwiftFormerForImageClassification

* Add kwargs and start drop path

* Fix syntax

* Change Model class name

* Add TFSwiftFormer to __init__

* Duplicate test_modeling_swiftformer

* First test conversions

* Change require_torch to require_tf

* Add exports to swiftformer __init__

* Add TFSwiftFormerModel wrapper

* Fix __init__ and run black

* Remove docstring from MainLayer, fix padding

* Use keras.layers.Activation on keras.Sequential

* Fix swiftformer exports

* Fix activation layer from config

* Remove post_inits

* Use tf.keras.layers.ZeroPadding2D

* Convert torch normalize

* Change tf test input shape

* Fix softmax and reduce_sum

* Convert expand_dims and repeat

* Add missing reshape and transpose

* Simplify TFSwiftFormerEncoderBlock.call

* Fix mismatch in patch embeddings

* Fix expected output shape to match channels last

* Fix swiftformer typo

* Disable test_onnx

* Fix TFSwiftFormerForImageClassification call

* Add unpack inputs

* Convert flatten(2).mean(-1)

* Change vision dummy inputs (to be reviewed)

* Change test_forward_signature to use .call

* Fix @unpack_inputs

* Set return_tensors="tf" and rename class

* Rename wrongly named patch_embeddings layer

* Add serving_output and change dummy_input shape

* Make dimensions BCHW and transpose inside embedding layer

* Change SwiftFormerEncoderBlock

* Fix ruff problems

* Add image size to swiftformer config

* Change transpose to MainLayer and use -1 for reshape

* Remove serving_outputs and dummy_inputs

* Remove test_initialization test from tf model

* Make Sequential component a separate layer

* Fix layers' names

* Transpose encoder outputs

* Fix tests and check if hidden states is not None

* Fix TFSwiftFormerForImageClassification

* Run make fixup

* Run make fix-copies

* Update modeling_tf_auto

* Update docs

* Fix modeling auto mapping

* Update modeling_tf_swiftformer docs

* Fill image_size doc and type

* Add reduction=None to loss computation

* Update docs

* make style

* Debug: Delete the tip to see if that changes anything

* Re-add tip

* Remove add_code_sample_docstrings

* Remove unused import

* Get the debug to actually tell us the problem it has with the docs

* Try a substitution to match the PyTorch file?

* Add swiftformer to ignore list

* Add build() methods

* Update copyright year

Co-authored-by: amyeroberts <[email protected]>

* Remove FIXME comment

* Remove from_pt

* Update copyright year

Co-authored-by: amyeroberts <[email protected]>

* Rename one-letter variables

* Remove FIXMEs related to momentum

* Remove old TODO comment

* Remove outstanding FIXME comments

* Get dropout rate from config

* Add specific dropout config for MLP

* Add convencoder dropout to config

* Pass config to SwiftFormerDropPath layer

* Fix drop_path variable name and add Adapted from comment

* Run ruff

* Removed copied from comment

* Run fix copies

* Change drop_path to identity to match pt

* Cleanup build() methods and move to new keras imports

* Update docs/source/en/model_doc/swiftformer.md

Co-authored-by: Matt <[email protected]>

* Raise error if drop_path_rate > 0.0

* Apply suggestions from code review

Replace (self.dim), with self.dim,

Co-authored-by: Matt <[email protected]>

* Remove drop_path function

* Add training to TFSwiftFormerEncoder

* Set self.built = True last

Co-authored-by: amyeroberts <[email protected]>

* Should have been added to previous commit

Co-authored-by: amyeroberts <[email protected]>

* Apply suggestions from code review

Co-authored-by: amyeroberts <[email protected]>

* Change default_feature_extractor to default_image_processor

Co-authored-by: amyeroberts <[email protected]>

* Import Keras from modeling_tf_utils

* Remove relative import

* Run ruff --fix

* Move import keras to tf_available

* Add copied from comment to test_forward_signature

* Reduce batch size and num_labels

* Extract loss logic to hf_compute_loss

* Run ruff format

---------

Co-authored-by: Matt <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Matt <[email protected]>
4 people authored and ArthurZucker committed Apr 22, 2024
1 parent d447753 commit 20f55a1
Showing 11 changed files with 1,244 additions and 20 deletions.
2 changes: 1 addition & 1 deletion docs/source/en/index.md
@@ -275,7 +275,7 @@ Flax), PyTorch, and/or TensorFlow.
| [StableLm](model_doc/stablelm) ||||
| [Starcoder2](model_doc/starcoder2) ||||
| [SuperPoint](model_doc/superpoint) ||||
| [SwiftFormer](model_doc/swiftformer) || ||
| [SwiftFormer](model_doc/swiftformer) || ||
| [Swin Transformer](model_doc/swin) ||||
| [Swin Transformer V2](model_doc/swinv2) ||||
| [Swin2SR](model_doc/swin2sr) ||||
12 changes: 11 additions & 1 deletion docs/source/en/model_doc/swiftformer.md
@@ -26,7 +26,7 @@ The abstract from the paper is the following:

*Self-attention has become a defacto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.*

-This model was contributed by [shehan97](https://huggingface.co/shehan97).
+This model was contributed by [shehan97](https://huggingface.co/shehan97). The TensorFlow version was contributed by [joaocmd](https://huggingface.co/joaocmd).
The original code can be found [here](https://github.com/Amshaker/SwiftFormer).

## SwiftFormerConfig
@@ -42,3 +42,13 @@ The original code can be found [here](https://github.com/Amshaker/SwiftFormer).

 [[autodoc]] SwiftFormerForImageClassification
     - forward
+
+## TFSwiftFormerModel
+
+[[autodoc]] TFSwiftFormerModel
+    - call
+
+## TFSwiftFormerForImageClassification
+
+[[autodoc]] TFSwiftFormerForImageClassification
+    - call
14 changes: 14 additions & 0 deletions src/transformers/__init__.py
@@ -4517,6 +4517,14 @@
             "TFSpeech2TextPreTrainedModel",
         ]
     )
+    _import_structure["models.swiftformer"].extend(
+        [
+            "TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "TFSwiftFormerForImageClassification",
+            "TFSwiftFormerModel",
+            "TFSwiftFormerPreTrainedModel",
+        ]
+    )
     _import_structure["models.swin"].extend(
         [
             "TF_SWIN_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -8901,6 +8909,12 @@
             TFSpeech2TextModel,
             TFSpeech2TextPreTrainedModel,
         )
+        from .models.swiftformer import (
+            TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+            TFSwiftFormerForImageClassification,
+            TFSwiftFormerModel,
+            TFSwiftFormerPreTrainedModel,
+        )
         from .models.swin import (
             TF_SWIN_PRETRAINED_MODEL_ARCHIVE_LIST,
             TFSwinForImageClassification,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_tf_auto.py
@@ -81,6 +81,7 @@
         ("sam", "TFSamModel"),
         ("segformer", "TFSegformerModel"),
         ("speech_to_text", "TFSpeech2TextModel"),
+        ("swiftformer", "TFSwiftFormerModel"),
         ("swin", "TFSwinModel"),
         ("t5", "TFT5Model"),
         ("tapas", "TFTapasModel"),
@@ -213,6 +214,7 @@
         ("regnet", "TFRegNetForImageClassification"),
         ("resnet", "TFResNetForImageClassification"),
         ("segformer", "TFSegformerForImageClassification"),
+        ("swiftformer", "TFSwiftFormerForImageClassification"),
         ("swin", "TFSwinForImageClassification"),
         ("vit", "TFViTForImageClassification"),
     ]
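
Note: the two entries added to modeling_tf_auto.py register SwiftFormer under the `swiftformer` model type. A minimal sketch of how such a name table can be consulted, assuming a plain `OrderedDict` lookup. The table names `BASE_MODEL_NAMES` and `IMAGE_CLASSIFICATION_NAMES` and the helper `resolve_tf_class_name` are hypothetical and only illustrate the idea; they are not the library's actual lookup code.

```python
from collections import OrderedDict

# Subsets of the two tables touched by this commit (entries copied from the diff).
# The variable names are hypothetical.
BASE_MODEL_NAMES = OrderedDict(
    [
        ("speech_to_text", "TFSpeech2TextModel"),
        ("swiftformer", "TFSwiftFormerModel"),
        ("swin", "TFSwinModel"),
    ]
)
IMAGE_CLASSIFICATION_NAMES = OrderedDict(
    [
        ("segformer", "TFSegformerForImageClassification"),
        ("swiftformer", "TFSwiftFormerForImageClassification"),
        ("swin", "TFSwinForImageClassification"),
    ]
)


def resolve_tf_class_name(model_type, mapping):
    """Hypothetical helper: map a config's model_type to a TF class name."""
    try:
        return mapping[model_type]
    except KeyError:
        raise ValueError(f"No TensorFlow class registered for model type {model_type!r}") from None


base_name = resolve_tf_class_name("swiftformer", BASE_MODEL_NAMES)
classifier_name = resolve_tf_class_name("swiftformer", IMAGE_CLASSIFICATION_NAMES)
```

Before this commit, a lookup for `swiftformer` in either table would have failed, which is why both tables need an entry.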
26 changes: 26 additions & 0 deletions src/transformers/models/swiftformer/__init__.py
@@ -16,6 +16,7 @@
 from ...utils import (
     OptionalDependencyNotAvailable,
     _LazyModule,
+    is_tf_available,
     is_torch_available,
 )

@@ -41,6 +42,19 @@
         "SwiftFormerPreTrainedModel",
     ]

+try:
+    if not is_tf_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_tf_swiftformer"] = [
+        "TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "TFSwiftFormerForImageClassification",
+        "TFSwiftFormerModel",
+        "TFSwiftFormerPreTrainedModel",
+    ]
+
 if TYPE_CHECKING:
     from .configuration_swiftformer import (
         SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -60,6 +74,18 @@
             SwiftFormerModel,
             SwiftFormerPreTrainedModel,
         )
+    try:
+        if not is_tf_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_tf_swiftformer import (
+            TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+            TFSwiftFormerForImageClassification,
+            TFSwiftFormerModel,
+            TFSwiftFormerPreTrainedModel,
+        )

 else:
     import sys
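
Note: the `try` / `except OptionalDependencyNotAvailable` / `else` blocks above only register the TF symbols when TensorFlow is installed. The following is a rough, standard-library-only sketch of that guard pattern. `is_available`, `build_import_structure`, and the demo module names are hypothetical stand-ins, and the local `OptionalDependencyNotAvailable` class merely mimics the one imported in the diff.

```python
import importlib.util


class OptionalDependencyNotAvailable(Exception):
    """Local stand-in for the exception of the same name used in the diff."""


def is_available(package):
    """Hypothetical check: True if a top-level package can be imported."""
    return importlib.util.find_spec(package) is not None


def build_import_structure(backends):
    """Register each backend's symbols only if its package is installed."""
    import_structure = {"configuration_swiftformer": ["SwiftFormerConfig"]}
    for package, (module, names) in backends.items():
        try:
            if not is_available(package):
                raise OptionalDependencyNotAvailable()
        except OptionalDependencyNotAvailable:
            pass  # Backend missing: leave its symbols unregistered.
        else:
            import_structure[module] = names
    return import_structure


structure = build_import_structure(
    {
        # "json" is in the standard library, so this backend counts as present.
        "json": ("modeling_demo_available", ["DemoModel"]),
        # This package name is assumed not to exist on any machine.
        "surely_missing_backend_xyz": ("modeling_demo_missing", ["MissingModel"]),
    }
)
```

The same guard appears twice in the diff because the lazy `_import_structure` table and the `TYPE_CHECKING` imports have to stay in sync.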
12 changes: 12 additions & 0 deletions src/transformers/models/swiftformer/configuration_swiftformer.py
@@ -42,6 +42,8 @@ class SwiftFormerConfig(PretrainedConfig):
     Args:
+        image_size (`int`, *optional*, defaults to 224):
+            The size (resolution) of each image
         num_channels (`int`, *optional*, defaults to 3):
             The number of input channels
         depths (`List[int]`, *optional*, defaults to `[3, 3, 6, 4]`):
@@ -62,6 +64,10 @@ class SwiftFormerConfig(PretrainedConfig):
             Padding in downsampling layers.
         drop_path_rate (`float`, *optional*, defaults to 0.0):
             Rate at which to increase dropout probability in DropPath.
+        drop_mlp_rate (`float`, *optional*, defaults to 0.0):
+            Dropout rate for the MLP component of SwiftFormer.
+        drop_conv_encoder_rate (`float`, *optional*, defaults to 0.0):
+            Dropout rate for the ConvEncoder component of SwiftFormer.
         use_layer_scale (`bool`, *optional*, defaults to `True`):
             Whether to scale outputs from token mixers.
         layer_scale_init_value (`float`, *optional*, defaults to 1e-05):
@@ -89,6 +95,7 @@ class SwiftFormerConfig(PretrainedConfig):

     def __init__(
         self,
+        image_size=224,
         num_channels=3,
         depths=[3, 3, 6, 4],
         embed_dims=[48, 56, 112, 220],
@@ -99,12 +106,15 @@ def __init__(
         down_stride=2,
         down_pad=1,
         drop_path_rate=0.0,
+        drop_mlp_rate=0.0,
+        drop_conv_encoder_rate=0.0,
         use_layer_scale=True,
         layer_scale_init_value=1e-5,
         batch_norm_eps=1e-5,
         **kwargs,
     ):
         super().__init__(**kwargs)
+        self.image_size = image_size
         self.num_channels = num_channels
         self.depths = depths
         self.embed_dims = embed_dims
@@ -115,6 +125,8 @@ def __init__(
         self.down_stride = down_stride
         self.down_pad = down_pad
         self.drop_path_rate = drop_path_rate
+        self.drop_mlp_rate = drop_mlp_rate
+        self.drop_conv_encoder_rate = drop_conv_encoder_rate
         self.use_layer_scale = use_layer_scale
         self.layer_scale_init_value = layer_scale_init_value
         self.batch_norm_eps = batch_norm_eps
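
Note: the new config fields replace values that were previously hard-coded in the PyTorch modeling file. A small sketch of which rate feeds which layer, using a hypothetical dataclass stand-in (`SwiftFormerConfigSketch`) rather than the real `SwiftFormerConfig`, so that it runs without transformers installed. The component labels in `dropout_rates` are descriptive strings taken from the diff, not an API.

```python
from dataclasses import dataclass


@dataclass
class SwiftFormerConfigSketch:
    """Hypothetical stand-in holding only the fields touched by this commit."""

    image_size: int = 224
    drop_path_rate: float = 0.0
    drop_mlp_rate: float = 0.0
    drop_conv_encoder_rate: float = 0.0


def dropout_rates(config):
    """Report which config field each dropout-like layer reads after this commit."""
    return {
        "SwiftFormerMlp.drop": config.drop_mlp_rate,
        "ConvEncoder.drop_path": config.drop_conv_encoder_rate,
        "SwiftFormerDropPath.drop_prob": config.drop_path_rate,
    }


default_rates = dropout_rates(SwiftFormerConfigSketch())
custom_rates = dropout_rates(SwiftFormerConfigSketch(drop_mlp_rate=0.1))
```

With the defaults, every rate is 0.0, so the added fields leave the default behaviour unchanged.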
27 changes: 9 additions & 18 deletions src/transformers/models/swiftformer/modeling_swiftformer.py
@@ -103,13 +103,12 @@ def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = Fals
     return output


-# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->Swiftformer
 class SwiftFormerDropPath(nn.Module):
     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""

-    def __init__(self, drop_prob: Optional[float] = None) -> None:
+    def __init__(self, config: SwiftFormerConfig) -> None:
         super().__init__()
-        self.drop_prob = drop_prob
+        self.drop_prob = config.drop_path_rate

     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         return drop_path(hidden_states, self.drop_prob, self.training)
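
Note: the body of `drop_path` is not part of this diff. As a reminder of what "Stochastic Depth per sample" usually means, here is a minimal sketch of the standard technique on plain Python lists. `drop_path_sketch` and its `rng` parameter are hypothetical, and the real function operates on `torch.Tensor` inputs.

```python
import random


def drop_path_sketch(batch, drop_prob=0.0, training=False, rng=None):
    """Per-sample stochastic depth on a batch of flat feature lists."""
    if drop_prob == 0.0 or not training:
        return batch  # Identity at inference time or when disabled.
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    output = []
    for sample in batch:
        # One Bernoulli draw per sample: keep the whole branch or zero it out.
        keep = rng.random() < keep_prob
        # Kept samples are rescaled so the expected value matches the input.
        scale = 1.0 / keep_prob if keep else 0.0
        output.append([value * scale for value in sample])
    return output


batch = [[1.0, 2.0], [3.0, 4.0]]
eval_output = drop_path_sketch(batch, drop_prob=0.5, training=False)
train_output = drop_path_sketch(batch, drop_prob=0.5, training=True, rng=random.Random(0))
```

Each sample in `train_output` is either all zeros or the input scaled by `1 / keep_prob`; which one depends on the random draw.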
@@ -169,7 +168,7 @@ def __init__(self, config: SwiftFormerConfig, dim: int):
         self.point_wise_conv1 = nn.Conv2d(dim, hidden_dim, kernel_size=1)
         self.act = nn.GELU()
         self.point_wise_conv2 = nn.Conv2d(hidden_dim, dim, kernel_size=1)
-        self.drop_path = nn.Identity()
+        self.drop_path = nn.Dropout(p=config.drop_conv_encoder_rate)
         self.layer_scale = nn.Parameter(torch.ones(dim).unsqueeze(-1).unsqueeze(-1), requires_grad=True)

     def forward(self, x):
@@ -200,7 +199,7 @@ def __init__(self, config: SwiftFormerConfig, in_features: int):
         act_layer = ACT2CLS[config.hidden_act]
         self.act = act_layer()
         self.fc2 = nn.Conv2d(hidden_features, in_features, 1)
-        self.drop = nn.Dropout(p=0.0)
+        self.drop = nn.Dropout(p=config.drop_mlp_rate)

     def forward(self, x):
         x = self.norm1(x)
@@ -302,7 +301,7 @@ def __init__(self, config: SwiftFormerConfig, dim: int, drop_path: float = 0.0)
         self.local_representation = SwiftFormerLocalRepresentation(config, dim=dim)
         self.attn = SwiftFormerEfficientAdditiveAttention(config, dim=dim)
         self.linear = SwiftFormerMlp(config, in_features=dim)
-        self.drop_path = SwiftFormerDropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+        self.drop_path = SwiftFormerDropPath(config) if drop_path > 0.0 else nn.Identity()
         self.use_layer_scale = use_layer_scale
         if use_layer_scale:
             self.layer_scale_1 = nn.Parameter(
@@ -315,21 +314,13 @@ def __init__(self, config: SwiftFormerConfig, dim: int, drop_path: float = 0.0)
     def forward(self, x):
         x = self.local_representation(x)
         batch_size, channels, height, width = x.shape
+        res = self.attn(x.permute(0, 2, 3, 1).reshape(batch_size, height * width, channels))
+        res = res.reshape(batch_size, height, width, channels).permute(0, 3, 1, 2)
         if self.use_layer_scale:
-            x = x + self.drop_path(
-                self.layer_scale_1
-                * self.attn(x.permute(0, 2, 3, 1).reshape(batch_size, height * width, channels))
-                .reshape(batch_size, height, width, channels)
-                .permute(0, 3, 1, 2)
-            )
+            x = x + self.drop_path(self.layer_scale_1 * res)
             x = x + self.drop_path(self.layer_scale_2 * self.linear(x))
-
         else:
-            x = x + self.drop_path(
-                self.attn(x.permute(0, 2, 3, 1).reshape(batch_size, height * width, channels))
-                .reshape(batch_size, height, width, channels)
-                .permute(0, 3, 1, 2)
-            )
+            x = x + self.drop_path(res)
             x = x + self.drop_path(self.linear(x))
         return x
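
Note: the refactored `forward` computes the attention branch once as `res`. To do that it converts the `(batch, channels, height, width)` feature map into a `(batch, height * width, channels)` token sequence and back. A sketch of that layout round trip on nested Python lists, assuming row-major flattening as `reshape` does. `bchw_to_tokens` and `tokens_to_bchw` are hypothetical helpers, and the real code uses `torch.Tensor.permute` and `reshape`.

```python
def bchw_to_tokens(x):
    """Mimic x.permute(0, 2, 3, 1).reshape(batch, height * width, channels)."""
    tokens = []
    for sample in x:
        channels, height, width = len(sample), len(sample[0]), len(sample[0][0])
        # Token index is h * width + w; each token holds one value per channel.
        tokens.append(
            [[sample[c][h][w] for c in range(channels)] for h in range(height) for w in range(width)]
        )
    return tokens


def tokens_to_bchw(tokens, height, width):
    """Mimic t.reshape(batch, height, width, channels).permute(0, 3, 1, 2)."""
    out = []
    for sample in tokens:
        channels = len(sample[0])
        out.append(
            [[[sample[h * width + w][c] for w in range(width)] for h in range(height)] for c in range(channels)]
        )
    return out


# One image, 2 channels, 2x3 spatial grid.
x = [[[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]]]
tokens = bchw_to_tokens(x)
restored = tokens_to_bchw(tokens, height=2, width=3)
```

Because the two conversions are exact inverses, computing `res` before the `if self.use_layer_scale` branch gives the same values as the old inline expressions in both branches.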
