The EfficientFormer model was proposed in EfficientFormer: Vision Transformers at MobileNet Speed by Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. EfficientFormer proposes a dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object detection and semantic segmentation.
The abstract from the paper is the following:
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which { runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1),} and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
This model was contributed by novice03 and Bearnardd. The original code can be found here. The TensorFlow version of this model was added by D-Roberts.
( depths: List = [3, 2, 6, 4] hidden_sizes: List = [48, 96, 224, 448] downsamples: List = [True, True, True, True] dim: int = 448 key_dim: int = 32 attention_ratio: int = 4 resolution: int = 7 num_hidden_layers: int = 5 num_attention_heads: int = 8 mlp_expansion_ratio: int = 4 hidden_dropout_prob: float = 0.0 patch_size: int = 16 num_channels: int = 3 pool_size: int = 3 downsample_patch_size: int = 3 downsample_stride: int = 2 downsample_pad: int = 1 drop_path_rate: float = 0.0 num_meta3d_blocks: int = 1 distillation: bool = True use_layer_scale: bool = True layer_scale_init_value: float = 1e-05 hidden_act: str = 'gelu' initializer_range: float = 0.02 layer_norm_eps: float = 1e-12 image_size: int = 224 batch_norm_eps: float = 1e-05 **kwargs )
Parameters
List(int)
, optional, defaults to [3, 2, 6, 4]
) —
Depth of each stage. List(int)
, optional, defaults to [48, 96, 224, 448]
) —
Dimensionality of each stage. List(bool)
, optional, defaults to [True, True, True, True]
) —
Whether or not to downsample inputs between two stages. int
, optional, defaults to 448) —
Number of channels in Meta3D layers int
, optional, defaults to 32) —
The size of the key in meta3D block. int
, optional, defaults to 4) —
Ratio of the dimension of the query and value to the dimension of the key in MSHA block int
, optional, defaults to 7) —
Size of each patch int
, optional, defaults to 5) —
Number of hidden layers in the Transformer encoder. int
, optional, defaults to 8) —
Number of attention heads for each attention layer in the 3D MetaBlock. int
, optional, defaults to 4) —
Ratio of size of the hidden dimensionality of an MLP to the dimensionality of its input. float
, optional, defaults to 0.1) —
The dropout probability for all fully connected layers in the embeddings and encoder. int
, optional, defaults to 16) —
The size (resolution) of each patch. int
, optional, defaults to 3) —
The number of input channels. int
, optional, defaults to 3) —
Kernel size of pooling layers. int
, optional, defaults to 3) —
The size of patches in downsampling layers. int
, optional, defaults to 2) —
The stride of convolution kernels in downsampling layers. int
, optional, defaults to 1) —
Padding in downsampling layers. int
, optional, defaults to 0) —
Rate at which to increase dropout probability in DropPath. int
, optional, defaults to 1) —
The number of 3D MetaBlocks in the last stage. bool
, optional, defaults to True
) —
Whether to add a distillation head. bool
, optional, defaults to True
) —
Whether to scale outputs from token mixers. float
, optional, defaults to 1e-5) —
Factor by which outputs from token mixers are scaled. str
or function
, optional, defaults to "gelu"
) —
The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu"
,
"relu"
, "selu"
and "gelu_new"
are supported. float
, optional, defaults to 0.02) —
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. float
, optional, defaults to 1e-12) —
The epsilon used by the layer normalization layers. int
, optional, defaults to 224
) —
The size (resolution) of each image. This is the configuration class to store the configuration of an EfficientFormerModel. It is used to instantiate an EfficientFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the EfficientFormer snap-research/efficientformer-l1 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import EfficientFormerConfig, EfficientFormerModel
>>> # Initializing a EfficientFormer efficientformer-l1 style configuration
>>> configuration = EfficientFormerConfig()
>>> # Initializing a EfficientFormerModel (with random weights) from the efficientformer-l3 style configuration
>>> model = EfficientFormerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
( do_resize: bool = True size: Optional = None resample: Resampling = <Resampling.BICUBIC: 3> do_center_crop: bool = True do_rescale: bool = True rescale_factor: Union = 0.00392156862745098 crop_size: Dict = None do_normalize: bool = True image_mean: Union = None image_std: Union = None **kwargs )
Parameters
bool
, optional, defaults to True
) —
Whether to resize the image’s (height, width) dimensions to the specified (size["height"], size["width"])
. Can be overridden by the do_resize
parameter in the preprocess
method. dict
, optional, defaults to {"height" -- 224, "width": 224}
):
Size of the output image after resizing. Can be overridden by the size
parameter in the preprocess
method. PILImageResampling
, optional, defaults to PILImageResampling.BILINEAR
) —
Resampling filter to use if resizing the image. Can be overridden by the resample
parameter in the
preprocess
method. bool
, optional, defaults to True
) —
Whether to center crop the image to the specified crop_size
. Can be overridden by do_center_crop
in the
preprocess
method. Dict[str, int]
optional, defaults to 224) —
Size of the output image after applying center_crop
. Can be overridden by crop_size
in the preprocess
method. bool
, optional, defaults to True
) —
Whether to rescale the image by the specified scale rescale_factor
. Can be overridden by the do_rescale
parameter in the preprocess
method. int
or float
, optional, defaults to 1/255
) —
Scale factor to use if rescaling the image. Can be overridden by the rescale_factor
parameter in the
preprocess
method.
do_normalize —
Whether to normalize the image. Can be overridden by the do_normalize
parameter in the preprocess
method. float
or List[float]
, optional, defaults to IMAGENET_STANDARD_MEAN
) —
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the image_mean
parameter in the preprocess
method. float
or List[float]
, optional, defaults to IMAGENET_STANDARD_STD
) —
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the image_std
parameter in the preprocess
method. Constructs a EfficientFormer image processor.
( images: Union do_resize: Optional = None size: Dict = None resample: Resampling = None do_center_crop: bool = None crop_size: int = None do_rescale: Optional = None rescale_factor: Optional = None do_normalize: Optional = None image_mean: Union = None image_std: Union = None return_tensors: Union = None data_format: Union = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None **kwargs )
Parameters
ImageInput
) —
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set do_rescale=False
. bool
, optional, defaults to self.do_resize
) —
Whether to resize the image. Dict[str, int]
, optional, defaults to self.size
) —
Dictionary in the format {"height": h, "width": w}
specifying the size of the output image after
resizing. PILImageResampling
filter, optional, defaults to self.resample
) —
PILImageResampling
filter to use if resizing the image e.g. PILImageResampling.BILINEAR
. Only has
an effect if do_resize
is set to True
. bool
, optional, defaults to self.do_center_crop
) —
Whether to center crop the image. bool
, optional, defaults to self.do_rescale
) —
Whether to rescale the image values between [0 - 1]. float
, optional, defaults to self.rescale_factor
) —
Rescale factor to rescale the image by if do_rescale
is set to True
. Dict[str, int]
, optional, defaults to self.crop_size
) —
Size of the center crop. Only has an effect if do_center_crop
is set to True
. bool
, optional, defaults to self.do_normalize
) —
Whether to normalize the image. float
or List[float]
, optional, defaults to self.image_mean
) —
Image mean to use if do_normalize
is set to True
. float
or List[float]
, optional, defaults to self.image_std
) —
Image standard deviation to use if do_normalize
is set to True
. str
or TensorType
, optional) —
The type of tensors to return. Can be one of:np.ndarray
.TensorType.TENSORFLOW
or 'tf'
: Return a batch of type tf.Tensor
.TensorType.PYTORCH
or 'pt'
: Return a batch of type torch.Tensor
.TensorType.NUMPY
or 'np'
: Return a batch of type np.ndarray
.TensorType.JAX
or 'jax'
: Return a batch of type jax.numpy.ndarray
.ChannelDimension
or str
, optional, defaults to ChannelDimension.FIRST
) —
The channel dimension format for the output image. Can be one of:"channels_first"
or ChannelDimension.FIRST
: image in (num_channels, height, width) format."channels_last"
or ChannelDimension.LAST
: image in (height, width, num_channels) format.ChannelDimension
or str
, optional) —
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:"channels_first"
or ChannelDimension.FIRST
: image in (num_channels, height, width) format."channels_last"
or ChannelDimension.LAST
: image in (height, width, num_channels) format."none"
or ChannelDimension.NONE
: image in (height, width) format.Preprocess an image or batch of images.
( config: EfficientFormerConfig )
Parameters
The bare EfficientFormer Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Pixel values can be obtained using ViTImageProcessor. See
ViTImageProcessor.preprocess() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (EfficientFormerConfig) and inputs.
last_hidden_state (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor
of shape (batch_size, hidden_size)
) — Last layer hidden-state of the first token of the sequence (classification token) after further processing
through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
the classification token after processing through a linear layer and a tanh activation function. The linear
layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The EfficientFormerModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, EfficientFormerModel
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = EfficientFormerModel.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 49, 448]
( config: EfficientFormerConfig )
Parameters
EfficientFormer Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.
This model is a PyTorch nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( pixel_values: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
Parameters
torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Pixel values can be obtained using ViTImageProcessor. See
ViTImageProcessor.preprocess() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. torch.LongTensor
of shape (batch_size,)
, optional) —
Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If
config.num_labels > 1
a classification loss is computed (Cross-Entropy). Returns
transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.ImageClassifierOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (EfficientFormerConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each stage) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states
(also called feature maps) of the model at the output of each stage.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The EfficientFormerForImageClassification forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, EfficientFormerForImageClassification
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = EfficientFormerForImageClassification.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat
( config: EfficientFormerConfig )
Parameters
EfficientFormer Model transformer with image classification heads on top (a linear layer on top of the final hidden state of the [CLS] token and a linear layer on top of the final hidden state of the distillation token) e.g. for ImageNet.
This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet supported.
This model is a PyTorch nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.efficientformer.modeling_efficientformer.EfficientFormerForImageClassificationWithTeacherOutput
or tuple(torch.FloatTensor)
Parameters
torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Pixel values can be obtained using ViTImageProcessor. See
ViTImageProcessor.preprocess() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. Returns
transformers.models.efficientformer.modeling_efficientformer.EfficientFormerForImageClassificationWithTeacherOutput
or tuple(torch.FloatTensor)
A transformers.models.efficientformer.modeling_efficientformer.EfficientFormerForImageClassificationWithTeacherOutput
or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (EfficientFormerConfig) and inputs.
torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Prediction scores as the average of the cls_logits and distillation logits.torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the
class token).torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the
distillation token).tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of
shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer
plus the initial embedding outputs.tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.The EfficientFormerForImageClassificationWithTeacher forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, EfficientFormerForImageClassificationWithTeacher
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = EfficientFormerForImageClassificationWithTeacher.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat
( config: EfficientFormerConfig **kwargs )
Parameters
The bare EfficientFormer Model transformer outputting raw hidden-states without any specific head on top. This model is a TensorFlow keras.layers.Layer. Use it as a regular TensorFlow Module and refer to the TensorFlow documentation for all matter related to general usage and behavior.
( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
Parameters
tf.Tensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Pixel values can be obtained using AutoImageProcessor. See
EfficientFormerImageProcessor.call() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. Returns
transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor
(if
return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the
configuration (EfficientFormerConfig) and inputs.
last_hidden_state (tf.Tensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor
of shape (batch_size, hidden_size)
) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a
Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence
prediction (classification) objective during pretraining.
This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape
(batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFEfficientFormerModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, TFEfficientFormerModel
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = TFEfficientFormerModel.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 49, 448]
( config: EfficientFormerConfig )
Parameters
EfficientFormer Model transformer with an image classification head on top of pooled last hidden state, e.g. for ImageNet.
This model is a TensorFlow keras.layers.Layer. Use it as a regular TensorFlow Module and refer to the TensorFlow documentation for all matter related to general usage and behavior.
( pixel_values: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None training: bool = False ) → transformers.modeling_tf_outputs.TFImageClassifierOutput
or tuple(tf.Tensor)
Parameters
tf.Tensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Pixel values can be obtained using AutoImageProcessor. See
EfficientFormerImageProcessor.call() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. tf.Tensor
of shape (batch_size,)
, optional) —
Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If
config.num_labels > 1
a classification loss is computed (Cross-Entropy). Returns
transformers.modeling_tf_outputs.TFImageClassifierOutput
or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFImageClassifierOutput
or a tuple of tf.Tensor
(if
return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the
configuration (EfficientFormerConfig) and inputs.
loss (tf.Tensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings, if the model has an embedding layer, + one for
the output of each stage) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states (also called
feature maps) of the model at the output of each stage.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFEfficientFormerForImageClassification forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, TFEfficientFormerForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = TFEfficientFormerForImageClassification.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
LABEL_281
( config: EfficientFormerConfig )
Parameters
EfficientFormer Model transformer with image classification heads on top (a linear layer on top of the final hidden state and a linear layer on top of the final hidden state of the distillation token) e.g. for ImageNet.
.. warning:: This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet supported.
This model is a TensorFlow keras.layers.Layer. Use it as a regular TensorFlow Module and refer to the TensorFlow documentation for all matter related to general usage and behavior.
( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None training: bool = False ) → transformers.models.efficientformer.modeling_tf_efficientformer.TFEfficientFormerForImageClassificationWithTeacherOutput
or tuple(tf.Tensor)
Parameters
tf.Tensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Pixel values can be obtained using AutoImageProcessor. See
EfficientFormerImageProcessor.call() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. Returns
transformers.models.efficientformer.modeling_tf_efficientformer.TFEfficientFormerForImageClassificationWithTeacherOutput
or tuple(tf.Tensor)
A transformers.models.efficientformer.modeling_tf_efficientformer.TFEfficientFormerForImageClassificationWithTeacherOutput
or a tuple of tf.Tensor
(if
return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the
configuration (EfficientFormerConfig) and inputs.
The TFEfficientFormerForImageClassificationWithTeacher forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
tf.Tensor
of shape (batch_size, config.num_labels)
) — Prediction scores as the average of the cls_logits and distillation logits.
cls_logits (tf.Tensor
of shape (batch_size, config.num_labels)
) — Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the
class token).
distillation_logits (tf.Tensor
of shape (batch_size, config.num_labels)
) — Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the
distillation token).
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when
config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape
(batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus
the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when
config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.Example:
>>> from transformers import AutoImageProcessor, TFEfficientFormerForImageClassificationWithTeacher
>>> import tensorflow as tf
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
>>> model = TFEfficientFormerForImageClassificationWithTeacher.from_pretrained("snap-research/efficientformer-l1-300")
>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
LABEL_281