The ConvNeXT model was proposed in A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell and Saining Xie. ConvNeXT is a pure convolutional model (ConvNet) that is inspired by the design of Vision Transformers and claims to outperform them.
The abstract from the paper is the following:
The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
ConvNeXT architecture. Taken from the original paper.

This model was contributed by nielsr. The TensorFlow version of the model was contributed by ariG23498, gante, and sayakpaul (equal contribution). The original code can be found here.
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ConvNeXT.
If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
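As a quick way to try ConvNeXT before diving into the classes below, the image-classification pipeline wraps preprocessing, the model, and label mapping in one call. This is an illustrative sketch; the checkpoint name and the COCO image URL are the same ones used in the examples further down this page.

>>> from transformers import pipeline

>>> # Illustrative sketch: the pipeline handles image preprocessing and label mapping for us
>>> classifier = pipeline("image-classification", model="facebook/convnext-tiny-224")
>>> predictions = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
>>> # each entry in `predictions` is a dict with "label" and "score" keys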
( num_channels = 3 patch_size = 4 num_stages = 4 hidden_sizes = None depths = None hidden_act = 'gelu' initializer_range = 0.02 layer_norm_eps = 1e-12 layer_scale_init_value = 1e-06 drop_path_rate = 0.0 image_size = 224 out_features = None out_indices = None **kwargs )
Parameters
- num_channels (int, optional, defaults to 3) — The number of input channels.
- patch_size (int, optional, defaults to 4) — Patch size to use in the patch embedding layer.
- num_stages (int, optional, defaults to 4) — The number of stages in the model.
- hidden_sizes (List[int], optional, defaults to [96, 192, 384, 768]) — Dimensionality (hidden size) at each stage.
- depths (List[int], optional, defaults to [3, 3, 9, 3]) — Depth (number of blocks) for each stage.
- hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in each block. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
- layer_scale_init_value (float, optional, defaults to 1e-6) — The initial value for the layer scale.
- drop_path_rate (float, optional, defaults to 0.0) — The drop rate for stochastic depth.
- out_features (List[str], optional) — If used as backbone, list of features to output. Can be any of "stem", "stage1", "stage2", etc. (depending on how many stages the model has). If unset and out_indices is set, will default to the corresponding stages. If unset and out_indices is unset, will default to the last stage. Must be in the same order as defined in the stage_names attribute.
- out_indices (List[int], optional) — If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how many stages the model has). If unset and out_features is set, will default to the corresponding stages. If unset and out_features is unset, will default to the last stage. Must be in the same order as defined in the stage_names attribute.

This is the configuration class to store the configuration of a ConvNextModel. It is used to instantiate a ConvNeXT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ConvNeXT facebook/convnext-tiny-224 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import ConvNextConfig, ConvNextModel
>>> # Initializing a ConvNext convnext-tiny-224 style configuration
>>> configuration = ConvNextConfig()
>>> # Initializing a model (with random weights) from the convnext-tiny-224 style configuration
>>> model = ConvNextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
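The hidden_sizes and depths arguments documented above set the width and depth of the four stages. As a hedged sketch (the values below are made up for illustration and do not correspond to a released checkpoint), a smaller variant could be configured like this:

>>> from transformers import ConvNextConfig, ConvNextModel

>>> # Hypothetical smaller variant: narrower stages and fewer blocks than convnext-tiny
>>> custom_config = ConvNextConfig(hidden_sizes=[48, 96, 192, 384], depths=[2, 2, 6, 2])
>>> custom_model = ConvNextModel(custom_config)  # randomly initialized with this architecture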
( do_resize: bool = True size: Dict = None crop_pct: float = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: Union = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None **kwargs )
Parameters
- do_resize (bool, optional, defaults to True) — Controls whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
- size (Dict[str, int], optional, defaults to {"shortest_edge": 384}) — Resolution of the output image after resize is applied. If size["shortest_edge"] >= 384, the image is resized to (size["shortest_edge"], size["shortest_edge"]). Otherwise, the smaller edge of the image will be matched to int(size["shortest_edge"] / crop_pct), after which the image is cropped to (size["shortest_edge"], size["shortest_edge"]). Only has an effect if do_resize is set to True. Can be overridden by size in the preprocess method.
- crop_pct (float, optional, defaults to 224 / 256) — Percentage of the image to crop. Only has an effect if do_resize is True and size < 384. Can be overridden by crop_pct in the preprocess method.
- resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
- rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
- image_mean (float or List[float], optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats, the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
- image_std (float or List[float], optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats, the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

Constructs a ConvNeXT image processor.
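To make the resize and crop behaviour concrete: with size={"shortest_edge": 224} and the default crop_pct = 224/256, the shorter edge is first matched to int(224 / (224 / 256)) = 256 and the result is then center-cropped to (224, 224); with size={"shortest_edge": 384} or larger, the image is resized straight to a square of that size. A minimal sketch (the size value here is an assumption chosen for illustration):

>>> from transformers import ConvNextImageProcessor

>>> # shortest_edge < 384: resize the shorter side to 256, then crop to 224x224
>>> image_processor = ConvNextImageProcessor(size={"shortest_edge": 224}, crop_pct=224 / 256)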
( images: Union do_resize: bool = None size: Dict = None crop_pct: float = None resample: Resampling = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: Union = None image_std: Union = None return_tensors: Union = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None **kwargs )
Parameters
- images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
- size (Dict[str, int], optional, defaults to self.size) — Size of the output image after resize has been applied. If size["shortest_edge"] >= 384, the image is resized to (size["shortest_edge"], size["shortest_edge"]). Otherwise, the smaller edge of the image will be matched to int(size["shortest_edge"] / crop_pct), after which the image is cropped to (size["shortest_edge"], size["shortest_edge"]). Only has an effect if do_resize is set to True.
- crop_pct (float, optional, defaults to self.crop_pct) — Percentage of the image to crop if size < 384.
- resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the PILImageResampling filters. Only has an effect if do_resize is set to True.
- do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to [0, 1].
- rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
- image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean.
- image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation.
- return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
  - Unset: Return a list of np.ndarray.
  - TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  - TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  - TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  - TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
- data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or batch of images.
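For reference, a small sketch of calling the processor on an image (the dummy PIL image below is only a stand-in for real data, and the 224x224 output shape assumes the facebook/convnext-tiny-224 processor settings):

>>> import numpy as np
>>> from PIL import Image
>>> from transformers import ConvNextImageProcessor

>>> image_processor = ConvNextImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> dummy_image = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))
>>> inputs = image_processor(images=dummy_image, return_tensors="pt")
>>> # inputs["pixel_values"] is expected to be a (1, 3, 224, 224) tensor in channels-first format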
( config )
Parameters

- config (ConvNextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare ConvNext model outputting raw features without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( pixel_values: FloatTensor = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention
or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ConvNextImageProcessor.call() for details.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns
transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention
or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention
or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (ConvNextConfig) and inputs.
last_hidden_state (torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor
of shape (batch_size, hidden_size)
) — Last layer hidden-state after a pooling operation on the spatial dimensions.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, num_channels, height, width)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
The ConvNextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, ConvNextModel
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> model = ConvNextModel.from_pretrained("facebook/convnext-tiny-224")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 768, 7, 7]
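The hidden_states output documented above exposes the per-stage feature maps, which is what dense-prediction heads typically consume. A brief sketch continuing the example above (output_hidden_states is a standard argument of the forward method shown here):

>>> with torch.no_grad():
...     outputs = model(**inputs, output_hidden_states=True)
>>> # a tuple containing the stem output plus one feature map per stage,
>>> # each of shape (batch_size, num_channels, height, width)
>>> feature_maps = outputs.hidden_states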
( config )
Parameters

- config (ConvNextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ConvNext Model with an image classification head on top (a linear layer on top of the pooled features), e.g. for ImageNet.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( pixel_values: FloatTensor = None labels: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ConvNextImageProcessor.call() for details.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns
transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)
A transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (ConvNextConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the model at the output of each stage.

The ConvNextForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, ConvNextForImageClassification
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> model = ConvNextForImageClassification.from_pretrained("facebook/convnext-tiny-224")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
tabby, tabby cat
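To look beyond the single top prediction, the logits from the example above can be converted to probabilities and ranked. A short illustrative sketch continuing that example:

>>> probabilities = logits.softmax(-1)
>>> top5 = torch.topk(probabilities, k=5, dim=-1)
>>> # pair each of the five highest-scoring class indices with its label string
>>> top5_labels = [model.config.id2label[idx.item()] for idx in top5.indices[0]]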
( config *inputs add_pooling_layer = True **kwargs )
Parameters

- config (ConvNextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare ConvNext model outputting raw features without any specific head on top. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:

- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models
and layers. Because of this support, when using methods like model.fit()
things should “just work” for you - just
pass your inputs and labels in any format that model.fit()
supports! If, however, you want to use the second
format outside of Keras methods like fit()
and predict()
, such as when creating your own layers or models with
the Keras Functional
API, there are three possibilities you can use to gather all the input Tensors in the first
positional argument:
- a single Tensor with pixel_values only and nothing else: model(pixel_values)
- a list of varying length with one or several input Tensors in the order given in the docstring: model([pixel_values, attention_mask]) or model([pixel_values, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
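A minimal sketch of the call styles described above, using the same checkpoint as the example below; the random tensor is only a stand-in for real preprocessed pixel values:

>>> import tensorflow as tf
>>> from transformers import TFConvNextModel

>>> model = TFConvNextModel.from_pretrained("facebook/convnext-tiny-224")
>>> pixel_values = tf.random.uniform((1, 3, 224, 224))  # dummy channels-first batch
>>> outputs_keyword = model(pixel_values=pixel_values)  # keyword-argument style
>>> outputs_dict = model({"pixel_values": pixel_values})  # dict in the first positional argument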
( pixel_values: TFModelInputType | None = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
Parameters
- pixel_values (np.ndarray, tf.Tensor, List[tf.Tensor], Dict[str, tf.Tensor] or Dict[str, np.ndarray], each example of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ConvNextImageProcessor.call() for details.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the config will be used instead.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode; in graph mode the value will always be set to True.

Returns
transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor
(if
return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the
configuration (ConvNextConfig) and inputs.
last_hidden_state (tf.Tensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor
of shape (batch_size, hidden_size)
) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a
Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence
prediction (classification) objective during pretraining.
This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape
(batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFConvNextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, TFConvNextModel
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> model = TFConvNextModel.from_pretrained("facebook/convnext-tiny-224")
>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
( config: ConvNextConfig *inputs **kwargs )
Parameters

- config (ConvNextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ConvNext Model with an image classification head on top (a linear layer on top of the pooled features), e.g. for ImageNet.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:

- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models
and layers. Because of this support, when using methods like model.fit()
things should “just work” for you - just
pass your inputs and labels in any format that model.fit()
supports! If, however, you want to use the second
format outside of Keras methods like fit()
and predict()
, such as when creating your own layers or models with
the Keras Functional
API, there are three possibilities you can use to gather all the input Tensors in the first
positional argument:
- a single Tensor with pixel_values only and nothing else: model(pixel_values)
- a list of varying length with one or several input Tensors in the order given in the docstring: model([pixel_values, attention_mask]) or model([pixel_values, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
( pixel_values: TFModelInputType | None = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None labels: np.ndarray | tf.Tensor | None = None training: Optional[bool] = False ) → transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
Parameters
- pixel_values (np.ndarray, tf.Tensor, List[tf.Tensor], Dict[str, tf.Tensor] or Dict[str, np.ndarray], each example of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ConvNextImageProcessor.call() for details.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the config will be used instead.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode; in graph mode the value will always be set to True.
- labels (tf.Tensor or np.ndarray of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns
transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor
(if
return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the
configuration (ConvNextConfig) and inputs.
loss (tf.Tensor
of shape (batch_size, )
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape
(batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFConvNextForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, TFConvNextForImageClassification
>>> import tensorflow as tf
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> model = TFConvNextForImageClassification.from_pretrained("facebook/convnext-tiny-224")
>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = tf.math.argmax(logits, axis=-1)[0]
>>> print("Predicted class:", model.config.id2label[int(predicted_class_idx)])