The ViLT model was proposed in ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Wonjae Kim, Bokyung Son, Ildoo Kim. ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for Vision-and-Language Pre-training (VLP).
The abstract from the paper is the following:
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance.
This model was contributed by nielsr. The original code can be found here.
ViLT takes both pixel_values and input_ids as input. One can use ViltProcessor to prepare data for the model. This processor wraps an image processor (for the image modality) and a tokenizer (for the language modality) into one.
To batch images of different sizes, the model uses a pixel_mask that indicates which pixel values are real and which are padding. ViltProcessor automatically creates this for you.
( vocab_size = 30522 type_vocab_size = 2 modality_type_vocab_size = 2 max_position_embeddings = 40 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 image_size = 384 patch_size = 32 num_channels = 3 qkv_bias = True max_image_length = -1 tie_word_embeddings = False num_images = -1 **kwargs )
Parameters
vocab_size (int, optional, defaults to 30522) — Vocabulary size of the text part of the model. Defines the number of different tokens that can be represented by the input_ids passed when calling ViltModel.
type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling ViltModel. This is used when encoding text.
modality_type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the modalities passed when calling ViltModel. This is used after concatenating the embeddings of the text and image modalities.
max_position_embeddings (int, optional, defaults to 40) — The maximum sequence length that this model might ever be used with.
hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
image_size (int, optional, defaults to 384) — The size (resolution) of each image.
patch_size (int, optional, defaults to 32) — The size (resolution) of each patch.
num_channels (int, optional, defaults to 3) — The number of input channels.
qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
max_image_length (int, optional, defaults to -1) — The maximum number of patches to take as input for the Transformer encoder. If set to a positive integer, the encoder will sample at most max_image_length patches. If set to -1, it is not taken into account.
num_images (int, optional, defaults to -1) — The number of images to use for natural language visual reasoning. If set to a positive integer, it is used by ViltForImagesAndTextClassification to define the classifier head.
This is the configuration class to store the configuration of a ViltModel. It is used to instantiate a ViLT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ViLT dandelin/vilt-b32-mlm architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import ViltModel, ViltConfig
>>> # Initializing a ViLT dandelin/vilt-b32-mlm style configuration
>>> configuration = ViltConfig()
>>> # Initializing a model from the dandelin/vilt-b32-mlm style configuration
>>> model = ViltModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
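As a rough illustration of how image_size, patch_size and max_image_length relate, the sketch below counts the non-overlapping patches in a square image and applies the optional cap. The helper names patches_per_image and capped_patches are hypothetical and not part of the library. Real inputs are often not square, so treat this as arithmetic on the configuration defaults only.

```python
def patches_per_image(image_size: int = 384, patch_size: int = 32) -> int:
    """Count non-overlapping square patches in a square image (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def capped_patches(num_patches: int, max_image_length: int = -1) -> int:
    """Apply the max_image_length cap; -1 means the cap is ignored."""
    if max_image_length > 0:
        return min(num_patches, max_image_length)
    return num_patches


print(patches_per_image())                                         # 144
print(capped_patches(patches_per_image(), max_image_length=100))   # 100
print(capped_patches(patches_per_image(), max_image_length=-1))    # 144
```

With the defaults (384 and 32) each side holds 12 patches, so a square image yields 144 patch positions.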
Preprocess an image or a batch of images.
( do_resize: bool = True size: Dict = None size_divisor: int = 32 resample: Resampling = <Resampling.BICUBIC: 3> do_rescale: bool = True rescale_factor: Union = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None do_pad: bool = True **kwargs )
Parameters
do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (Dict[str, int], optional, defaults to {"shortest_edge": 384}) — Resize the shorter side of the input to size["shortest_edge"]. The longer side will be limited to under int((1333 / 800) * size["shortest_edge"]) while preserving the aspect ratio. Only has an effect if do_resize is set to True. Can be overridden by the size parameter in the preprocess method.
size_divisor (int, optional, defaults to 32) — The value by which both the height and the width must be divisible. Only has an effect if do_resize is set to True. Can be overridden by the size_divisor parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BICUBIC) — Resampling filter to use if resizing the image. Only has an effect if do_resize is set to True. Can be overridden by the resample parameter in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Only has an effect if do_rescale is set to True. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True) — Whether to pad the image to the (max_height, max_width) of the images in the batch. Can be overridden by the do_pad parameter in the preprocess method.
Constructs a ViLT image processor.
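The resize rule described for size and size_divisor can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: the function name vilt_style_output_size is hypothetical, and the rounding (round half up, then floor to a multiple of size_divisor) is an assumption about how the documented rule might be applied.

```python
def vilt_style_output_size(height, width, shortest_edge=384, size_divisor=32):
    """Return a (height, width) target size following the documented resize rule.

    Hypothetical helper: rounding details are assumptions, not the library's API.
    """
    # Upper bound on the longer side, as given in the parameter description.
    longer_limit = int((1333 / 800) * shortest_edge)

    # Scale so that the shorter side equals shortest_edge, keeping aspect ratio.
    scale = shortest_edge / min(height, width)
    new_h, new_w = height * scale, width * scale

    # If the longer side is now too large, scale down again.
    if max(new_h, new_w) > longer_limit:
        rescale = longer_limit / max(new_h, new_w)
        new_h, new_w = new_h * rescale, new_w * rescale

    # Round to integers, then floor each side to a multiple of size_divisor.
    new_h, new_w = int(new_h + 0.5), int(new_w + 0.5)
    new_h = new_h // size_divisor * size_divisor
    new_w = new_w // size_divisor * size_divisor
    return new_h, new_w


print(vilt_style_output_size(480, 640))  # (384, 512)
print(vilt_style_output_size(100, 400))  # (160, 608)
```

In the second call the long side would reach 1536 pixels after the first scaling step, so it is reduced to the 639-pixel limit and then floored to 608, the nearest lower multiple of 32.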
( images: Union do_resize: Optional = None size: Optional = None size_divisor: Optional = None resample: Resampling = None do_rescale: Optional = None rescale_factor: Optional = None do_normalize: Optional = None image_mean: Union = None image_std: Union = None do_pad: Optional = None return_tensors: Union = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None **kwargs )
Parameters
images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Controls the size of the image after resize. The shortest edge of the image is resized to size["shortest_edge"] whilst preserving the aspect ratio. If the longest edge of this resized image is > int(size["shortest_edge"] * (1333 / 800)), then the image is resized again to make the longest edge equal to int(size["shortest_edge"] * (1333 / 800)).
size_divisor (int, optional, defaults to self.size_divisor) — The image is resized to a size that is a multiple of this value.
resample (PILImageResampling, optional, defaults to self.resample) — Resampling filter to use if resizing the image. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to normalize the image by if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to normalize the image by if do_normalize is set to True.
do_pad (bool, optional, defaults to self.do_pad) — Whether to pad the image to the (max_height, max_width) in the batch. If True, a pixel mask is also created and returned.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
    Unset: Return a list of np.ndarray.
    TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
    TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
    TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
    TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
    ChannelDimension.FIRST: image in (num_channels, height, width) format.
    ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    "none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
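To make the do_pad behaviour concrete, here is a small pure-Python sketch of padding a batch to (max_height, max_width) and building the matching pixel mask. It uses single-channel images stored as nested lists for brevity. The function name pad_batch is hypothetical, and padding on the bottom and right with zeros is an assumption made for illustration.

```python
def pad_batch(images, pad_value=0):
    """Pad 2-D images (lists of rows) to a common size and build pixel masks.

    Hypothetical helper: pads on the bottom and right (an assumption).
    In each mask, 1 marks a real pixel and 0 marks padding.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)

    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Extend every existing row to max_w, then append fully padded rows.
        rows = [list(row) + [pad_value] * (max_w - w) for row in img]
        rows += [[pad_value] * max_w for _ in range(max_h - h)]
        mask = [
            [1 if (r < h and c < w) else 0 for c in range(max_w)]
            for r in range(max_h)
        ]
        padded.append(rows)
        masks.append(mask)
    return padded, masks


batch = [
    [[1, 2], [3, 4]],  # 2 x 2 image
    [[5, 6, 7]],       # 1 x 3 image
]
pixel_values, pixel_mask = pad_batch(batch)
print(pixel_values[1])  # [[5, 6, 7], [0, 0, 0]]
print(pixel_mask[1])    # [[1, 1, 1], [0, 0, 0]]
```

The mask lets downstream attention ignore positions that exist only because a smaller image was padded up to the batch size.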
( image_processor = None tokenizer = None **kwargs )
Parameters
image_processor (ViltImageProcessor, optional) — An instance of ViltImageProcessor. The image processor is a required input.
tokenizer (BertTokenizerFast, optional) — An instance of BertTokenizerFast. The tokenizer is a required input.
Constructs a ViLT processor which wraps a BERT tokenizer and ViLT image processor into a single processor.
ViltProcessor offers all the functionalities of ViltImageProcessor and BertTokenizerFast. See the docstring of __call__() and decode() for more information.
( images text: Union = None add_special_tokens: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 pad_to_multiple_of: Optional = None return_token_type_ids: Optional = None return_attention_mask: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True return_tensors: Union = None **kwargs )
This method uses the ViltImageProcessor.__call__() method to prepare image(s) for the model, and BertTokenizerFast.__call__() to prepare text for the model.
Please refer to the docstring of the above two methods for more information.
( config add_pooling_layer = True )
Parameters
The bare ViLT Model transformer outputting raw hidden-states without any specific head on top.
This model is a PyTorch torch.nn.Module (https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None pixel_values: Optional = None pixel_mask: Optional = None head_mask: Optional = None inputs_embeds: Optional = None image_embeds: Optional = None image_token_type_idx: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViltImageProcessor.__call__() for details.
pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
image_embeds (torch.FloatTensor of shape (batch_size, num_patches, hidden_size), optional) — Optionally, instead of passing pixel_values, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert pixel_values into patch embeddings.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (ViltConfig) and inputs.
last_hidden_state (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor
of shape (batch_size, hidden_size)
) — Last layer hidden-state of the first token of the sequence (classification token) after further processing
through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
the classification token after processing through a linear layer and a tanh activation function. The linear
layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ViltModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Examples:
>>> from transformers import ViltProcessor, ViltModel
>>> from PIL import Image
>>> import requests
>>> # prepare image and text
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text = "hello world"
>>> processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
>>> model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")
>>> inputs = processor(image, text, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
( config )
Parameters
ViLT Model with a language modeling head on top as done during pretraining.
This model is a PyTorch torch.nn.Module (https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None pixel_values: Optional = None pixel_mask: Optional = None head_mask: Optional = None inputs_embeds: Optional = None image_embeds: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViltImageProcessor.__call__() for details.
pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
image_embeds (torch.FloatTensor of shape (batch_size, num_patches, hidden_size), optional) — Optionally, instead of passing pixel_values, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert pixel_values into patch embeddings.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.MaskedLMOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (ViltConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Masked language modeling (MLM) loss.
logits (torch.FloatTensor
of shape (batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ViltForMaskedLM forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Examples:
>>> from transformers import ViltProcessor, ViltForMaskedLM
>>> import requests
>>> from PIL import Image
>>> import re
>>> import torch
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text = "a bunch of [MASK] laying on a [MASK]."
>>> processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
>>> model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm")
>>> # prepare inputs
>>> encoding = processor(image, text, return_tensors="pt")
>>> # forward pass
>>> outputs = model(**encoding)
>>> tl = len(re.findall(r"\[MASK\]", text))
>>> inferred_token = [text]
>>> # gradually fill in the MASK tokens, one by one
>>> with torch.no_grad():
...     for i in range(tl):
...         encoded = processor.tokenizer(inferred_token)
...         input_ids = torch.tensor(encoded.input_ids)
...         encoded = encoded["input_ids"][0][1:-1]
...         outputs = model(input_ids=input_ids, pixel_values=encoding.pixel_values)
...         mlm_logits = outputs.logits[0]  # shape (seq_len, vocab_size)
...         # only take into account text features (minus CLS and SEP token)
...         mlm_logits = mlm_logits[1 : input_ids.shape[1] - 1, :]
...         mlm_values, mlm_ids = mlm_logits.softmax(dim=-1).max(dim=-1)
...         # zero out every position that is not a [MASK] token (id 103)
...         mlm_values[torch.tensor(encoded) != 103] = 0
...         select = mlm_values.argmax().item()
...         encoded[select] = mlm_ids[select].item()
...         inferred_token = [processor.decode(encoded)]
>>> encoded = processor.tokenizer(inferred_token)
>>> output = processor.decode(encoded.input_ids[0], skip_special_tokens=True)
>>> print(output)
a bunch of cats laying on a couch.
( config )
Parameters
ViLT Model transformer with a classifier head on top (a linear layer on top of the final hidden state of the [CLS] token) for visual question answering, e.g. for VQAv2.
This model is a PyTorch torch.nn.Module (https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None pixel_values: Optional = None pixel_mask: Optional = None head_mask: Optional = None inputs_embeds: Optional = None image_embeds: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViltImageProcessor.__call__() for details.
pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
image_embeds (torch.FloatTensor of shape (batch_size, num_patches, hidden_size), optional) — Optionally, instead of passing pixel_values, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert pixel_values into patch embeddings.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.FloatTensor of shape (batch_size, num_labels), optional) — Labels for computing the visual question answering loss. This tensor must be either a one-hot encoding of all answers that are applicable for a given example in the batch, or a soft encoding indicating which answers are applicable, where 1.0 is the highest score.
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (ViltConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ViltForQuestionAnswering forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Examples:
>>> from transformers import ViltProcessor, ViltForQuestionAnswering
>>> import requests
>>> from PIL import Image
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text = "How many cats are there?"
>>> processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
>>> model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
>>> # prepare inputs
>>> encoding = processor(image, text, return_tensors="pt")
>>> # forward pass
>>> outputs = model(**encoding)
>>> logits = outputs.logits
>>> idx = logits.argmax(-1).item()
>>> print("Predicted answer:", model.config.id2label[idx])
Predicted answer: 2
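The labels argument described for visual question answering expects one row of scores per example, with one entry per answer class. As a hedged sketch of how such a soft encoding might be assembled, the hypothetical helper below turns a mapping of answers to scores into a plain list. The label2id mapping and the scores are made-up illustration data, and the result would still need converting to a torch.FloatTensor of shape (batch_size, num_labels) before being passed to the model.

```python
def vqa_soft_labels(answer_scores, label2id, num_labels):
    """Build one soft-encoded label row (hypothetical helper).

    answer_scores maps answer strings to scores in [0, 1], where 1.0 is the
    highest score. Answers missing from label2id are skipped.
    """
    target = [0.0] * num_labels
    for answer, score in answer_scores.items():
        idx = label2id.get(answer)
        if idx is None:
            continue  # answer is outside the model's label set
        # Clip to 1.0 and keep the best score seen for this class.
        target[idx] = max(target[idx], min(1.0, float(score)))
    return target


# Made-up label set and annotator scores, for illustration only.
label2id = {"1": 0, "2": 1, "3": 2, "cat": 3}
scores = {"2": 1.0, "3": 0.3, "two cats": 0.6}

print(vqa_soft_labels(scores, label2id, num_labels=4))  # [0.0, 1.0, 0.3, 0.0]
```

A one-hot encoding is the special case where exactly one answer receives a score of 1.0 and every other entry stays at 0.0.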
( config )
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
pixel_values (torch.FloatTensor of shape (batch_size, num_images, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViltImageProcessor.__call__() for details.
pixel_mask (torch.LongTensor of shape (batch_size, num_images, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]: 1 for pixels that are real (i.e. not masked), 0 for pixels that are padding (i.e. masked).
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
image_embeds (torch.FloatTensor of shape (batch_size, num_images, num_patches, hidden_size), optional) — Optionally, instead of passing pixel_values, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert pixel_values into patch embeddings.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
ViLT Model transformer with a classifier head on top for natural language visual reasoning, e.g. NLVR2.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None pixel_values: Optional = None pixel_mask: Optional = None head_mask: Optional = None inputs_embeds: Optional = None image_embeds: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.vilt.modeling_vilt.ViltForImagesAndTextClassificationOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1].
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1].
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViltImageProcessor.__call__() for details.
pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values are selected in [0, 1]. What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1].
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
image_embeds (torch.FloatTensor of shape (batch_size, num_patches, hidden_size), optional) — Optionally, instead of passing pixel_values, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert pixel_values into patch embeddings.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Binary classification labels.
Returns
transformers.models.vilt.modeling_vilt.ViltForImagesAndTextClassificationOutput or tuple(torch.FloatTensor)
A transformers.models.vilt.modeling_vilt.ViltForImagesAndTextClassificationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViltConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (List[tuple(torch.FloatTensor)], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — List of tuples of torch.FloatTensor (one for each image-text pair, each tuple containing the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (List[tuple(torch.FloatTensor)], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — List of tuples of torch.FloatTensor (one for each image-text pair, each tuple containing the attention weights of shape (batch_size, num_heads, sequence_length, sequence_length)). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ViltForImagesAndTextClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import ViltProcessor, ViltForImagesAndTextClassification
>>> import requests
>>> from PIL import Image
>>> image1 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_0.jpg", stream=True).raw)
>>> image2 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_1.jpg", stream=True).raw)
>>> text = "The left image contains twice the number of dogs as the right image."
>>> processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-nlvr2")
>>> model = ViltForImagesAndTextClassification.from_pretrained("dandelin/vilt-b32-finetuned-nlvr2")
>>> # prepare inputs
>>> encoding = processor([image1, image2], text, return_tensors="pt")
>>> # forward pass
>>> outputs = model(input_ids=encoding.input_ids, pixel_values=encoding.pixel_values.unsqueeze(0))
>>> logits = outputs.logits
>>> idx = logits.argmax(-1).item()
>>> print("Predicted answer:", model.config.id2label[idx])
Predicted answer: True
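The `unsqueeze(0)` call in the example above adds the leading batch dimension, so that `pixel_values` matches the documented shape `(batch_size, num_images, num_channels, height, width)`. The following is a minimal sketch of that shape bookkeeping in plain Python. The helper `add_batch_dim` is hypothetical, and the 384 x 384 image size is an assumption for illustration only, since the real height and width depend on how the processor resizes the inputs.

```python
def add_batch_dim(shape):
    """Mimic tensor.unsqueeze(0) on a shape tuple: prepend an axis of size 1."""
    return (1, *shape)


# Assumed shape of two stacked RGB images: (num_images, num_channels, height, width).
stacked_images_shape = (2, 3, 384, 384)

model_input_shape = add_batch_dim(stacked_images_shape)
batch_size, num_images, num_channels, height, width = model_input_shape
print(model_input_shape)
```

Here one batch entry holds both images of the pair, which is why `num_images` is 2 while `batch_size` is 1.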
( config )
Parameters
Vilt Model transformer with a classifier head on top (a linear layer on top of the final hidden state of the [CLS] token) for image-to-text or text-to-image retrieval, e.g. MSCOCO and F30K.
This model is a PyTorch torch.nn.Module (https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None pixel_values: Optional = None pixel_mask: Optional = None head_mask: Optional = None inputs_embeds: Optional = None image_embeds: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1].
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1].
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViltImageProcessor.__call__() for details.
pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values are selected in [0, 1]. What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1].
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
image_embeds (torch.FloatTensor of shape (batch_size, num_patches, hidden_size), optional) — Optionally, instead of passing pixel_values, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert pixel_values into patch embeddings.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels are currently not supported.
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViltConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ViltForImageAndTextRetrieval forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import ViltProcessor, ViltForImageAndTextRetrieval
>>> import requests
>>> from PIL import Image
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
>>> processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
>>> model = ViltForImageAndTextRetrieval.from_pretrained("dandelin/vilt-b32-finetuned-coco")
>>> # forward pass
>>> scores = dict()
>>> for text in texts:
...     # prepare inputs
...     encoding = processor(image, text, return_tensors="pt")
...     outputs = model(**encoding)
...     scores[text] = outputs.logits[0, :].item()
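The loop above fills `scores` but stops short of using it. One plausible next step is to rank the candidate texts, assuming that a higher logit means a better image-text match. The following is a minimal sketch: the helper `rank_texts` and the numeric values are hypothetical stand-ins for the real model scores.

```python
def rank_texts(scores):
    """Return (text, score) pairs sorted from best to worst match."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores, standing in for the `scores` dict built in the loop above.
example_scores = {
    "A football player scoring a goal": -3.7,
    "An image of two cats chilling on a couch": 4.2,
}

ranking = rank_texts(example_scores)
best_text, best_score = ranking[0]
print(best_text)
```

The same helper applies to image retrieval for a fixed text if the dictionary is keyed by image identifiers instead.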
( config )
Parameters
ViLT Model with a token classification head on top (a linear layer on top of the final hidden-states of the text tokens) e.g. for Named-Entity-Recognition (NER) tasks.
This model is a PyTorch torch.nn.Module (https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None pixel_values: Optional = None pixel_mask: Optional = None head_mask: Optional = None inputs_embeds: Optional = None image_embeds: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1].
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1].
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViltImageProcessor.__call__() for details.
pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values are selected in [0, 1]. What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1].
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
image_embeds (torch.FloatTensor of shape (batch_size, num_patches, hidden_size), optional) — Optionally, instead of passing pixel_values, you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert pixel_values into patch embeddings.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, text_sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1].
Returns
transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViltConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ViltForTokenClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
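Since the token classification logits have shape `(batch_size, sequence_length, config.num_labels)`, a per-token prediction can be read off by taking the highest-scoring label at each position. The following is a minimal sketch in plain Python for a single sequence. The helper `decode_token_labels`, the logits, and the label map are all hypothetical; with a real model the inputs would presumably be `outputs.logits[0].tolist()` and `model.config.id2label`.

```python
def decode_token_labels(token_logits, id2label):
    """Pick the highest-scoring label for each token position."""
    labels = []
    for scores in token_logits:
        best = max(range(len(scores)), key=lambda i: scores[i])
        labels.append(id2label[best])
    return labels


# Hypothetical logits for one sequence: 3 token positions, 3 labels each.
example_token_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 3.1, 0.0],
    [0.0, 0.5, 1.9],
]
example_token_id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

predicted = decode_token_labels(example_token_logits, example_token_id2label)
print(predicted)
```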