The LayoutLMv3 model was proposed in LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3 simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).
The abstract from the paper is the following:
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.
LayoutLMv3 architecture. Taken from the original paper.
This model was contributed by nielsr. The TensorFlow version of this model was added by chriskoo, tokec, and lre. The original code can be found here.
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
LayoutLMv3 is nearly identical to LayoutLMv2, so we’ve also included LayoutLMv2 resources you can adapt for LayoutLMv3 tasks. For these notebooks, take care to use LayoutLMv2Processor instead when preparing data for the model!
Document question answering
( vocab_size = 50265 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-05 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 max_2d_position_embeddings = 1024 coordinate_size = 128 shape_size = 128 has_relative_attention_bias = True rel_pos_bins = 32 max_rel_pos = 128 rel_2d_pos_bins = 64 max_rel_2d_pos = 256 has_spatial_attention_bias = True text_embed = True visual_embed = True input_size = 224 num_channels = 3 patch_size = 16 classifier_dropout = None **kwargs )
Parameters
vocab_size (int, optional, defaults to 50265) — Vocabulary size of the LayoutLMv3 model. Defines the number of different tokens that can be represented by the input_ids passed when calling LayoutLMv3Model.
hidden_size (int, optional, defaults to 768) — Dimension of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling LayoutLMv3Model.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.
max_2d_position_embeddings (int, optional, defaults to 1024) — The maximum value that the 2D position embedding might ever be used with. Typically set this to something large just in case (e.g., 1024).
coordinate_size (int, optional, defaults to 128) — Dimension of the coordinate embeddings.
shape_size (int, optional, defaults to 128) — Dimension of the width and height embeddings.
has_relative_attention_bias (bool, optional, defaults to True) — Whether or not to use a relative attention bias in the self-attention mechanism.
rel_pos_bins (int, optional, defaults to 32) — The number of relative position bins to be used in the self-attention mechanism.
max_rel_pos (int, optional, defaults to 128) — The maximum number of relative positions to be used in the self-attention mechanism.
max_rel_2d_pos (int, optional, defaults to 256) — The maximum number of relative 2D positions in the self-attention mechanism.
rel_2d_pos_bins (int, optional, defaults to 64) — The number of 2D relative position bins in the self-attention mechanism.
has_spatial_attention_bias (bool, optional, defaults to True) — Whether or not to use a spatial attention bias in the self-attention mechanism.
visual_embed (bool, optional, defaults to True) — Whether or not to add patch embeddings.
input_size (int, optional, defaults to 224) — The size (resolution) of the images.
num_channels (int, optional, defaults to 3) — The number of channels of the images.
patch_size (int, optional, defaults to 16) — The size (resolution) of the patches.
classifier_dropout (float, optional) — The dropout ratio for the classification head.
This is the configuration class to store the configuration of a LayoutLMv3Model. It is used to instantiate a LayoutLMv3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LayoutLMv3 microsoft/layoutlmv3-base architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import LayoutLMv3Config, LayoutLMv3Model
>>> # Initializing a LayoutLMv3 microsoft/layoutlmv3-base style configuration
>>> configuration = LayoutLMv3Config()
>>> # Initializing a model (with random weights) from the microsoft/layoutlmv3-base style configuration
>>> model = LayoutLMv3Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
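To make the image-related defaults more concrete, the following sketch shows how input_size and patch_size together determine how many patch embeddings an image yields. The helper name num_patches is hypothetical and not part of the library; the arithmetic assumes a square image split into non-overlapping square patches, as in ViT.

```python
def num_patches(input_size: int = 224, patch_size: int = 16) -> int:
    """Count non-overlapping square patches in a square image.

    Hypothetical helper for illustration only; not a transformers API.
    """
    if input_size % patch_size != 0:
        raise ValueError("input_size must be divisible by patch_size")
    per_side = input_size // patch_size  # patches along one edge
    return per_side * per_side


# With the configuration defaults: 224 / 16 = 14 patches per side.
print(num_patches())  # 196
```

Under these assumptions, halving patch_size quadruples the number of patches, which is worth keeping in mind when changing either value.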
Preprocess an image or a batch of images.
( do_resize: bool = True size: Dict = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_value: float = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None apply_ocr: bool = True ocr_lang: Optional = None tesseract_config: Optional = '' **kwargs )
Parameters
do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to (size["height"], size["width"]). Can be overridden by do_resize in preprocess.
size (Dict[str, int], optional, defaults to {"height": 224, "width": 224}) — Size of the image after resizing. Can be overridden by size in preprocess.
resample (PILImageResampling, optional, defaults to PILImageResampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by resample in preprocess.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image’s pixel values by the specified rescale_value. Can be overridden by do_rescale in preprocess.
rescale_value (float, optional, defaults to 1 / 255) — Value by which the image’s pixel values are rescaled. Can be overridden by rescale_factor in preprocess.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (Iterable[float] or float, optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (Iterable[float] or float, optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
apply_ocr (bool, optional, defaults to True) — Whether to apply the Tesseract OCR engine to get words + normalized bounding boxes. Can be overridden by the apply_ocr parameter in the preprocess method.
ocr_lang (str, optional) — The language, specified by its ISO code, to be used by the Tesseract OCR engine. By default, English is used. Can be overridden by the ocr_lang parameter in the preprocess method.
tesseract_config (str, optional) — Any additional custom configuration flags that are forwarded to the config parameter when calling Tesseract. For example: '--psm 6'. Can be overridden by the tesseract_config parameter in the preprocess method.
Constructs a LayoutLMv3 image processor.
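The rescaling and normalization steps described above can be sketched in plain Python for a single RGB pixel. This is an illustration of the arithmetic only, not the library's implementation. The function name is hypothetical, and the per-channel mean and standard deviation of 0.5 are assumed values chosen for the example; check the actual constants in your installed version.

```python
from typing import List, Sequence

# Assumed values for illustration; verify against the library constants.
ASSUMED_MEAN = (0.5, 0.5, 0.5)
ASSUMED_STD = (0.5, 0.5, 0.5)


def rescale_and_normalize(
    pixel: Sequence[int],
    rescale_value: float = 1 / 255,
    mean: Sequence[float] = ASSUMED_MEAN,
    std: Sequence[float] = ASSUMED_STD,
) -> List[float]:
    """Rescale 0-255 channel values, then normalize each channel."""
    rescaled = [channel * rescale_value for channel in pixel]
    return [(value - m) / s for value, m, s in zip(rescaled, mean, std)]


# A fully saturated channel maps to roughly 1.0, an empty one to -1.0.
print(rescale_and_normalize([255, 0, 255]))
```

With these assumed constants the output range is roughly [-1, 1]. Images that are already scaled to [0, 1] should skip the rescaling step, which is what do_rescale=False is for.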
( images: Union do_resize: bool = None size: Dict = None resample = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: Union = None image_std: Union = None apply_ocr: bool = None ocr_lang: Optional = None tesseract_config: Optional = None return_tensors: Union = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None **kwargs )
Parameters
images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Desired size of the output image after applying resize.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the PILImageResampling filters. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image pixel values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to apply to the image pixel values. Only has an effect if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or Iterable[float], optional, defaults to self.image_mean) — Mean values to be used for normalization. Only has an effect if do_normalize is set to True.
image_std (float or Iterable[float], optional, defaults to self.image_std) — Standard deviation values to be used for normalization. Only has an effect if do_normalize is set to True.
apply_ocr (bool, optional, defaults to self.apply_ocr) — Whether to apply the Tesseract OCR engine to get words + normalized bounding boxes.
ocr_lang (str, optional, defaults to self.ocr_lang) — The language, specified by its ISO code, to be used by the Tesseract OCR engine. By default, English is used.
tesseract_config (str, optional, defaults to self.tesseract_config) — Any additional custom configuration flags that are forwarded to the config parameter when calling Tesseract.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
  Unset: Return a list of np.ndarray.
  TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
  ChannelDimension.FIRST: image in (num_channels, height, width) format.
  ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  "none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
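The difference between the channels-first and channels-last layouts mentioned above can be illustrated with nested lists. This is a minimal sketch with a hypothetical helper name; real pipelines would typically do the same transposition on arrays.

```python
from typing import List

Image = List[List[List[int]]]


def channels_last_to_first(image: Image) -> Image:
    """Convert (height, width, num_channels) to (num_channels, height, width)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 1x2 RGB image: one row holding a red pixel and a green pixel.
hwc = [[[255, 0, 0], [0, 255, 0]]]
chw = channels_last_to_first(hwc)
print(len(chw), len(chw[0]), len(chw[0][0]))  # 3 1 2
```

The pixel values are unchanged; only the axis order differs, which is why the input format usually has to be stated or inferred before any per-channel operation.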
( vocab_file merges_file errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )
Parameters
vocab_file (str) — Path to the vocabulary file.
merges_file (str) — Path to the merges file.
errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.
eos_token (str, optional, defaults to "</s>") — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The RoBERTa tokenizer detects the beginning of words by the preceding space.)
cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.
Construct a LayoutLMv3 tokenizer. Based on RobertaTokenizer (Byte Pair Encoding or BPE).
LayoutLMv3Tokenizer can be used to turn words, word-level bounding boxes and optional word labels into token-level input_ids, attention_mask, token_type_ids, bbox, and optional labels (for token classification).
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
LayoutLMv3Tokenizer runs end-to-end tokenization: byte-level Byte-Pair Encoding. It also turns the word-level bounding boxes into token-level bounding boxes.
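The step from word-level to token-level boxes and labels can be pictured with a small sketch. It is not the tokenizer's actual code: expand_to_tokens and toy_split are hypothetical names, the toy splitter stands in for real subword tokenization, and special tokens are left out. It assumes that every subword inherits its word's box and that, when only_label_first_subword is enabled, later subwords receive the padding label.

```python
from typing import Callable, List, Tuple

Box = List[int]


def expand_to_tokens(
    words: List[str],
    boxes: List[Box],
    word_labels: List[int],
    split: Callable[[str], List[str]],
    only_label_first_subword: bool = True,
    pad_token_label: int = -100,
) -> Tuple[List[str], List[Box], List[int]]:
    """Repeat each word's box per subword and spread its label."""
    tokens: List[str] = []
    token_boxes: List[Box] = []
    labels: List[int] = []
    for word, box, label in zip(words, boxes, word_labels):
        for i, piece in enumerate(split(word)):
            tokens.append(piece)
            token_boxes.append(box)  # every subword shares the word's box
            if i == 0 or not only_label_first_subword:
                labels.append(label)
            else:
                labels.append(pad_token_label)  # ignored by the loss
    return tokens, token_boxes, labels


def toy_split(word: str) -> List[str]:
    """Stand-in for a subword tokenizer: chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


tokens, token_boxes, labels = expand_to_tokens(
    ["Invoice", "42"], [[10, 10, 200, 40], [210, 10, 260, 40]], [1, 2], toy_split
)
print(tokens)  # ['Inv', 'oic', 'e', '42']
print(labels)  # [1, -100, -100, 2]
```

Labelling only the first subword keeps one prediction per word in the loss, which matches how word-level annotations are usually scored.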
( text: Union text_pair: Union = None boxes: Union = None word_labels: Union = None add_special_tokens: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 pad_to_multiple_of: Optional = None return_tensors: Union = None return_token_type_ids: Optional = None return_attention_mask: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )
Parameters
text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of lists of strings (batch of words).
text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
  True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
  True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
  'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  False or 'do_not_truncate' (default): No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
return_tensors (str or TensorType, optional) — If set, will return tensors instead of a list of Python integers. Acceptable values are:
  'tf': Return TensorFlow tf.constant objects.
  'pt': Return PyTorch torch.Tensor objects.
  'np': Return Numpy np.ndarray objects.
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
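Because the tokenizer expects word-level boxes on a 0-1000 scale, pixel coordinates from an OCR engine usually need to be normalized first. The sketch below shows one common way to do that. The function name normalize_box is hypothetical, and it assumes boxes given as (x0, y0, x1, y1) in pixels relative to the top-left corner of the page image.

```python
from typing import List


def normalize_box(box: List[int], width: int, height: int) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# A box on a 600x800 pixel page image.
print(normalize_box([50, 100, 300, 200], width=600, height=800))
# [83, 125, 500, 250]
```

Normalizing by the page size makes the coordinates independent of image resolution, so the same layout yields the same boxes whether the page was scanned at low or high resolution.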
( vocab_file = None merges_file = None tokenizer_file = None errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True trim_offsets = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )
Parameters
vocab_file (str) — Path to the vocabulary file.
merges_file (str) — Path to the merges file.
errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.
eos_token (str, optional, defaults to "</s>") — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The RoBERTa tokenizer detects the beginning of words by the preceding space.)
trim_offsets (bool, optional, defaults to True) — Whether the post-processing step should trim offsets to avoid including whitespace.
cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.
Construct a “fast” LayoutLMv3 tokenizer (backed by HuggingFace’s tokenizers library). Based on BPE.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
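The padding, truncation, stride, and pad_to_multiple_of arguments accepted by the tokenizer call can be pictured with a simplified sketch. The two helpers below are hypothetical and ignore special tokens, bounding boxes, and sequence pairs; they only illustrate overlapping windows and right-side padding under those assumptions, with a pad id of 1 taken from the configuration default shown earlier.

```python
from typing import List, Optional, Tuple


def truncate_with_overflow(
    ids: List[int], max_length: int, stride: int = 0
) -> List[List[int]]:
    """Split ids into windows of at most max_length, overlapping by stride."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    windows = []
    start = 0
    while True:
        windows.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break
        start += max_length - stride  # step back by stride for overlap
    return windows


def pad_to_length(
    ids: List[int],
    length: int,
    pad_token_id: int = 1,
    pad_to_multiple_of: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Right-pad ids and build the matching attention mask."""
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        length = -(-length // pad_to_multiple_of) * pad_to_multiple_of
    missing = max(length - len(ids), 0)
    return ids + [pad_token_id] * missing, [1] * len(ids) + [0] * missing


print(truncate_with_overflow(list(range(10)), max_length=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
print(pad_to_length([5, 6, 7], length=3, pad_to_multiple_of=8))
# ([5, 6, 7, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0, 0])
```

In the first call each window repeats the last token of the previous one, which is the overlap that stride controls. In the second, the target length is rounded up from 3 to 8 before padding.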
( text: Union text_pair: Union = None boxes: Union = None word_labels: Union = None add_special_tokens: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 pad_to_multiple_of: Optional = None return_tensors: Union = None return_token_type_ids: Optional = None return_attention_mask: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )
Parameters
text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of lists of strings (batch of words).
text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
  True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
  True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
  'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  False or 'do_not_truncate' (default): No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
return_tensors (str or TensorType, optional) — If set, will return tensors instead of a list of Python integers. Acceptable values are:
  'tf': Return TensorFlow tf.constant objects.
  'pt': Return PyTorch torch.Tensor objects.
  'np': Return Numpy np.ndarray objects.
return_tensors (str or TensorType, optional) —
If set, will return tensors instead of list of python integers. Acceptable values are:
'tf'
: Return TensorFlow tf.constant
objects.'pt'
: Return PyTorch torch.Tensor
objects.'np'
: Return Numpy np.ndarray
objects.Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
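The interplay of max_length, stride and pad_to_multiple_of described above can be illustrated with a small pure-Python sketch. It is a simplification, not the tokenizer's implementation: special tokens and sequence pairs are ignored, and the helper names split_with_stride and padded_length are hypothetical.

```python
def split_with_stride(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `stride` tokens, mimicking the overlap between
    a truncated sequence and its overflowing tokens (simplified sketch).
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows


def padded_length(length, multiple):
    """Round `length` up to the next multiple of `multiple`."""
    return ((length + multiple - 1) // multiple) * multiple


windows = split_with_stride(list(range(10)), max_length=4, stride=2)
target = padded_length(13, 8)
```

Here each window repeats the last two tokens of the previous one, and a sequence of 13 tokens would be padded up to 16 when the multiple is 8.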
( image_processor = None tokenizer = None **kwargs )
Parameters
image_processor (LayoutLMv3ImageProcessor, optional) — An instance of LayoutLMv3ImageProcessor. The image processor is a required input.
tokenizer (LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast, optional) — An instance of LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast. The tokenizer is a required input.
Constructs a LayoutLMv3 processor which combines a LayoutLMv3 image processor and a LayoutLMv3 tokenizer into a single processor.
LayoutLMv3Processor offers all the functionalities you need to prepare data for the model.
It first uses LayoutLMv3ImageProcessor to resize and normalize document images, and optionally applies OCR to get words and normalized bounding boxes. These are then provided to LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast, which turns the words and bounding boxes into token-level input_ids, attention_mask, token_type_ids, bbox. Optionally, one can provide integer word_labels, which are turned into token-level labels for token classification tasks (such as FUNSD, CORD).
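The conversion from word-level word_labels to token-level labels can be sketched as follows. This is a minimal illustration under stated assumptions: only the first sub-token of each word keeps the word label, the remaining sub-tokens receive -100 (the index PyTorch's cross-entropy loss ignores by default), and toy_subword_split is a hypothetical stand-in for a real subword tokenizer.

```python
IGNORE_INDEX = -100  # assumed "ignore" label for non-first sub-tokens


def toy_subword_split(word, piece_len=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def expand_word_labels(words, word_labels, split=toy_subword_split):
    """Turn one label per word into one label per sub-token."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = split(word)
        if not pieces:  # skip empty words so tokens and labels stay aligned
            continue
        tokens.extend(pieces)
        # The first piece carries the word label, the others are ignored.
        labels.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, labels


tokens, labels = expand_word_labels(["Invoice", "No", "12345"], [1, 1, 2])
```

With this toy splitter, "Invoice" becomes three pieces of which only the first keeps label 1, so the loss is computed once per word rather than once per sub-token.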
( images text: Union = None text_pair: Union = None boxes: Union = None word_labels: Union = None add_special_tokens: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 pad_to_multiple_of: Optional = None return_token_type_ids: Optional = None return_attention_mask: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True return_tensors: Union = None **kwargs )
This method first forwards the images argument to the __call__() method of LayoutLMv3ImageProcessor. In case LayoutLMv3ImageProcessor was initialized with apply_ocr set to True, it passes the obtained words and bounding boxes along with the additional arguments to the __call__() method of the tokenizer and returns the output, together with resized and normalized pixel_values. In case LayoutLMv3ImageProcessor was initialized with apply_ocr set to False, it passes the words (text/text_pair) and boxes specified by the user along with the additional arguments to the __call__() method of the tokenizer and returns the output, together with resized and normalized pixel_values.
Please refer to the docstring of the above two methods for more information.
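When apply_ocr is set to False, the boxes supplied by the user are expected to be normalized already. A minimal sketch of such a normalization is shown below, assuming the 0 to 1000 scale commonly used with the LayoutLM family and pixel coordinates in (x0, y0, x1, y1) order; the helper name normalize_box is hypothetical.

```python
def normalize_box(box, width, height, scale=1000):
    """Rescale a pixel-space (x0, y0, x1, y1) box to a 0..scale range.

    `width` and `height` are the dimensions of the original page image.
    """
    x0, y0, x1, y1 = box
    return [
        int(scale * x0 / width),
        int(scale * y0 / height),
        int(scale * x1 / width),
        int(scale * y1 / height),
    ]


# A word box on a 500 x 1000 pixel page (illustrative values).
normalized = normalize_box((50, 100, 250, 140), width=500, height=1000)
```

The resulting values stay within the range accepted by the bbox input as long as the box lies inside the page.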
( config )
Parameters
The bare LayoutLMv3 Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None bbox: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, token_sequence_length)) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor of shape (batch_size, token_sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals ((height / config.patch_size) * (width / config.patch_size)).
attention_mask (torch.FloatTensor of shape (batch_size, token_sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
token_type_ids (torch.LongTensor of shape (batch_size, token_sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
position_ids (torch.LongTensor of shape (batch_size, token_sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, token_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3Model forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModel
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, words, boxes=boxes, return_tensors="pt")
>>> outputs = model(**encoding)
>>> last_hidden_states = outputs.last_hidden_state
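The relation sequence_length = token_sequence_length + patch_sequence_length + 1 from the parameter descriptions can be checked with a few lines of arithmetic. The sketch below is only illustrative: the image size 224, patch size 16 and token length 512 are assumed example values, and the helper names are hypothetical.

```python
def patch_sequence_length(height, width, patch_size):
    """Number of image patches: (height / patch_size) * (width / patch_size)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def total_sequence_length(token_sequence_length, height, width, patch_size):
    """Text tokens + image patches + 1 for the [CLS] token."""
    return token_sequence_length + patch_sequence_length(height, width, patch_size) + 1


num_patches = patch_sequence_length(224, 224, 16)   # 14 * 14 patches
total = total_sequence_length(512, 224, 224, 16)    # 512 + 196 + 1
```

Under these assumptions, the second dimension of last_hidden_state would be 709 rather than the 512 text tokens alone.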
( config )
Parameters
LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for document image classification tasks such as the RVL-CDIP dataset.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None bbox: Optional = None pixel_values: Optional = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals ((height / config.patch_size) * (width / config.patch_size)).
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModelForSequenceClassification
>>> from datasets import load_dataset
>>> import torch
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, words, boxes=boxes, return_tensors="pt")
>>> sequence_label = torch.tensor([1])
>>> outputs = model(**encoding, labels=sequence_label)
>>> loss = outputs.loss
>>> logits = outputs.logits
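Since the returned logits are scores before SoftMax, a prediction still has to be derived from them. The following pure-Python sketch shows one way to do that for a single document; the logit values are made up for illustration and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return (index of the highest-probability class, its probability)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# One row of logits, i.e. one document with three candidate classes.
label_id, confidence = predict_class([0.2, 2.3, -1.0])
```

With tensors, the same step would typically be an argmax over the last dimension of outputs.logits.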
( config )
Parameters
LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states) e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None bbox: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None pixel_values: Optional = None ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals ((height / config.patch_size) * (width / config.patch_size)).
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1].
Returns
transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForTokenClassification forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModelForTokenClassification
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> word_labels = example["ner_tags"]
>>> encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
>>> outputs = model(**encoding)
>>> loss = outputs.loss
>>> logits = outputs.logits
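Token-level logits usually need to be reduced to one prediction per word. The sketch below assumes that positions which should be skipped (special tokens and non-first sub-tokens) carry the label -100 in the encoding; the logits and labels are made-up values and the helper name is hypothetical.

```python
IGNORE_INDEX = -100  # assumed label for positions that should be skipped


def word_level_predictions(token_logits, token_labels):
    """Argmax over classes for every position whose label is not ignored."""
    preds = []
    for logits, label in zip(token_logits, token_labels):
        if label == IGNORE_INDEX:
            continue
        preds.append(max(range(len(logits)), key=logits.__getitem__))
    return preds


# Four token positions with three classes each (illustrative values).
token_logits = [
    [0.1, 0.9, 0.0],  # special token, ignored
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 2.0],
    [1.5, 0.2, 0.1],  # continuation sub-token, ignored
]
token_labels = [-100, 1, 2, -100]
preds = word_level_predictions(token_logits, token_labels)
```

Only the two labelled positions contribute a prediction, which keeps predictions aligned with the word-level annotations.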
( config )
Parameters
LayoutLMv3 Model with a span classification head on top for extractive question-answering tasks such as DocVQA (a linear layer on top of the text part of the hidden-states output to compute span start logits and span end logits).
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None start_positions: Optional = None end_positions: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None bbox: Optional = None pixel_values: Optional = None ) → transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals ((height / config.patch_size) * (width / config.patch_size)).
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
start_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
end_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
Returns
transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForQuestionAnswering forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModelForQuestionAnswering
>>> from datasets import load_dataset
>>> import torch
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> question = "what's his name?"
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])
>>> outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
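To turn start_logits and end_logits into an answer, a common approach is to pick the (start, end) pair with the highest summed score, subject to start <= end and a maximum span length. The sketch below illustrates that idea in pure Python for one example; it is not the library's post-processing, the scores are made up, and best_span and max_answer_len are hypothetical names.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) maximizing start_logit + end_logit.

    Only spans with start <= end and at most max_answer_len tokens are
    considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best, best_score


span, score = best_span(
    [0.1, 2.0, 0.3, 0.0],  # start scores for four tokens
    [0.0, 0.5, 1.5, 3.0],  # end scores for the same tokens
    max_answer_len=2,
)
```

The length cap matters here: without it, the pair (1, 3) would win, whereas with at most two tokens the best span is (1, 2).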
( config *inputs **kwargs )
Parameters
The bare LayoutLMv3 Model transformer outputting raw hidden-states without any specific head on top. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
a single Tensor with input_ids only and nothing else: model(input_ids)
a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
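The three ways of packing inputs into the first positional argument can be pictured as a small normalization step that maps each form onto named inputs. The sketch below is purely illustrative and is not how the library unpacks inputs: gather_inputs is a hypothetical helper, and INPUT_NAMES is an assumed ordering chosen to match the generic examples above rather than the actual signature of this model.

```python
# Assumed positional order, for illustration only.
INPUT_NAMES = ("input_ids", "attention_mask", "token_type_ids")


def gather_inputs(first_arg, input_names=INPUT_NAMES):
    """Map a single value, a list/tuple, or a dict onto named inputs."""
    if isinstance(first_arg, dict):
        unknown = set(first_arg) - set(input_names)
        if unknown:
            raise ValueError(f"unexpected input names: {sorted(unknown)}")
        return dict(first_arg)
    if isinstance(first_arg, (list, tuple)):
        if len(first_arg) > len(input_names):
            raise ValueError("more inputs than known input names")
        # Positional entries are matched to names in the documented order.
        return dict(zip(input_names, first_arg))
    # A single value is treated as input_ids only.
    return {input_names[0]: first_arg}


single = gather_inputs("ids")
as_list = gather_inputs(["ids", "mask"])
as_dict = gather_inputs({"input_ids": "ids", "token_type_ids": "types"})
```

Strings stand in for tensors here so that the sketch runs without TensorFlow installed.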
( input_ids: tf.Tensor | None = None bbox: tf.Tensor | None = None attention_mask: tf.Tensor | None = None token_type_ids: tf.Tensor | None = None position_ids: tf.Tensor | None = None head_mask: tf.Tensor | None = None inputs_embeds: tf.Tensor | None = None pixel_values: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array or tf.Tensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner of the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals (height / config.patch_size) * (width / config.patch_size).
attention_mask (tf.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
token_type_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
position_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
head_mask (tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFBaseModelOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFLayoutLMv3Model forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this method, since calling the instance takes care of running the pre- and post-processing steps while calling this method directly silently ignores them.
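The bbox parameter above expects normalized boxes rather than raw pixel coordinates. A minimal sketch of one way to normalize them follows, assuming the 0-1000 scale commonly used with LayoutLM-family models and assuming config.max_2d_position_embeddings is larger than 1000. The function name `normalize_bbox` and the sample values are illustrative, not taken from this page.

```python
def normalize_bbox(bbox, width, height, scale=1000):
    """Rescale a pixel-space (x0, y0, x1, y1) box to integers in [0, scale].

    `scale=1000` is an assumption of this sketch; the values must stay
    below config.max_2d_position_embeddings for the model to accept them.
    """
    x0, y0, x1, y1 = bbox
    return [
        int(scale * x0 / width),   # left edge, relative to page width
        int(scale * y0 / height),  # top edge, relative to page height
        int(scale * x1 / width),   # right edge
        int(scale * y1 / height),  # bottom edge
    ]


# A word box on a hypothetical 1000 x 500 pixel page.
box = normalize_bbox((50, 100, 200, 150), width=1000, height=500)
print(box)  # [50, 200, 200, 300]
```

Normalizing by page size makes the boxes independent of the scan resolution, so the same layout produces the same 2D position indices at any image size.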
Examples:
>>> from transformers import AutoProcessor, TFAutoModel
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = TFAutoModel.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, words, boxes=boxes, return_tensors="tf")
>>> outputs = model(**encoding)
>>> last_hidden_states = outputs.last_hidden_state
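The sequence_length relation stated in the parameter descriptions can be worked through as plain arithmetic. The sketch below assumes a 224 x 224 image, a patch size of 16 and 512 text tokens; these numbers are illustrative assumptions, so check the configuration and processor settings of the checkpoint you actually use.

```python
def patch_sequence_length(height, width, patch_size):
    """Number of image patches: (height / patch_size) * (width / patch_size)."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def total_sequence_length(token_sequence_length, height, width, patch_size):
    """token_sequence_length + patch_sequence_length + 1 (the [CLS] token)."""
    return token_sequence_length + patch_sequence_length(height, width, patch_size) + 1


# Assumed values for illustration only.
num_patches = patch_sequence_length(height=224, width=224, patch_size=16)
seq_len = total_sequence_length(512, height=224, width=224, patch_size=16)
print(num_patches)  # 196
print(seq_len)      # 709
```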
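class transformers.TFLayoutLMv3ForSequenceClassification(config: LayoutLMv3Config, **kwargs)
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.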
LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for document image classification tasks such as the RVL-CDIP dataset.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors, in the order given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call(input_ids: tf.Tensor | None = None, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, labels: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, bbox: tf.Tensor | None = None, pixel_values: tf.Tensor | None = None, training: Optional[bool] = False) → transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array or tf.Tensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner of the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals (height / config.patch_size) * (width / config.patch_size).
attention_mask (tf.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
token_type_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
position_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
head_mask (tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFLayoutLMv3ForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this method, since calling the instance takes care of running the pre- and post-processing steps while calling this method directly silently ignores them.
Examples:
>>> from transformers import AutoProcessor, TFAutoModelForSequenceClassification
>>> from datasets import load_dataset
>>> import tensorflow as tf
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = TFAutoModelForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, words, boxes=boxes, return_tensors="tf")
>>> sequence_label = tf.convert_to_tensor([1])
>>> outputs = model(**encoding, labels=sequence_label)
>>> loss = outputs.loss
>>> logits = outputs.logits
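The logits returned above are scores before SoftMax, so they are not yet probabilities. A minimal pure-Python sketch of turning one row of logits into class probabilities and a predicted class follows. The logit values are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a single document and three classes.
example_logits = [2.0, 0.5, -1.0]
probs = softmax(example_logits)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

Because softmax preserves ordering, taking the argmax of the logits directly gives the same predicted class; the softmax step is only needed when calibrated-looking scores are wanted.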
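class transformers.TFLayoutLMv3ForTokenClassification(config: LayoutLMv3Config, **kwargs)
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.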
LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states) e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors, in the order given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call(input_ids: tf.Tensor | None = None, bbox: tf.Tensor | None = None, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, labels: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, pixel_values: tf.Tensor | None = None, training: Optional[bool] = False) → transformers.modeling_tf_outputs.TFTokenClassifierOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array or tf.Tensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner of the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals (height / config.patch_size) * (width / config.patch_size).
attention_mask (tf.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
token_type_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
position_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
head_mask (tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (tf.Tensor of shape (batch_size, sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1].
Returns
transformers.modeling_tf_outputs.TFTokenClassifierOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFTokenClassifierOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (tf.Tensor of shape (n,), optional, where n is the number of unmasked labels, returned when labels is provided) — Classification loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFLayoutLMv3ForTokenClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this method, since calling the instance takes care of running the pre- and post-processing steps while calling this method directly silently ignores them.
Examples:
>>> from transformers import AutoProcessor, TFAutoModelForTokenClassification
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = TFAutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> word_labels = example["ner_tags"]
>>> encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="tf")
>>> outputs = model(**encoding)
>>> loss = outputs.loss
>>> logits = outputs.logits
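Token classification logits have one row of scores per token, so a prediction is an argmax over the last dimension. The sketch below works on plain Python lists rather than tensors. It assumes that positions which should not be scored (special tokens, padding, non-first sub-word pieces) carry the label -100, which is an assumption about how the labels were encoded rather than something stated on this page.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from scoring


def token_predictions(logits, labels):
    """Return (predicted, gold) label ids for positions that are not ignored."""
    predicted, gold = [], []
    for token_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # skip special tokens, padding and sub-word continuations
        best = max(range(len(token_logits)), key=token_logits.__getitem__)
        predicted.append(best)
        gold.append(label)
    return predicted, gold


# Made-up scores for four tokens and three label classes.
example_logits = [
    [0.1, 0.2, 0.3],
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.1, 0.9],
]
example_labels = [-100, 1, -100, 2]

predicted, gold = token_predictions(example_logits, example_labels)
print(predicted)  # [1, 2]
```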
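class transformers.TFLayoutLMv3ForQuestionAnswering(config: LayoutLMv3Config, **kwargs)
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.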
LayoutLMv3 Model with a span classification head on top for extractive question-answering tasks such as DocVQA (a linear layer on top of the text part of the hidden-states output to compute span start logits and span end logits).
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors, in the order given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call(input_ids: tf.Tensor | None = None, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, start_positions: tf.Tensor | None = None, end_positions: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, bbox: tf.Tensor | None = None, pixel_values: tf.Tensor | None = None, return_dict: Optional[bool] = None, training: bool = False) → transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array or tf.Tensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner of the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size) and the total number of patches (= patch_sequence_length) equals (height / config.patch_size) * (width / config.patch_size).
attention_mask (tf.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
token_type_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
position_ids (Numpy array or tf.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
Note that sequence_length = token_sequence_length + patch_sequence_length + 1, where 1 is for the [CLS] token. See pixel_values for patch_sequence_length.
head_mask (tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
start_positions (tf.Tensor of shape (batch_size,), optional) — Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
end_positions (tf.Tensor of shape (batch_size,), optional) — Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
Returns
transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (tf.Tensor of shape (batch_size, ), optional, returned when start_positions and end_positions are provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
start_logits (tf.Tensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
end_logits (tf.Tensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFLayoutLMv3ForQuestionAnswering forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the model instance afterwards instead of this method, since calling the instance takes care of running the pre- and post-processing steps while calling this method directly silently ignores them.
Examples:
>>> from transformers import AutoProcessor, TFAutoModelForQuestionAnswering
>>> from datasets import load_dataset
>>> import tensorflow as tf
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> question = "what's his name?"
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, question, words, boxes=boxes, return_tensors="tf")
>>> start_positions = tf.convert_to_tensor([1])
>>> end_positions = tf.convert_to_tensor([3])
>>> outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits