The LayoutLMv3 model was proposed in "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking" by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3 simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of a CNN backbone, and pre-trains the model on three objectives: masked language modeling (MLM), masked image modeling (MIM), and word-patch alignment (WPA).
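To make the patch-embedding idea concrete, here is a minimal sketch of the first step a ViT-style model performs: cutting an image into non-overlapping square patches and flattening each one into a vector. The function name `split_into_patches` and the toy 4 x 4 image are illustrative assumptions, not part of the library; the learned linear projection that a real model applies to each flattened patch is omitted.

```python
from typing import List

Image = List[List[float]]  # a single-channel image as rows of pixel values


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Cut an H x W image into non-overlapping patches and flatten each one.

    Patches are returned in row-major order (left to right, top to bottom),
    which is the order in which they would form a token sequence.
    Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4 x 4 "image" whose pixel value encodes its position.
toy_image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
toy_patches = split_into_patches(toy_image, patch_size=2)
print(len(toy_patches), toy_patches[0])  # 4 [0.0, 1.0, 4.0, 5.0]
```

With a 224 x 224 input and 16 x 16 patches (assumed here as typical ViT-style settings), the same routine would yield 14 x 14 = 196 patches per image.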

The abstract from the paper is the following:

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.
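The word-patch alignment objective described above can be illustrated with a small sketch that builds the binary targets: for each word, look up the image patch it falls in and check whether that patch was masked. This is a simplified illustration, not the authors' implementation. It assumes each word maps to exactly one patch (the one containing the center of its bounding box), and the names `patch_index` and `wpa_labels` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def patch_index(x: float, y: float, image_size: int, patch_size: int) -> int:
    """Return the row-major index of the patch containing pixel (x, y)."""
    per_row = image_size // patch_size
    # Clamp so that points on the right or bottom edge stay inside the grid.
    col = min(int(x) // patch_size, per_row - 1)
    row = min(int(y) // patch_size, per_row - 1)
    return row * per_row + col


def wpa_labels(
    word_centers: Sequence[Tuple[float, float]],
    masked_patches: Set[int],
    image_size: int,
    patch_size: int,
) -> List[int]:
    """Build one target per word: 1 = aligned (patch visible), 0 = unaligned (patch masked).

    Simplifying assumption: a word corresponds to the single patch
    that contains the center of its bounding box.
    """
    labels = []
    for x, y in word_centers:
        idx = patch_index(x, y, image_size, patch_size)
        labels.append(0 if idx in masked_patches else 1)
    return labels


# Toy setup: an 8 x 8 image split into four 4 x 4 patches (indices 0..3).
centers = [(1, 1), (6, 1), (2, 6), (7, 7)]  # one word per patch
masked = {1, 2}                             # patches hidden by image masking
labels = wpa_labels(centers, masked, image_size=8, patch_size=4)
print(labels)  # [1, 0, 0, 1]
```

A model trained on such targets has to relate each word to the image region it occupies, which is the cross-modal alignment the objective is meant to encourage.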

This model was contributed by nielsr. The TensorFlow version of this model was added by chriskoo, tokec, and lre. The original code can be found here.

Usage tips

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

LayoutLMv3Config

class transformers.LayoutLMv3Config

LayoutLMv3FeatureExtractor

class transformers.LayoutLMv3FeatureExtractor

__call__

LayoutLMv3ImageProcessor

class transformers.LayoutLMv3ImageProcessor

preprocess

LayoutLMv3Tokenizer

class transformers.LayoutLMv3Tokenizer

__call__

save_vocabulary

LayoutLMv3TokenizerFast

class transformers.LayoutLMv3TokenizerFast

__call__

LayoutLMv3Processor

class transformers.LayoutLMv3Processor

__call__

LayoutLMv3Model

class transformers.LayoutLMv3Model

forward

LayoutLMv3ForSequenceClassification

class transformers.LayoutLMv3ForSequenceClassification

forward

LayoutLMv3ForTokenClassification

class transformers.LayoutLMv3ForTokenClassification

forward

LayoutLMv3ForQuestionAnswering

class transformers.LayoutLMv3ForQuestionAnswering

forward

TFLayoutLMv3Model

class transformers.TFLayoutLMv3Model

call

TFLayoutLMv3ForSequenceClassification

class transformers.TFLayoutLMv3ForSequenceClassification

call

TFLayoutLMv3ForTokenClassification

class transformers.TFLayoutLMv3ForTokenClassification

call

TFLayoutLMv3ForQuestionAnswering

class transformers.TFLayoutLMv3ForQuestionAnswering

call