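The model can be used for tasks such as question answering on web pages or information extraction from web pages. It obtains state-of-the-art results on 2 important benchmarks: WebSRC, a question answering dataset for web pages, and SWDE, an information extraction dataset for web pages.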

Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially the fixed-layout documents such as scanned document images. While, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based pre-training approaches not easy to apply. In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available.

Usage tips

Usage: MarkupLMProcessor

The easiest way to prepare data for the model is to use MarkupLMProcessor, which internally combines a feature extractor (MarkupLMFeatureExtractor) and a tokenizer (MarkupLMTokenizer or MarkupLMTokenizerFast). The feature extractor is used to extract all nodes and xpaths from the HTML strings, which are then provided to the tokenizer, which turns them into the token-level inputs of the model (input_ids etc.). Note that you can still use the feature extractor and tokenizer separately, if you only want to handle one of the two tasks.

In short, one can provide HTML strings (and possibly additional data) to MarkupLMProcessor, and it will create the inputs expected by the model. Internally, the processor first uses MarkupLMFeatureExtractor to get a list of nodes and corresponding xpaths. The nodes and xpaths are then provided to MarkupLMTokenizer or MarkupLMTokenizerFast, which converts them to token-level input_ids, attention_mask, token_type_ids, xpath_subs_seq, xpath_tags_seq. Optionally, one can provide node labels to the processor, which are turned into token-level labels.
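To make the first stage concrete, here is a minimal standard-library sketch of what "extracting nodes and xpaths" produces. It is not the implementation of MarkupLMFeatureExtractor: the names `NodeXPathExtractor` and `extract_nodes_and_xpaths` are hypothetical, the sketch assumes well-nested HTML, and it writes a positional index on every tag, so the xpath strings the library produces may be formatted differently.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NodeXPathExtractor(HTMLParser):
    """Hypothetical helper: collects each text node with the xpath of its parent."""

    def __init__(self):
        super().__init__()
        self.path = []            # steps of the currently open elements
        self.child_counts = [{}]  # per open element: how many children of each tag
        self.nodes = []
        self.xpaths = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.path.append(f"{tag}[{counts[tag]}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Assumes well-nested markup: the closing tag matches the last open one.
        if tag in VOID_TAGS or not self.path:
            return
        self.path.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.path:
            self.nodes.append(text)
            self.xpaths.append("/" + "/".join(self.path))


def extract_nodes_and_xpaths(html_string):
    parser = NodeXPathExtractor()
    parser.feed(html_string)
    parser.close()
    return parser.nodes, parser.xpaths
```

Calling `extract_nodes_and_xpaths` on a page returns two parallel lists, one text node and one xpath per entry, which is the shape of input the tokenizer stage expects.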

In total, the processor supports 5 use cases, all listed below. Each of them works for both batched and non-batched inputs (we illustrate them for non-batched inputs).

Use case 1: web page classification (training, inference) + token classification (inference), parse_html=True

This is the simplest case, in which the processor will use the feature extractor to get all nodes and xpaths from the HTML.
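A minimal sketch of this case follows. The helper names `encode_html` and `demo_use_case_1`, the sample HTML, and the checkpoint name `microsoft/markuplm-base` are assumptions made for illustration; the demo needs `transformers`, `torch`, and `beautifulsoup4` installed, plus network access to download the checkpoint.

```python
def encode_html(processor, html_string):
    """Encode one raw HTML string; the processor extracts nodes and xpaths itself."""
    return processor(html_string, return_tensors="pt")


def demo_use_case_1():
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import MarkupLMProcessor

    # Assumed checkpoint name; swap in the one you actually use.
    processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

    html_string = """
    <html>
      <head><title>Page title</title></head>
      <body>
        <h1>Welcome</h1>
        <p>Here is my website.</p>
      </body>
    </html>
    """
    encoding = encode_html(processor, html_string)
    # Inspect which token-level inputs were created.
    print(list(encoding.keys()))
```

Run `demo_use_case_1()` to see the created inputs; the helper on its own only forwards the HTML string to the processor.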

Use case 2: web page classification (training, inference) + token classification (inference), parse_html=False

In case one already has obtained all nodes and xpaths, one doesn’t need the feature extractor. In that case, one should provide the nodes and corresponding xpaths themselves to the processor, and make sure to set parse_html to False.
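This case can be sketched as below. The helper names `encode_nodes` and `demo_use_case_2`, the sample nodes and xpaths, and the checkpoint name `microsoft/markuplm-base` are assumptions for illustration. Note that the helper sets `parse_html` on the processor object, which also affects later calls on that same object.

```python
def encode_nodes(processor, nodes, xpaths):
    """Encode pre-extracted nodes and xpaths, bypassing the feature extractor."""
    # Changes the processor in place: later calls will also skip HTML parsing.
    processor.parse_html = False
    return processor(nodes=nodes, xpaths=xpaths, return_tensors="pt")


def demo_use_case_2():
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import MarkupLMProcessor

    # Assumed checkpoint name; swap in the one you actually use.
    processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

    # Illustrative data: one xpath per node, in the same order.
    nodes = ["hello", "world", "how", "are"]
    xpaths = [
        "/html/body/div/li[1]/div/span",
        "/html/body/div/li[1]/div/span",
        "/html/body",
        "/html/body/div",
    ]
    encoding = encode_nodes(processor, nodes, xpaths)
    print(list(encoding.keys()))
```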

Use case 3: token classification (training), parse_html=False

For token classification tasks (such as SWDE), one can also provide the corresponding node labels in order to train a model. The processor will then convert these into token-level labels. By default, it will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the ignore_index of PyTorch’s CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can initialize the tokenizer with only_label_first_subword set to False.
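The labeling rule can be sketched in plain Python. `align_labels` is a hypothetical helper that only mirrors the rule described above (it is not the tokenizer's code, and the wordpiece splits are supplied by hand); `encode_for_training` shows, under the same assumptions as the other sketches, how node labels might be passed to the processor.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(pieces_per_node, node_labels, only_label_first_subword=True):
    """Expand one label per node into one label per wordpiece."""
    if len(pieces_per_node) != len(node_labels):
        raise ValueError("need exactly one label per node")
    token_labels = []
    for pieces, label in zip(pieces_per_node, node_labels):
        if not pieces:
            continue  # a node that produced no wordpieces gets no label
        if only_label_first_subword:
            # Real label on the first piece, ignore label on the rest.
            token_labels.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
        else:
            token_labels.extend([label] * len(pieces))
    return token_labels


def encode_for_training(processor, nodes, xpaths, node_labels):
    """Pass node labels to the processor so it returns token-level labels."""
    processor.parse_html = False  # nodes and xpaths are supplied by the caller
    return processor(
        nodes=nodes,
        xpaths=xpaths,
        node_labels=node_labels,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
```

For example, with hand-made splits `[["hel", "lo"], ["world"]]` and labels `[1, 2]`, `align_labels` gives `[1, -100, 2]` by default and `[1, 1, 2]` when `only_label_first_subword=False`.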

Use case 4: web page question answering (inference), parse_html=True

For question answering tasks on web pages, you can provide a question to the processor. By default, the processor will use the feature extractor to get all nodes and xpaths, and build the sequence [CLS] question tokens [SEP] word tokens [SEP].
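A minimal sketch of passing a question together with raw HTML. The helper names, the sample question and page, and the checkpoint name `microsoft/markuplm-base` are assumptions for illustration.

```python
def encode_question_and_html(processor, question, html_string):
    """Encode a question with a raw HTML page; the processor parses the page."""
    return processor(html_string, questions=question, return_tensors="pt")


def demo_use_case_4():
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import MarkupLMProcessor

    # Assumed checkpoint name; swap in the one you actually use.
    processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

    html_string = """
    <html>
      <head><title>Page title</title></head>
      <body><p>My name is Niels.</p></body>
    </html>
    """
    question = "What's his name?"
    encoding = encode_question_and_html(processor, question, html_string)
    print(list(encoding.keys()))
```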

Use case 5: web page question answering (inference), parse_html=False

For question answering tasks (such as WebSRC), you can provide a question to the processor. If you have extracted all nodes and xpaths yourself, you can provide them directly to the processor. Make sure to set parse_html to False.
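The same idea with pre-extracted nodes and xpaths can be sketched as follows. The helper names, the sample data, and the checkpoint name `microsoft/markuplm-base` are assumptions for illustration, and the helper changes `parse_html` on the processor object in place.

```python
def encode_question_and_nodes(processor, question, nodes, xpaths):
    """Encode a question with pre-extracted nodes and xpaths."""
    # Changes the processor in place: later calls will also skip HTML parsing.
    processor.parse_html = False
    return processor(
        nodes=nodes, xpaths=xpaths, questions=question, return_tensors="pt"
    )


def demo_use_case_5():
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import MarkupLMProcessor

    # Assumed checkpoint name; swap in the one you actually use.
    processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

    # Illustrative data: one xpath per node, in the same order.
    nodes = ["hello", "world", "how", "are"]
    xpaths = [
        "/html/body/div/li[1]/div/span",
        "/html/body/div/li[1]/div/span",
        "/html/body",
        "/html/body/div",
    ]
    question = "What's his name?"
    encoding = encode_question_and_nodes(processor, question, nodes, xpaths)
    print(list(encoding.keys()))
```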

Resources

MarkupLMConfig

class transformers.MarkupLMConfig

MarkupLMFeatureExtractor

class transformers.MarkupLMFeatureExtractor

__call__

MarkupLMTokenizer

class transformers.MarkupLMTokenizer

build_inputs_with_special_tokens

get_special_tokens_mask

create_token_type_ids_from_sequences

save_vocabulary

MarkupLMTokenizerFast

class transformers.MarkupLMTokenizerFast

batch_encode_plus

build_inputs_with_special_tokens

create_token_type_ids_from_sequences

encode_plus

get_xpath_seq

MarkupLMProcessor

class transformers.MarkupLMProcessor

__call__

MarkupLMModel

class transformers.MarkupLMModel

forward

MarkupLMForSequenceClassification

class transformers.MarkupLMForSequenceClassification

forward

MarkupLMForTokenClassification

class transformers.MarkupLMForTokenClassification

forward

MarkupLMForQuestionAnswering

class transformers.MarkupLMForQuestionAnswering

forward
