Transformers documentation

LayoutLMv3

Join the Hugging Face community

to get started

# LayoutLMv3

## Overview

The LayoutLMv3 model was proposed in LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3 simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).

The abstract from the paper is the following:

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.

Tips:

• In terms of data processing, LayoutLMv3 is identical to its predecessor LayoutLMv2, except that:
• images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
• text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece. Due to these differences in data preprocessing, one can use LayoutLMv3Processor which internally combines a LayoutLMv3FeatureExtractor (for the image modality) and a LayoutLMv3Tokenizer/LayoutLMv3TokenizerFast (for the text modality) to prepare all data for the model.
• Regarding usage of LayoutLMv3Processor, we refer to the usage guide of its predecessor.
• Demo notebooks for LayoutLMv3 can be found here.
LayoutLMv3 architecture. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

## LayoutLMv3Config

### class transformers.LayoutLMv3Config

< >

( vocab_size = 50265 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-05 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 max_2d_position_embeddings = 1024 coordinate_size = 128 shape_size = 128 has_relative_attention_bias = True rel_pos_bins = 32 max_rel_pos = 128 rel_2d_pos_bins = 64 max_rel_2d_pos = 256 has_spatial_attention_bias = True text_embed = True visual_embed = True input_size = 224 num_channels = 3 patch_size = 16 classifier_dropout = None **kwargs )

Parameters

• vocab_size (int, optional, defaults to 50265) — Vocabulary size of the LayoutLMv3 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LayoutLMv3Model.
• hidden_size (int, optional, defaults to 768) — Dimension of the encoder layers and the pooler layer.
• num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
• num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
• intermediate_size (int, optional, defaults to 3072) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
• hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
• hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
• attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
• max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
• type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling LayoutLMv3Model.
• initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
• layer_norm_eps (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.
• max_2d_position_embeddings (int, optional, defaults to 1024) — The maximum value that the 2D position embedding might ever be used with. Typically set this to something large just in case (e.g., 1024).
• coordinate_size (int, optional, defaults to 128) — Dimension of the coordinate embeddings.
• shape_size (int, optional, defaults to 128) — Dimension of the width and height embeddings.
• has_relative_attention_bias (bool, optional, defaults to True) — Whether or not to use a relative attention bias in the self-attention mechanism.
• rel_pos_bins (int, optional, defaults to 32) — The number of relative position bins to be used in the self-attention mechanism.
• max_rel_pos (int, optional, defaults to 128) — The maximum number of relative positions to be used in the self-attention mechanism.
• max_rel_2d_pos (int, optional, defaults to 256) — The maximum number of relative 2D positions in the self-attention mechanism.
• rel_2d_pos_bins (int, optional, defaults to 64) — The number of 2D relative position bins in the self-attention mechanism.
• has_spatial_attention_bias (bool, optional, defaults to True) — Whether or not to use a spatial attention bias in the self-attention mechanism.
• visual_embed (bool, optional, defaults to True) — Whether or not to add patch embeddings.
• input_size (int, optional, defaults to 224) — The size (resolution) of the images.
• num_channels (int, optional, defaults to 3) — The number of channels of the images.
• patch_size (int, optional, defaults to 16) — The size (resolution) of the patches.
• classifier_dropout (float, optional) — The dropout ratio for the classification head.

This is the configuration class to store the configuration of a LayoutLMv3Model. It is used to instantiate an LayoutLMv3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LayoutLMv3 microsoft/layoutlmv3-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import LayoutLMv3Model, LayoutLMv3Config

>>> # Initializing a LayoutLMv3 microsoft/layoutlmv3-base style configuration
>>> configuration = LayoutLMv3Config()

>>> # Initializing a model from the microsoft/layoutlmv3-base style configuration
>>> model = LayoutLMv3Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

## LayoutLMv3FeatureExtractor

### class transformers.LayoutLMv3FeatureExtractor

< >

( do_resize = True size = 224 resample = <Resampling.BILINEAR: 2> do_normalize = True image_mean = None image_std = None apply_ocr = True ocr_lang = None **kwargs )

Parameters

• do_resize (bool, optional, defaults to True) — Whether to resize the input to a certain size.
• size (int or Tuple(int), optional, defaults to 224) — Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an integer is provided, then the input will be resized to (size, size). Only has an effect if do_resize is set to True.
• resample (int, optional, defaults to PIL.Image.BILINEAR) — An optional resampling filter. This can be one of PIL.Image.NEAREST, PIL.Image.BOX, PIL.Image.BILINEAR, PIL.Image.HAMMING, PIL.Image.BICUBIC or PIL.Image.LANCZOS. Only has an effect if do_resize is set to True.
• do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with mean and standard deviation.
• image_mean (List[int], defaults to [0.5, 0.5, 0.5]) — The sequence of means for each channel, to be used when normalizing images.
• image_std (List[int], defaults to [0.5, 0.5, 0.5]) — The sequence of standard deviations for each channel, to be used when normalizing images.
• apply_ocr (bool, optional, defaults to True) — Whether to apply the Tesseract OCR engine to get words + normalized bounding boxes.
• ocr_lang (Optional[str], optional) — The language, specified by its ISO code, to be used by the Tesseract OCR engine. By default, English is used.

LayoutLMv3FeatureExtractor uses Google’s Tesseract OCR engine under the hood.

Constructs a LayoutLMv3 feature extractor. This can be used to resize + normalize document images, as well as to apply OCR on them in order to get a list of words and normalized bounding boxes.

This feature extractor inherits from PreTrainedFeatureExtractor() which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

#### __call__

< >

( images: typing.Union[PIL.Image.Image, numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None **kwargs ) BatchFeature

Parameters

• images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a number of channels, H and W are image height and width.
• return_tensors (str or TensorType, optional, defaults to 'np') — If set, will return tensors of a particular framework. Acceptable values are:

• 'tf': Return TensorFlow tf.constant objects.
• 'pt': Return PyTorch torch.Tensor objects.
• 'np': Return NumPy np.ndarray objects.
• 'jax': Return JAX jnp.ndarray objects.

Returns

BatchFeature

A BatchFeature with the following fields:

• pixel_values — Pixel values to be fed to a model, of shape (batch_size, num_channels, height, width).
• words — Optional words as identified by Tesseract OCR (only when LayoutLMv3FeatureExtractor was initialized with apply_ocr set to True).
• boxes — Optional bounding boxes as identified by Tesseract OCR, normalized based on the image size (only when LayoutLMv3FeatureExtractor was initialized with apply_ocr set to True).

Main method to prepare for the model one or several image(s).

Examples:

>>> from transformers import LayoutLMv3FeatureExtractor
>>> from PIL import Image

>>> image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")

>>> # option 1: with apply_ocr=True (default)
>>> feature_extractor = LayoutLMv3FeatureExtractor()
>>> encoding = feature_extractor(image, return_tensors="pt")
>>> print(encoding.keys())
>>> # dict_keys(['pixel_values', 'words', 'boxes'])

>>> # option 2: with apply_ocr=False
>>> feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr=False)
>>> encoding = feature_extractor(image, return_tensors="pt")
>>> print(encoding.keys())
>>> # dict_keys(['pixel_values'])

## LayoutLMv3Tokenizer

### class transformers.LayoutLMv3Tokenizer

< >

( vocab_file merges_file errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )

Parameters

• vocab_file (str) — Path to the vocabulary file.
• merges_file (str) — Path to the merges file.
• errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
• bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

• eos_token (str, optional, defaults to "</s>") — The end of sequence token.

When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

• sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
• cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
• unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
• pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
• mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
• add_prefix_space (bool, optional, defaults to False) — Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word. (RoBERTa tokenizer detect beginning of words by the preceding space).
• cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
• sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
• pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
• pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
• only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.

Construct a LayoutLMv3 tokenizer. Based on RoBERTatokenizer (Byte Pair Encoding or BPE). LayoutLMv3Tokenizer can be used to turn words, word-level bounding boxes and optional word labels to token-level input_ids, attention_mask, token_type_ids, bbox, and optional labels (for token classification).

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

LayoutLMv3Tokenizer runs end-to-end tokenization: punctuation splitting and wordpiece. It also turns the word-level bounding boxes into token-level bounding boxes.

#### __call__

< >

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )

Parameters

• text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of list of strings (batch of words).
• text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
• boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
• word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
• add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
• padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

• True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
• 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
• False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
• truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:

• True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
• max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.

If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

• stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
• pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
• return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:

• 'tf': Return TensorFlow tf.constant objects.
• 'pt': Return PyTorch torch.Tensor objects.
• 'np': Return Numpy np.ndarray objects.
• add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
• padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

• True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
• 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
• False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
• truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:

• True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
• max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
• stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
• pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
• return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:

• 'tf': Return TensorFlow tf.constant objects.
• 'pt': Return PyTorch torch.Tensor objects.
• 'np': Return Numpy np.ndarray objects.

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.

#### save_vocabulary

< >

( save_directory: str filename_prefix: typing.Optional[str] = None )

## LayoutLMv3TokenizerFast

### class transformers.LayoutLMv3TokenizerFast

< >

( vocab_file = None merges_file = None tokenizer_file = None errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True trim_offsets = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )

Parameters

• vocab_file (str) — Path to the vocabulary file.
• merges_file (str) — Path to the merges file.
• errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
• bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

• eos_token (str, optional, defaults to "</s>") — The end of sequence token.

When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

• sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
• cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
• unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
• pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
• mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
• add_prefix_space (bool, optional, defaults to False) — Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word. (RoBERTa tokenizer detect beginning of words by the preceding space).
• trim_offsets (bool, optional, defaults to True) — Whether the post processing step should trim offsets to avoid including whitespaces.
• cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
• sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
• pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
• pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
• only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.

Construct a “fast” LayoutLMv3 tokenizer (backed by HuggingFace’s tokenizers library). Based on BPE.

This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

#### __call__

< >

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )

Parameters

• text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of list of strings (batch of words).
• text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
• boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
• word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
• add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
• padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

• True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
• 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
• False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
• truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:

• True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
• max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.

If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

• stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
• pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
• return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:

• 'tf': Return TensorFlow tf.constant objects.
• 'pt': Return PyTorch torch.Tensor objects.
• 'np': Return Numpy np.ndarray objects.
• add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
• padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

• True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
• 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
• False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
• truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:

• True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
• False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
• max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
• stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
• pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
• return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:

• 'tf': Return TensorFlow tf.constant objects.
• 'pt': Return PyTorch torch.Tensor objects.
• 'np': Return Numpy np.ndarray objects.

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.

## LayoutLMv3Processor

### class transformers.LayoutLMv3Processor

< >

( *args **kwargs )

Parameters

• feature_extractor (LayoutLMv3FeatureExtractor) — An instance of LayoutLMv3FeatureExtractor. The feature extractor is a required input.
• tokenizer (LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast) — An instance of LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast. The tokenizer is a required input.

Constructs a LayoutLMv3 processor which combines a LayoutLMv3 feature extractor and a LayoutLMv3 tokenizer into a single processor.

LayoutLMv3Processor offers all the functionalities you need to prepare data for the model.

It first uses LayoutLMv3FeatureExtractor to resize and normalize document images, and optionally applies OCR to get words and normalized bounding boxes. These are then provided to LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast, which turns the words and bounding boxes into token-level input_ids, attention_mask, token_type_ids, bbox. Optionally, one can provide integer word_labels, which are turned into token-level labels for token classification tasks (such as FUNSD, CORD).

#### __call__

< >

( images text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None **kwargs )

This method first forwards the images argument to call(). In case LayoutLMv3FeatureExtractor was initialized with apply_ocr set to True, it passes the obtained words and bounding boxes along with the additional arguments to call() and returns the output, together with resized and normalized pixel_values. In case LayoutLMv3FeatureExtractor was initialized with apply_ocr set to False, it passes the words (text/text_pair) and boxes specified by the user along with the additional arguments to call() and returns the output, together with resized and normalized pixel_values.

## LayoutLMv3Model

### class transformers.LayoutLMv3Model

< >

( config )

Parameters

• config (LayoutLMv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare LayoutLMv3 Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

#### forward

< >

( input_ids = None bbox = None attention_mask = None token_type_ids = None position_ids = None head_mask = None inputs_embeds = None pixel_values = None output_attentions = None output_hidden_states = None return_dict = None ) transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

Parameters

• input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using LayoutLMv2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

What are input IDs?

• bbox (torch.LongTensor of shape ((batch_size, sequence_length), 4), optional) — Bounding boxes of each input sequence tokens. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
• attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

• 1 for tokens that are not masked,
• 0 for tokens that are masked.

• token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

• 0 corresponds to a sentence A token,
• 1 corresponds to a sentence B token.

What are token type IDs?

• position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

What are position IDs?

• head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:

• inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
• output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
• output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
• return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

• last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The LayoutLMv3Model forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> from transformers import AutoProcessor, AutoModel

>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")

>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]

>>> encoding = processor(image, words, boxes=boxes, return_tensors="pt")

>>> outputs = model(**encoding)
>>> last_hidden_states = outputs.last_hidden_state

## LayoutLMv3ForSequenceClassification

### class transformers.LayoutLMv3ForSequenceClassification

< >

( config )

Parameters

• config (LayoutLMv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for document image classification tasks such as the RVL-CDIP dataset.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

#### forward

< >

( input_ids = None attention_mask = None token_type_ids = None position_ids = None head_mask = None inputs_embeds = None labels = None output_attentions = None output_hidden_states = None return_dict = None bbox = None pixel_values = None ) transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

• input_ids (torch.LongTensor of shape batch_size, sequence_length) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using LayoutLMv2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

What are input IDs?

• bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence tokens. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
• attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

• 1 for tokens that are not masked,
• 0 for tokens that are masked.

• token_type_ids (torch.LongTensor of shape batch_size, sequence_length, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

• 0 corresponds to a sentence A token,
• 1 corresponds to a sentence B token.

What are token type IDs?

• position_ids (torch.LongTensor of shape batch_size, sequence_length, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

What are position IDs?

• head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:

• inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
• output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
• output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
• return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

• loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.

• logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).

• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The LayoutLMv3ForSequenceClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> from transformers import AutoProcessor, AutoModelForSequenceClassification
>>> import torch

>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base")

>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]

>>> encoding = processor(image, words, boxes=boxes, return_tensors="pt")
>>> sequence_label = torch.tensor([1])

>>> outputs = model(**encoding, labels=sequence_label)
>>> loss = outputs.loss
>>> logits = outputs.logits

## LayoutLMv3ForTokenClassification

### class transformers.LayoutLMv3ForTokenClassification

< >

( config )

Parameters

• config (LayoutLMv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states) e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

#### forward

< >

( input_ids = None bbox = None attention_mask = None token_type_ids = None position_ids = None head_mask = None inputs_embeds = None labels = None output_attentions = None output_hidden_states = None return_dict = None pixel_values = None ) transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

Parameters

• input_ids (torch.LongTensor of shape batch_size, sequence_length) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using LayoutLMv2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

What are input IDs?

• bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence tokens. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
• attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

• 1 for tokens that are not masked,
• 0 for tokens that are masked.

• token_type_ids (torch.LongTensor of shape batch_size, sequence_length, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

• 0 corresponds to a sentence A token,
• 1 corresponds to a sentence B token.

What are token type IDs?

• position_ids (torch.LongTensor of shape batch_size, sequence_length, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

What are position IDs?

• head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:

• inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
• output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
• output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
• return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
• labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1].

Returns

transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

• loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.

• logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).

• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The LayoutLMv3ForTokenClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> from transformers import AutoProcessor, AutoModelForTokenClassification

>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> word_labels = example["ner_tags"]

>>> encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")

>>> outputs = model(**encoding)
>>> loss = outputs.loss
>>> logits = outputs.logits

< >

( config )

Parameters

• config (LayoutLMv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

LayoutLMv3 Model with a span classification head on top for extractive question-answering tasks such as DocVQA (a linear layer on top of the text part of the hidden-states output to compute span start logits and span end logits).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

#### forward

< >

( input_ids = None attention_mask = None token_type_ids = None position_ids = None head_mask = None inputs_embeds = None start_positions = None end_positions = None output_attentions = None output_hidden_states = None return_dict = None bbox = None pixel_values = None ) transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

Parameters

• input_ids (torch.LongTensor of shape batch_size, sequence_length) — Indices of input sequence tokens in the vocabulary.

Indices can be obtained using LayoutLMv2Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.

What are input IDs?

• bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence tokens. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
• pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
• attention_mask (torch.FloatTensor of shape batch_size, sequence_length, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

• 1 for tokens that are not masked,
• 0 for tokens that are masked.

• token_type_ids (torch.LongTensor of shape batch_size, sequence_length, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

• 0 corresponds to a sentence A token,
• 1 corresponds to a sentence B token.

What are token type IDs?

• position_ids (torch.LongTensor of shape batch_size, sequence_length, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

What are position IDs?

• head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:

• inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
• output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
• output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
• return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
• start_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss.
• end_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss.

Returns

transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.

• loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.

• start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).

• end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).

• hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

• attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The LayoutLMv3ForQuestionAnswering forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> from transformers import AutoProcessor, AutoModelForQuestionAnswering
>>> import torch

>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

>>> example = dataset[0]
>>> image = example["image"]
>>> question = "what's his name?"
>>> words = example["tokens"]
>>> boxes = example["bboxes"]

>>> encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])

>>> outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits`