LayoutLMV2

Overview

The LayoutLMV2 model was proposed in LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves LayoutLM to obtain state-of-the-art results across several document image understanding benchmarks:

  • information extraction from scanned documents: the FUNSD dataset (a collection of 199 annotated forms comprising more than 30,000 words), the CORD dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the SROIE dataset (a collection of 626 receipts for training and 347 receipts for testing) and the Kleister-NDA dataset (a collection of non-disclosure agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203 documents for testing).
  • document image classification: the RVL-CDIP dataset (a collection of 400,000 images belonging to one of 16 classes).
  • document visual question answering: the DocVQA dataset (a collection of 50,000 questions defined on 12,000+ document images).

The abstract from the paper is the following:

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained LayoutLMv2 model is publicly available at this https URL.

LayoutLMv2 depends on detectron2, torchvision and tesseract. Run the following to install them:

python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
python -m pip install torchvision tesseract

(If you are developing for LayoutLMv2, note that passing the doctests also requires the installation of these packages.)

Tips:

  • The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
  • LayoutLMv2 adds both a relative 1D attention bias and a spatial 2D attention bias to the attention scores in the self-attention layers. Details can be found on page 5 of the paper.
  • Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found here.
  • LayoutLMv2 uses Facebook AI's Detectron2 package for its visual backbone. See this link for installation instructions.
  • In addition to input_ids, [~LayoutLMv2Model.forward] expects 2 additional inputs, namely image and bbox. The image input corresponds to the original document image in which the text tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of document images, image should be a tensor of shape (batch_size, 3, 224, 224). This can be either a torch.Tensor or a detectron2.structures.ImageList. You don't need to normalize the channels, as this is done by the model. Note that the visual backbone expects BGR channels instead of RGB, as all models in Detectron2 are pre-trained using the BGR format. The bbox input contains the bounding boxes (i.e. 2D positions) of the input text tokens. This is identical to [LayoutLMModel]. These can be obtained using an external OCR engine such as Google's Tesseract (a Python wrapper is available). Each bounding box should be in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner of the bounding box, and (x1, y1) represents the position of the lower right corner. Note that the bounding boxes first need to be normalized to a 0-1000 scale. To normalize, you can use the following function:
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]

Here, width and height correspond to the width and height of the original document in which the token occurs (before resizing the image). Those can be obtained using the Python Imaging Library (PIL), for example, as follows:

from PIL import Image

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
)

width, height = image.size
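
If you run OCR yourself, the following is a minimal sketch of how word-level results from pytesseract (the Python wrapper around Tesseract mentioned above) can be turned into words and normalized boxes using the normalize_bbox function defined above. The file name "document.png" is a placeholder:

import pytesseract
from PIL import Image

image = Image.open("document.png").convert("RGB")
width, height = image.size

# pytesseract returns word-level boxes as (left, top, width, height)
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

words, boxes = [], []
for text, x, y, w, h in zip(ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]):
    if text.strip():  # skip empty OCR entries
        words.append(text)
        boxes.append(normalize_bbox([x, y, x + w, y + h], width, height))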

However, this model includes a brand new [~transformers.LayoutLMv2Processor] which can be used to directly prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage" section below.

  • Internally, [~transformers.LayoutLMv2Model] will send the image input through its visual backbone to obtain a lower-resolution feature map, whose shape is equal to the image_feature_pool_shape attribute of [~transformers.LayoutLMv2Config]. This feature map is then flattened to obtain a sequence of image tokens. As the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text tokens and sent through the Transformer encoder. This means that the last hidden states of the model will have a length of 512 + 49 = 561, if you pad the text tokens up to the max length. More generally, the last hidden states will have a length of seq_length + config.image_feature_pool_shape[0] * config.image_feature_pool_shape[1] (see the sketch after this list).
  • When calling [~transformers.LayoutLMv2Model.from_pretrained], a warning will be printed with a long list of parameter names that are not initialized. This is not a problem, as these parameters are batch normalization statistics, which will be filled in when fine-tuning on a custom dataset.
  • If you want to train the model in a distributed environment, make sure to call [synchronize_batch_norm] on the model in order to properly synchronize the batch normalization layers of the visual backbone.
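
As a quick sanity check of the shapes described in this list, the following is a minimal sketch (assuming detectron2 is installed and using a placeholder "document.png" image) that pads the text tokens to the maximum length and inspects the length of the last hidden states:

import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2Model

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")
encoding = processor(image, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

print(model.config.image_feature_pool_shape)  # [7, 7, 256] by default
print(outputs.last_hidden_state.shape)  # torch.Size([1, 561, 768]): 512 text tokens + 49 image tokens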

In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on LayoutXLM's documentation page.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Usage: LayoutLMv2Processor

The easiest way to prepare data for the model is to use [LayoutLMv2Processor], which internally combines an image processor ([LayoutLMv2ImageProcessor]) and a tokenizer ([LayoutLMv2Tokenizer] or [LayoutLMv2TokenizerFast]). The image processor handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one modality.

from transformers import LayoutLMv2ImageProcessor, LayoutLMv2TokenizerFast, LayoutLMv2Processor

image_processor = LayoutLMv2ImageProcessor()  # apply_ocr is set to True by default
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(image_processor, tokenizer)

In short, one can provide a document image (and possibly additional data) to [LayoutLMv2Processor], and it will create the inputs expected by the model. Internally, the processor first uses [LayoutLMv2ImageProcessor] to apply OCR on the image to get a list of words and normalized bounding boxes, as well as to resize the image to a given size in order to get the image input. The words and normalized bounding boxes are then provided to [LayoutLMv2Tokenizer] or [LayoutLMv2TokenizerFast], which converts them into token-level input_ids, attention_mask, token_type_ids and bbox. Optionally, one can provide word labels to the processor, which are turned into token-level labels.

[LayoutLMv2Processor] uses PyTesseract, a Python wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of choice, and provide the words and normalized boxes yourself. This requires initializing [LayoutLMv2ImageProcessor] with apply_ocr set to False.
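
As an alternative to the no_ocr revision used in some of the examples below, such a processor can also be assembled by hand; a minimal sketch:

from transformers import LayoutLMv2ImageProcessor, LayoutLMv2TokenizerFast, LayoutLMv2Processor

# disable the built-in OCR; words and (normalized) boxes must then be provided by you
image_processor = LayoutLMv2ImageProcessor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(image_processor, tokenizer)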

In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these use cases works for both batched and non-batched inputs (we illustrate them for non-batched inputs).

Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr = True

This is the simplest case, in which the processor (actually the image processor) will perform OCR on the image to get the words and normalized bounding boxes.

from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
).convert("RGB")
encoding = processor(
    image, return_tensors="pt"
)  # you can also add all tokenizer parameters here such as padding, truncation
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
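
The resulting encoding can be passed directly to a model. As a minimal sketch of document image classification, the snippet below feeds it into [LayoutLMv2ForSequenceClassification]; the num_labels value is only an example (e.g. the 16 classes of RVL-CDIP):

import torch
from transformers import LayoutLMv2ForSequenceClassification

model = LayoutLMv2ForSequenceClassification.from_pretrained("microsoft/layoutlmv2-base-uncased", num_labels=16)

with torch.no_grad():
    outputs = model(**encoding)

predicted_class = outputs.logits.argmax(-1).item()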

Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False

If you want to perform OCR yourself, you can initialize the image processor with apply_ocr set to False. In that case, you should provide the words and corresponding (normalized) bounding boxes to the processor yourself.

from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
).convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])

Use case 3: token classification (training), apply_ocr=False

For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word labels in order to train a model. The processor will then convert these into token-level labels. By default, it will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the ignore_index of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can initialize the tokenizer with only_label_first_subword set to False.

from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
).convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
word_labels = [1, 2]
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
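
Since the encoding already contains a labels key, it can be fed straight into [LayoutLMv2ForTokenClassification] to obtain a training loss. A minimal sketch (num_labels=3 is only an example matching the word labels above):

from transformers import LayoutLMv2ForTokenClassification

model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutlmv2-base-uncased", num_labels=3)

outputs = model(**encoding)  # the labels in the encoding are used to compute the loss
loss = outputs.loss
loss.backward()  # in a real training loop, follow this with an optimizer step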

Use case 4: visual question answering (inference), apply_ocr=True

For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the processor will apply OCR on the image, and create [CLS] question tokens [SEP] word tokens [SEP].

from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
).convert("RGB")
question = "What's his name?"
encoding = processor(image, question, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
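
The encoding can then be passed to [LayoutLMv2ForQuestionAnswering] to predict an answer span over the OCR'd words. A minimal sketch (note that the base checkpoint below is not fine-tuned on DocVQA, so for good answers you should use a fine-tuned checkpoint):

import torch
from transformers import LayoutLMv2ForQuestionAnswering

model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

with torch.no_grad():
    outputs = model(**encoding)

# take the most likely start and end positions and decode the corresponding tokens
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding["input_ids"][0][start : end + 1])
print(answer)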

Use case 5: visual question answering (inference), apply_ocr=False

For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor.

from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
).convert("RGB")
question = "What's his name?"
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])

LayoutLMv2Config

[[autodoc]] LayoutLMv2Config

LayoutLMv2FeatureExtractor

[[autodoc]] LayoutLMv2FeatureExtractor - __call__

LayoutLMv2ImageProcessor

[[autodoc]] LayoutLMv2ImageProcessor - preprocess

LayoutLMv2Tokenizer

[[autodoc]] LayoutLMv2Tokenizer - __call__ - save_vocabulary

LayoutLMv2TokenizerFast

[[autodoc]] LayoutLMv2TokenizerFast - __call__

LayoutLMv2Processor

[[autodoc]] LayoutLMv2Processor - __call__

LayoutLMv2Model

[[autodoc]] LayoutLMv2Model - forward

LayoutLMv2ForSequenceClassification

[[autodoc]] LayoutLMv2ForSequenceClassification

LayoutLMv2ForTokenClassification

[[autodoc]] LayoutLMv2ForTokenClassification

LayoutLMv2ForQuestionAnswering

[[autodoc]] LayoutLMv2ForQuestionAnswering