# LayoutLMv3 (Document Foundation Model)
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose **LayoutLMv3** to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.
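The word-patch alignment idea can be illustrated with a small, self-contained sketch: a text token gets an "aligned" label when the image patch under its bounding box is left unmasked by masked image modeling, and an "unaligned" label when that patch is masked; tokens that are themselves masked by masked language modeling are excluded. The snippet below only illustrates this labeling rule and is not the repository's pre-training code; all names in it are hypothetical.
``` python
# Illustrative sketch of word-patch alignment (WPA) targets; not the repo's code.
def wpa_targets(token_boxes, masked_patches, masked_tokens,
                image_size=224, patch_size=16):
    """token_boxes: (x0, y0, x1, y1) pixel boxes, one per text token.
    masked_patches: indices of image patches hidden by masked image modeling.
    masked_tokens: indices of text tokens hidden by masked language modeling."""
    grid = image_size // patch_size  # 14 x 14 patches for a 224 x 224 input
    targets = []
    for i, (x0, y0, x1, y1) in enumerate(token_boxes):
        if i in masked_tokens:
            targets.append(None)  # masked text tokens are excluded from WPA
            continue
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # patch containing the box center
        patch_idx = int(cy // patch_size) * grid + int(cx // patch_size)
        targets.append(0 if patch_idx in masked_patches else 1)  # 1 = aligned
    return targets
```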

## Installation
``` bash
conda create --name layoutlmv3 python=3.7
conda activate layoutlmv3
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutlmv3
pip install -r requirements.txt
# install pytorch, torchvision refer to https://pytorch.org/get-started/locally/
pip install torch==1.10.0+cu111 torchvision==0.11.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
# install detectron2 refer to https://detectron2.readthedocs.io/en/latest/tutorials/install.html
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html
pip install -e .
```
## Pre-trained Models
| Model | Model Name (Path) |
|------------------|---------------------------------------------------------------------------------|
| layoutlmv3-base | [microsoft/layoutlmv3-base](https://huggingface.co/microsoft/layoutlmv3-base) |
| layoutlmv3-large | [microsoft/layoutlmv3-large](https://huggingface.co/microsoft/layoutlmv3-large) |
| layoutlmv3-base-chinese | [microsoft/layoutlmv3-base-chinese](https://huggingface.co/microsoft/layoutlmv3-base-chinese) |
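These checkpoints can also be loaded through the Hugging Face `transformers` integration of LayoutLMv3 (v4.20 or later). The snippet below is a minimal sketch rather than part of this repository: the image path, words, and boxes are placeholder values, and OCR is turned off so that words and boxes (normalized to a 0-1000 scale) are supplied manually.
``` python
# Minimal sketch (not part of this repo): loading a checkpoint from the table
# above via the transformers integration of LayoutLMv3. Paths, words and boxes
# below are placeholders; boxes are expected on a normalized 0-1000 scale.
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("document.png").convert("RGB")  # hypothetical example image
words = ["Invoice", "Date:", "2022-05-01"]         # would normally come from OCR
boxes = [[70, 50, 180, 80], [200, 50, 260, 80], [270, 50, 400, 80]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```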
## Fine-tuning Examples
We provide some fine-tuned models and their train/test logs.
### Form Understanding on FUNSD
* Train
``` bash
python -m torch.distributed.launch \
--nproc_per_node=8 --master_port 4398 examples/run_funsd_cord.py \
--dataset_name funsd \
--do_train --do_eval \
--model_name_or_path microsoft/layoutlmv3-base \
--output_dir /path/to/layoutlmv3-base-finetuned-funsd \
--segment_level_layout 1 --visual_embed 1 --input_size 224 \
--max_steps 1000 --save_steps -1 --evaluation_strategy steps --eval_steps 100 \
--learning_rate 1e-5 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
--dataloader_num_workers 8
```
* Test
``` bash
python -m torch.distributed.launch \
--nproc_per_node=8 --master_port 4398 examples/run_funsd_cord.py \
--dataset_name funsd \
--do_eval \
--model_name_or_path HYPJUDY/layoutlmv3-base-finetuned-funsd \
--output_dir /path/to/layoutlmv3-base-finetuned-funsd \
--segment_level_layout 1 --visual_embed 1 --input_size 224 \
--dataloader_num_workers 8
```
| Model on FUNSD | precision | recall | f1 |
|-----------|:------------|:------:|:--------:|
| [layoutlmv3-base-finetuned-funsd](https://huggingface.co/HYPJUDY/layoutlmv3-base-finetuned-funsd) | 0.8955 | 0.9165 | 0.9059 |
| [layoutlmv3-large-finetuned-funsd](https://huggingface.co/HYPJUDY/layoutlmv3-large-finetuned-funsd) | 0.9219 | 0.9210 | 0.9215 |
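The precision, recall, and F1 above are entity-level scores over predicted BIO tags. As a point of reference, the sketch below shows that style of evaluation with the `seqeval` library; the tag sequences are made-up placeholders, and the evaluation inside `examples/run_funsd_cord.py` may differ in detail.
``` python
# Hedged sketch of entity-level precision/recall/F1 over BIO tags (seqeval);
# the gold and predicted sequences below are placeholders, not FUNSD outputs.
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-QUESTION", "I-QUESTION", "B-ANSWER", "O"]]
y_pred = [["B-QUESTION", "I-QUESTION", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```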
### Document Layout Analysis on PubLayNet
Please follow [unilm/dit/object_detection](https://github.com/microsoft/unilm/blob/master/dit/object_detection/README.md) to prepare the data and for more details on this task.
Run the following commands from the `layoutlmv3/examples/object_detection` folder.
* Train
First, download the [pre-trained models](#pre-trained-models) to `/path/to/microsoft/layoutlmv3-base`, then run:
``` bash
python train_net.py --config-file cascade_layoutlmv3.yaml --num-gpus 16 \
MODEL.WEIGHTS /path/to/microsoft/layoutlmv3-base/pytorch_model.bin \
OUTPUT_DIR /path/to/layoutlmv3-base-finetuned-publaynet
```
* Test
If you want to test the [layoutlmv3-base-finetuned-publaynet](https://huggingface.co/HYPJUDY/layoutlmv3-base-finetuned-publaynet) model, please download it to `/path/to/layoutlmv3-base-finetuned-publaynet`, then run:
``` bash
python train_net.py --config-file cascade_layoutlmv3.yaml --eval-only --num-gpus 8 \
MODEL.WEIGHTS /path/to/layoutlmv3-base-finetuned-publaynet/model_final.pth \
OUTPUT_DIR /path/to/layoutlmv3-base-finetuned-publaynet
```
| Model on PubLayNet | Text | Title | List | Table | Figure | Overall |
|-------------------------------------------------------------------------------------------|:------------|:------:|:------:|-------|--------|---------|
| [layoutlmv3-base-finetuned-publaynet](https://huggingface.co/HYPJUDY/layoutlmv3-base-finetuned-publaynet) | 94.5 | 90.6 | 95.5 | 97.9 | 97.0 | 95.1 |
### Form Understanding on XFUND
This is an example of training and evaluating the LayoutLMv3 Chinese model.
#### Data Preparation
Download the Chinese data of XFUND from this [link](https://github.com/doc-analysis/XFUND/releases/tag/v1.0).
The resulting directory structure looks like the following:
```
├── data
│   ├── zh.train.json
│   ├── zh.val.json
│   └── images
│       ├── zh_train_*.jpg
│       └── zh_val_*.jpg
```
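Each `zh.*.json` file pairs an image with a list of labeled text segments. The snippet below is a hedged sketch for inspecting one file, assuming the standard XFUND annotation layout (a top-level `documents` list whose items carry an `img` record plus a `document` list of segments with `text`, `box`, and `label` fields); field names should be double-checked against the downloaded data.
``` python
# Hedged sketch: peek at the XFUND Chinese training annotations.
# Field names assume the standard XFUND layout; verify against the actual file.
import json

with open("data/zh.train.json", encoding="utf-8") as f:
    data = json.load(f)

doc = data["documents"][0]
print(doc["img"]["fname"], doc["img"]["width"], doc["img"]["height"])
for segment in doc["document"][:5]:
    # each segment carries a text string, a pixel-space box and an entity label
    print(segment["label"], segment["box"], segment["text"])
```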
* Train
``` bash
python -m torch.distributed.launch \
--nproc_per_node=8 --master_port 4398 examples/run_xfund.py \
--data_dir data --language zh \
--do_train --do_eval \
--model_name_or_path microsoft/layoutlmv3-base-chinese \
--output_dir path/to/output \
--segment_level_layout 1 --visual_embed 1 --input_size 224 \
--max_steps 1000 --save_steps -1 --evaluation_strategy steps --eval_steps 20 \
--learning_rate 7e-5 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
--dataloader_num_workers 8
```
* Test
``` bash
python -m torch.distributed.launch \
--nproc_per_node=8 --master_port 4398 examples/run_xfund.py \
--data_dir data --language zh \
--do_eval \
--model_name_or_path path/to/model \
--output_dir /path/to/output \
--segment_level_layout 1 --visual_embed 1 --input_size 224 \
--dataloader_num_workers 8
```
| Pre-trained Model | precision | recall | f1 |
|-----------|:------------|:------:|:--------:|
| [layoutlmv3-base-chinese](https://huggingface.co/microsoft/layoutlmv3-base-chinese) | 0.8980 | 0.9435 | 0.9202 |
We also fine-tuned the LayoutLMv3 Chinese model on [EPHOIE](https://github.com/HCIILAB/EPHOIE) for reference.
| Pre-trained Model | Subject | Test Time | Name | School | Examination Number | Seat Number | Class | Student Number | Grade | Score | **Mean** |
|-----------------|:------------|:------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| [layoutlmv3-base-chinese](https://huggingface.co/microsoft/layoutlmv3-base-chinese) | 98.99 | 100 | 99.77 | 99.2 | 100 | 100 | 98.82 | 99.78 | 98.31 | 97.27 | 99.21 |
## Citation
If you find LayoutLMv3 helpful, please cite us:
```
@inproceedings{huang2022layoutlmv3,
author={Yupan Huang and Tengchao Lv and Lei Cui and Yutong Lu and Furu Wei},
title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
year={2022}
}
```
## Acknowledgement
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers),
[layoutlmv2](https://github.com/microsoft/unilm/tree/master/layoutlmv2),
[layoutlmft](https://github.com/microsoft/unilm/tree/master/layoutlmft),
[beit](https://github.com/microsoft/unilm/tree/master/beit),
[dit](https://github.com/microsoft/unilm/tree/master/dit)
and [Detectron2](https://github.com/facebookresearch/detectron2) projects.
We sincerely thank them for their contributions!
## License
The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
## Contact
For help or issues using LayoutLMv3, please email [Yupan Huang](https://github.com/HYPJUDY) or submit a GitHub issue.
For other communications related to LayoutLM, please contact [Lei Cui](mailto:lecu@microsoft.com) or [Furu Wei](mailto:fuwei@microsoft.com).