LayoutLMv3

Model description

LayoutLMv3 is a pre-trained multimodal Transformer for Document AI with unified text and image masking. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model. For example, LayoutLMv3 can be fine-tuned for both text-centric tasks, including form understanding, receipt understanding, and document visual question answering, and image-centric tasks such as document image classification and document layout analysis.

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. Preprint, 2022.

Results

Results on XFUND

Language	Precision	Recall	F1
ZH	0.8910	0.9374	0.9136
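
As a quick sanity check on the row above, the reported F1 agrees with the harmonic mean of the reported precision and recall. The sketch below only reproduces that arithmetic. The helper name `f1_from_precision_recall` is hypothetical and is not part of any LayoutLMv3 evaluation script.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the XFUND (ZH) row above.
zh_precision = 0.8910
zh_recall = 0.9374

zh_f1 = f1_from_precision_recall(zh_precision, zh_recall)
print(f"{zh_f1:.4f}")  # prints 0.9136
```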

Results on EPHOIE

Subject	Test Time	Name	School	Examination Number	Seat Number	Class	Student Number	Grade	Score	Mean
98.48	100	99.36	98.86	100	100	98.73	98.89	97.59	97.78	98.97
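
The Mean column is consistent with the unweighted average of the ten per-field scores in the row above. A minimal check, assuming a plain arithmetic mean (the dictionary name `ephoie_scores` is only illustrative):

```python
from statistics import mean

# Per-field scores copied from the EPHOIE row above, excluding the Mean column.
ephoie_scores = {
    "Subject": 98.48,
    "Test Time": 100.0,
    "Name": 99.36,
    "School": 98.86,
    "Examination Number": 100.0,
    "Seat Number": 100.0,
    "Class": 98.73,
    "Student Number": 98.89,
    "Grade": 97.59,
    "Score": 97.78,
}

# Unweighted average over the ten fields.
ephoie_mean = mean(ephoie_scores.values())
print(f"{ephoie_mean:.2f}")  # prints 98.97
```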

Citation

If you find LayoutLMv3 useful in your research, please cite the following paper:

@article{huang2022layoutlmv3,
  title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
  author={Yupan Huang and Tengchao Lv and Lei Cui and Yutong Lu and Furu Wei},
  journal={arXiv preprint arXiv:2204.08387},
  year={2022}
}