arXiv:2203.02378

DiT: Self-supervised Pre-training for Document Image Transformer

Published on Mar 4, 2022

Authors: Junlong Li, Yiheng Xu, Tengchao Lv, Furu Wei

Abstract

Image Transformer has recently achieved significant progress for natural image understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection as well as text detection for OCR. Experiment results have illustrated that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55) and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.
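
To make the reported improvements easier to compare, the short sketch below computes the absolute gain for each of the four tasks. The score pairs are copied from the abstract; the names `RESULTS` and `absolute_gains` are illustrative and do not come from the paper or its code release. The abstract does not name the metric behind each score, so the gains are plain differences in whatever unit each benchmark reports.

```python
# (previous best, DiT) score pairs as quoted in the abstract.
# The abstract does not state which metric each pair uses.
RESULTS = {
    "document image classification": (91.11, 92.69),
    "document layout analysis": (91.0, 94.9),
    "table detection": (94.23, 96.55),
    "text detection for OCR": (93.07, 94.29),
}


def absolute_gains(results):
    """Return the gain (new score minus previous best) per task, rounded to 2 decimals."""
    return {task: round(new - prev, 2) for task, (prev, new) in results.items()}


if __name__ == "__main__":
    for task, gain in absolute_gains(RESULTS).items():
        print(f"{task}: +{gain:.2f}")
```

By this calculation the largest absolute gain is on document layout analysis and the smallest is on text detection for OCR.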

Models citing this paper 4

Datasets citing this paper 0

Spaces citing this paper 15

Collections including this paper 1