XDoc
Introduction
XDoc is a unified pre-trained model that handles different document formats in a single model. Using only 36.7% of the parameters of the separate format-specific models combined, XDoc achieves comparable or better performance on downstream tasks, which makes it cost-effective for real-world deployment.
XDoc: Unified Pre-training for Cross-Format Document Understanding. Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. EMNLP 2022.
The overview of our framework is as follows:

Download
Pre-trained Model
Model | Download |
---|---|
xdoc-pretrain-roberta-1M | xdoc-base |
Fine-tuned Models
Model | Download |
---|---|
xdoc-squad1.1 | xdoc-squad1.1 |
xdoc-squad2.0 | xdoc-squad2.0 |
xdoc-funsd | xdoc-funsd |
xdoc-websrc | xdoc-websrc |
Fine-tune
SQuAD
The dataset will be downloaded automatically. Please refer to ./fine_tuning/squad/.
Installation
pip install -r requirements.txt
Train
To train XDoc on SQuAD v1.1:
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./v1_result \
--overwrite_output_dir
To train XDoc on SQuAD v2.0:
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base \
--dataset_name squad_v2 \
--do_train \
--do_eval \
--version_2_with_negative \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./v2_result \
--overwrite_output_dir
Test
To test XDoc on SQuAD v1.1:
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base-squad1.1 \
--dataset_name squad \
--do_eval \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./squadv1.1_result \
--overwrite_output_dir
To test XDoc on SQuAD v2.0:
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base-squad2.0 \
--dataset_name squad_v2 \
--do_eval \
--version_2_with_negative \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./squadv2.0_result \
--overwrite_output_dir
FUNSD
The dataset will be downloaded automatically. Please refer to ./fine_tuning/funsd/.
Installation
pip install -r requirements.txt
You also need to install detectron2. For example, if you use torch 1.8 with CUDA 10.1, you can use the following command:
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html
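For other CUDA and torch versions, the wheel index URL changes accordingly. The helper below is a minimal sketch that rebuilds the URL from the two version strings. The function name `detectron2_wheel_index` is hypothetical, and the URL pattern is inferred from the single example above, so check that a prebuilt wheel actually exists for your combination before relying on it.

```python
def detectron2_wheel_index(cuda: str, torch: str) -> str:
    """Build a detectron2 wheel index URL from version strings.

    Assumption: the URL follows the same pattern as the cu101/torch1.8
    example in this README. Not every CUDA/torch pair has prebuilt wheels.
    """
    cuda_tag = "cu" + cuda.replace(".", "")  # "10.1" -> "cu101"
    return (
        "https://dl.fbaipublicfiles.com/detectron2/wheels/"
        f"{cuda_tag}/torch{torch}/index.html"
    )


# Reproduces the URL used in the command above.
print(detectron2_wheel_index("10.1", "1.8"))
```

The printed URL can be passed to `pip install detectron2 -f <url>`.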
Train
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
--model_name_or_path microsoft/xdoc-base \
--output_dir camera_ready_funsd_1M \
--do_train \
--do_eval \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16 \
--overwrite_output_dir \
--seed 42
Test
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
--model_name_or_path microsoft/xdoc-base-funsd \
--output_dir camera_ready_funsd_1M \
--do_eval \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16 \
--overwrite_output_dir \
--seed 42
WebSRC
The dataset must be downloaded manually. After downloading, modify the arguments --web_train_file, --web_eval_file, web_root_dir, and root_dir in args.py.
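Because these paths are set by hand, it can help to confirm they exist before starting a run. The following is a minimal sketch and not part of the XDoc code: the function `missing_websrc_paths` and the example paths are hypothetical placeholders, and only the four setting names come from the text above.

```python
from pathlib import Path


def missing_websrc_paths(settings: dict) -> list:
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, value in settings.items() if not Path(value).exists()]


if __name__ == "__main__":
    # Placeholder locations (assumptions); use the values you put in args.py.
    settings = {
        "web_train_file": "/path/to/websrc/train_file",
        "web_eval_file": "/path/to/websrc/eval_file",
        "web_root_dir": "/path/to/websrc",
        "root_dir": "/path/to/websrc",
    }
    missing = missing_websrc_paths(settings)
    if missing:
        print("Missing paths for:", ", ".join(missing))
    else:
        print("All WebSRC paths found.")
```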
Installation
pip install -r requirements.txt
Train
CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train True --do_eval True --model_name_or_path microsoft/xdoc-base
Test
CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train False --do_eval True --model_name_or_path microsoft/xdoc-base-websrc
Result
- To verify model accuracy, we use the GLUE benchmark and SQuAD to evaluate plain-text understanding, FUNSD and DocVQA to evaluate document understanding, and WebSRC to evaluate web-text understanding. Experimental results demonstrate that XDoc achieves comparable or even better performance on these tasks.
Model | MNLI-m | QNLI | SST2 | MRPC | SQuAD1.1/2.0 | FUNSD | DocVQA | WebSRC |
---|---|---|---|---|---|---|---|---|
RoBERTa | 87.6 | 92.8 | 94.8 | 90.2 | 92.2/83.4 | - | - | - |
LayoutLM | - | - | - | - | - | 79.3 | 69.2 | - |
MarkupLM | - | - | - | - | - | - | - | 74.5 |
XDoc(Ours) | 86.8 | 92.3 | 95.3 | 91.1 | 92.0/83.5 | 89.4 | 72.7 | 74.8 |
- With only 36.7% of the parameters of the individual pre-trained models combined, XDoc achieves comparable or even better performance on a variety of downstream tasks, which makes it cost-effective for real-world deployment.
Model | Word | 1D Position | Transformer | 2D Position | XPath | Adaptive | Total |
---|---|---|---|---|---|---|---|
RoBERTa | ✓ | ✓ | ✓ | - | - | - | 128M |
LayoutLM | ✓ | ✓ | ✓ | ✓ | - | - | 131M |
MarkupLM | ✓ | ✓ | ✓ | - | ✓ | - | 139M |
XDoc(Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 146M |
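The numbers in the two tables above can be checked with a few lines of arithmetic. The sketch below copies the scores and parameter counts from the tables, computes the per-task difference between XDoc and the format-specific baseline, and computes the parameter ratio. Reading the 36.7% figure as XDoc's size relative to the three individual models combined is an assumption, but it matches the totals listed.

```python
# Scores copied from the results table (SQuAD column split into 1.1 and 2.0).
xdoc = {"MNLI-m": 86.8, "QNLI": 92.3, "SST2": 95.3, "MRPC": 91.1,
        "SQuAD1.1": 92.0, "SQuAD2.0": 83.5,
        "FUNSD": 89.4, "DocVQA": 72.7, "WebSRC": 74.8}
# Baseline per task: RoBERTa (text), LayoutLM (documents), MarkupLM (web).
baseline = {"MNLI-m": 87.6, "QNLI": 92.8, "SST2": 94.8, "MRPC": 90.2,
            "SQuAD1.1": 92.2, "SQuAD2.0": 83.4,
            "FUNSD": 79.3, "DocVQA": 69.2, "WebSRC": 74.5}

# Positive values mean XDoc scores higher than the baseline.
deltas = {task: round(xdoc[task] - baseline[task], 1) for task in xdoc}
for task, delta in deltas.items():
    print(f"{task:9s} {delta:+.1f}")

# Parameter counts in millions, from the second table.
individual_params_m = {"RoBERTa": 128, "LayoutLM": 131, "MarkupLM": 139}
xdoc_params_m = 146

# Assumption: 36.7% compares XDoc with the sum of the three models.
ratio = xdoc_params_m / sum(individual_params_m.values())
print(f"XDoc / combined individual models: {ratio:.1%}")
```

Under this reading, one 146M-parameter model replaces three models totalling 398M parameters, while the per-task differences range from -0.8 (MNLI-m) to +10.1 (FUNSD).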
Citation
If you find XDoc helpful, please cite us:
@article{chen2022xdoc,
title={XDoc: Unified Pre-training for Cross-Format Document Understanding},
author={Chen, Jingye and Lv, Tengchao and Cui, Lei and Zhang, Cha and Wei, Furu},
journal={arXiv preprint arXiv:2210.02849},
year={2022}
}
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.
Microsoft Open Source Code of Conduct
Contact
For help or issues using XDoc, please submit a GitHub issue.
For other communications, please contact Lei Cui (lecu@microsoft.com) or Furu Wei (fuwei@microsoft.com).