XDoc

Introduction

XDoc is a unified pre-trained model that deals with different document formats in a single model. With only 36.7% parameters, XDoc achieves comparable or better performance on downstream tasks, which is cost-effective for real-world deployment.

XDoc: Unified Pre-training for Cross-Format Document Understanding Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei, EMNLP 2022

The overview of our framework is as follows:

Download

Pre-trained Model

Model	Download
xdoc-pretrain-roberta-1M	xdoc-base

Fine-tuning Models

Model	Download
xdoc-squad1.1	xdoc-squad1.1
xdoc-squad2.0	xdoc-squad2.0
xdoc-funsd	xdoc-funsd
xdoc-websrc	xdoc-websrc

Fine-tune

SQuAD

The dataset will be automatically downloaded. Please refer to ./fine_tuning/squad/.

Installation

pip install -r requirements.txt

Train

To train XDoc on SQuADv1.1

CUDA_VISIBLE_DEVICES=0 python run_squad.py \
  --model_name_or_path microsoft/xdoc-base \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./v1_result \
  --overwrite_output_dir

To train XDoc on SQuADv2.0

CUDA_VISIBLE_DEVICES=0 python run_squad.py \
  --model_name_or_path microsoft/xdoc-base \
  --dataset_name squad_v2 \
  --do_train \
  --do_eval \
  --version_2_with_negative \
  --per_device_train_batch_size 16 \
  --learning_rate 3e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./v2_result \
  --overwrite_output_dir

Test

To test XDoc on SQuADv1.1

CUDA_VISIBLE_DEVICES=0 python run_squad.py \
  --model_name_or_path microsoft/xdoc-base-squad1.1 \
  --dataset_name squad \
  --do_eval \
  --per_device_train_batch_size 16 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./squadv1.1_result \
  --overwrite_output_dir

To test XDoc on SQuADv2.0

CUDA_VISIBLE_DEVICES=0 python run_squad.py \
  --model_name_or_path microsoft/xdoc-base-squad2.0 \
  --dataset_name squad_v2 \
  --do_eval \
  --version_2_with_negative \
  --per_device_train_batch_size 16 \
  --learning_rate 3e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./squadv2.0_result \
  --overwrite_output_dir

FUNSD

The dataset will be automatically downloaded. Please refer to ./fine_tuning/funsd/.

Installation

pip install -r requirements.txt

Also, you need to install detectron2. For example, if you use torch1.8 with cuda version 10.1, you can use the following command

pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html

Train

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
        --model_name_or_path microsoft/xdoc-base \
        --output_dir camera_ready_funsd_1M \
        --do_train \
        --do_eval \
        --max_steps 1000 \
        --warmup_ratio 0.1 \
        --fp16 \
        --overwrite_output_dir \
        --seed 42

Test

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
        --model_name_or_path microsoft/xdoc-base-funsd \
        --output_dir camera_ready_funsd_1M \
        --do_eval \
        --max_steps 1000 \
        --warmup_ratio 0.1 \
        --fp16 \
        --overwrite_output_dir \
        --seed 42

WebSRC

The dataset will be manually downloaded. After downloading, please modify the argument --web_train_file, --web_eval_file, web_root_dir, and root_dir in args.py.

Installation

pip install -r requirements.txt

Train

CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train True --do_eval True  --model_name_or_path microsoft/xdoc-base

Test

CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train False --do_eval True --model_name_or_path microsoft/xdoc-base-websrc

Result

To verify the model accuracy, we select the GLUE benchmark and SQuAD to evaluate plain text understanding, FUNSD and DocVQA to evaluate doc- ument understanding, and WebSRC for web text understanding. Experimental results have demonstrated that XDoc achieves comparable or even better performance on these tasks.

Model	MNLI-m	QNLI	SST2	MRPC	SQUAD1.1/2.0	FUNSD	DocVQA	WebSRC
RoBERTa	87.6	92.8	94.8	90.2	92.2/83.4	-	-	-
LayoutLM	-	-	-	-	-	79.3	69.2	-
MarkupLM	-	-	-	-	-	-	-	74.5
XDoc(Ours)	86.8	92.3	95.3	91.1	92.0/83.5	89.4	72.7	74.8

With only 36.7% parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost effective for real-world deployment.

Model	Word	1D Position	Transformer	2D Position	XPath	Adaptive	Total
RoBERTa	√	√	√	-	-	-	128M
LayoutLM	√	√	√	√	-	-	131M
MarkupLM	√	√	√	-	√	-	139M
XDoc(Ours)	√	√	√	√	√	√	146M

Citation

If you find XDoc helpful, please cite us:

@article{chen2022xdoc,
  title={XDoc: Unified Pre-training for Cross-Format Document Understanding},
  author={Chen, Jingye and Lv, Tengchao and Cui, Lei and Zhang, Cha and Wei, Furu},
  journal={arXiv preprint arXiv:2210.02849},
  year={2022}
}

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers. Microsoft Open Source Code of Conduct

Contact

For help or issues using XDoc, please submit a GitHub issue.

For other communications, please contact Lei Cui (lecu@microsoft.com), Furu Wei (fuwei@microsoft.com).