	# XDoc

	## Introduction

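	XDoc is a unified pre-trained model that handles plain text, document, and web text formats in a single model. With only 36.7% of the combined parameters of the individual pre-trained models it is compared against (RoBERTa, LayoutLM, and MarkupLM), XDoc achieves comparable or better performance on downstream tasks, which makes it cost-effective for real-world deployment.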

	[XDoc: Unified Pre-training for Cross-Format Document Understanding](https://arxiv.org/abs/2210.02849)
	Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei, EMNLP 2022

	The overview of our framework is as follows:

	<div align="center">
	<img src="./architecture.png" width="100%" height="100%" />
	</div>

	## Download


	### Pre-trained Model
	| Model | Download |
	| -------- | -------- |
	| xdoc-pretrain-roberta-1M | [xdoc-base](https://huggingface.co/microsoft/xdoc-base) |

	### Fine-tuning Models
	| Model | Download |
	| -------- | -------- |
	| xdoc-squad1.1 | [xdoc-squad1.1](https://huggingface.co/microsoft/xdoc-base-squad1.1) |
	| xdoc-squad2.0 | [xdoc-squad2.0](https://huggingface.co/microsoft/xdoc-base-squad2.0) |
	| xdoc-funsd | [xdoc-funsd](https://huggingface.co/microsoft/xdoc-base-funsd) |
	| xdoc-websrc | [xdoc-websrc](https://huggingface.co/microsoft/xdoc-base-websrc) |



	## Fine-tune


	### SQuAD
	The dataset is downloaded automatically. The fine-tuning code is in ```./fine_tuning/squad/```.

	#### Installation
	```bash
	pip install -r requirements.txt
	```

	#### Train
	To train XDoc on SQuAD v1.1:

	```bash
	CUDA_VISIBLE_DEVICES=0 python run_squad.py \
	--model_name_or_path microsoft/xdoc-base \
	--dataset_name squad \
	--do_train \
	--do_eval \
	--per_device_train_batch_size 16 \
	--learning_rate 3e-5 \
	--num_train_epochs 2 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir ./v1_result \
	--overwrite_output_dir
	```

	To train XDoc on SQuAD v2.0:

	```bash
	CUDA_VISIBLE_DEVICES=0 python run_squad.py \
	--model_name_or_path microsoft/xdoc-base \
	--dataset_name squad_v2 \
	--do_train \
	--do_eval \
	--version_2_with_negative \
	--per_device_train_batch_size 16 \
	--learning_rate 3e-5 \
	--num_train_epochs 4 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir ./v2_result \
	--overwrite_output_dir
	```
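
Both training commands above pass `--max_seq_length 384` and `--doc_stride 128`. The sketch below is not taken from `run_squad.py`; it assumes the script follows the Hugging Face question-answering example, where a context that does not fit in one input is split into chunks and `doc_stride` is the number of tokens shared by neighbouring chunks. The function name `window_spans` and the capacity of 364 context tokens per chunk (384 minus a hypothetical 16-token question and 4 special tokens) are illustrative assumptions.

```python
def window_spans(n_context_tokens, capacity, overlap):
    """Return (start, end) token offsets of overlapping context chunks."""
    if capacity <= overlap:
        raise ValueError("capacity must exceed overlap, or the window never advances")
    step = capacity - overlap  # how far each new chunk moves forward
    spans = []
    start = 0
    while True:
        end = min(start + capacity, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:  # the last chunk reaches the end of the context
            break
        start += step
    return spans


# Hypothetical example: a 1000-token context, room for 364 context tokens
# per chunk, and 128 tokens shared between neighbouring chunks.
spans = window_spans(1000, capacity=364, overlap=128)
print(spans)
```

Under these assumptions the context is covered by four chunks, each starting 236 tokens after the previous one, so an answer near a chunk boundary is likely to appear whole in at least one chunk.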

	#### Test
	To test XDoc on SQuAD v1.1:


	```bash
	CUDA_VISIBLE_DEVICES=0 python run_squad.py \
	--model_name_or_path microsoft/xdoc-base-squad1.1 \
	--dataset_name squad \
	--do_eval \
	--per_device_train_batch_size 16 \
	--learning_rate 3e-5 \
	--num_train_epochs 2 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir ./squadv1.1_result \
	--overwrite_output_dir
	```

	To test XDoc on SQuAD v2.0:

	```bash
	CUDA_VISIBLE_DEVICES=0 python run_squad.py \
	--model_name_or_path microsoft/xdoc-base-squad2.0 \
	--dataset_name squad_v2 \
	--do_eval \
	--version_2_with_negative \
	--per_device_train_batch_size 16 \
	--learning_rate 3e-5 \
	--num_train_epochs 4 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir ./squadv2.0_result \
	--overwrite_output_dir
	```



	### FUNSD
	The dataset is downloaded automatically. The fine-tuning code is in ```./fine_tuning/funsd/```.

	#### Installation

	```bash
	pip install -r requirements.txt
	```

	You also need to install ```detectron2```. For example, with PyTorch 1.8 and CUDA 10.1, use the following command:

	```bash
	pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html
	```
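
The index URL in the command above encodes the CUDA and PyTorch versions. As a rough sketch, the helper below builds a URL of the same shape for other version pairs. `detectron2_wheel_index` is a hypothetical name, not part of XDoc or detectron2, and prebuilt wheels exist only for some combinations, so treat the result as a candidate to check against the detectron2 installation guide.

```python
BASE = "https://dl.fbaipublicfiles.com/detectron2/wheels"


def detectron2_wheel_index(cuda_version, torch_version):
    """Build a wheel index URL shaped like the one used in this README."""
    # "10.1" -> "cu101"
    cuda_tag = "cu" + cuda_version.replace(".", "")
    # Keep only major.minor, dropping any patch or local suffix: "1.8.1+cu101" -> "1.8"
    major, minor = torch_version.split("+")[0].split(".")[:2]
    return f"{BASE}/{cuda_tag}/torch{major}.{minor}/index.html"


# Reproduces the URL from the command above.
print(detectron2_wheel_index("10.1", "1.8"))
```

The printed URL can be passed to `pip install detectron2 -f`, as in the command above.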

	#### Train

	```bash
	CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
	--model_name_or_path microsoft/xdoc-base \
	--output_dir camera_ready_funsd_1M \
	--do_train \
	--do_eval \
	--max_steps 1000 \
	--warmup_ratio 0.1 \
	--fp16 \
	--overwrite_output_dir \
	--seed 42
	```

	#### Test

	```bash
	CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
	--model_name_or_path microsoft/xdoc-base-funsd \
	--output_dir camera_ready_funsd_1M \
	--do_eval \
	--max_steps 1000 \
	--warmup_ratio 0.1 \
	--fp16 \
	--overwrite_output_dir \
	--seed 42
	```

	### WebSRC
	The dataset must be downloaded manually. After downloading, please modify the arguments ```--web_train_file```, ```--web_eval_file```, ```web_root_dir```, and ```root_dir``` in args.py.
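
Since these four settings point at locations on disk, it can help to confirm they exist before starting a run. The check below is a minimal sketch and not part of the XDoc code; the helper `missing_paths` and every path shown are hypothetical placeholders for the values you set in args.py.

```python
from pathlib import Path


def missing_paths(settings):
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, path in settings.items() if not Path(path).exists()]


# Hypothetical locations; replace them with the values you set in args.py.
websrc_settings = {
    "web_train_file": "/path/to/websrc/train_file",
    "web_eval_file": "/path/to/websrc/eval_file",
    "web_root_dir": "/path/to/websrc",
    "root_dir": "/path/to/output",
}

missing = missing_paths(websrc_settings)
if missing:
    print("Fix these entries in args.py before training:", ", ".join(missing))
else:
    print("All WebSRC paths exist.")
```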

	#### Installation

	```bash
	pip install -r requirements.txt
	```

	#### Train

	```bash
	CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train True --do_eval True --model_name_or_path microsoft/xdoc-base
	```

	#### Test
	```bash
	CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train False --do_eval True --model_name_or_path microsoft/xdoc-base-websrc
	```


	## Result

	* To verify the model accuracy, we select the GLUE benchmark and SQuAD to evaluate plain text understanding, FUNSD and DocVQA to evaluate document understanding, and WebSRC for web text understanding. Experimental results have demonstrated that XDoc achieves comparable or even better performance on these tasks.

	| Model | MNLI-m | QNLI | SST2 | MRPC | SQuAD1.1/2.0 | FUNSD | DocVQA | WebSRC |
	| :----------: | :------: | :----: | :----: | :----: | :------------: | :-----: | :------: | :------: |
	| RoBERTa | 87.6 | 92.8 | 94.8 | 90.2 | 92.2/83.4 | - | - | - |
	| LayoutLM | - | - | - | - | - | 79.3 | 69.2 | - |
	| MarkupLM | - | - | - | - | - | - | - | 74.5 |
	| XDoc(Ours) | 86.8 | 92.3 | 95.3 | 91.1 | 92.0/83.5 | 89.4 | 72.7 | 74.8 |
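
To make "comparable or even better" concrete, the short script below subtracts each baseline score in the table above from the corresponding XDoc score. It only restates the table; the dictionary layout and variable names are illustrative.

```python
# Scores copied from the table above: (baseline model, baseline score, XDoc score).
scores = {
    "MNLI-m": ("RoBERTa", 87.6, 86.8),
    "QNLI": ("RoBERTa", 92.8, 92.3),
    "SST2": ("RoBERTa", 94.8, 95.3),
    "MRPC": ("RoBERTa", 90.2, 91.1),
    "SQuAD1.1": ("RoBERTa", 92.2, 92.0),
    "SQuAD2.0": ("RoBERTa", 83.4, 83.5),
    "FUNSD": ("LayoutLM", 79.3, 89.4),
    "DocVQA": ("LayoutLM", 69.2, 72.7),
    "WebSRC": ("MarkupLM", 74.5, 74.8),
}

# Positive values mean XDoc scores higher than the format-specific baseline.
deltas = {task: round(xdoc - base, 1) for task, (_, base, xdoc) in scores.items()}

for task, (baseline, _, _) in scores.items():
    print(f"{task:9s} XDoc vs {baseline:8s}: {deltas[task]:+.1f}")
```

By these numbers XDoc is ahead on six of the nine columns and trails by at most 0.8 points on the other three.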

	* With only 36.7% of the combined parameters of the individual pre-trained models (RoBERTa, LayoutLM, and MarkupLM), XDoc achieves comparable or even better performance on a variety of downstream tasks, which is cost-effective for real-world deployment.

	| Model | Word | 1D Position | Transformer | 2D Position | XPath | Adaptive | Total |
	| :----------: | :----: | :-----------: | :-----------: | :-----------: | :-----: | :--------: | :-----: |
	| RoBERTa | √ | √ | √ | - | - | - | 128M |
	| LayoutLM | √ | √ | √ | √ | - | - | 131M |
	| MarkupLM | √ | √ | √ | - | √ | - | 139M |
	| XDoc(Ours) | √ | √ | √ | √ | √ | √ | 146M |
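
The 36.7% figure appears to be XDoc's total divided by the summed totals of the three individual models, that is, the cost of deploying one unified model instead of three separate ones. This reading is inferred from the table rather than stated in it, but the arithmetic below reproduces the number.

```python
# Total parameter counts in millions, copied from the table above.
params_m = {"RoBERTa": 128, "LayoutLM": 131, "MarkupLM": 139, "XDoc": 146}

# Parameters needed to deploy the three format-specific models side by side.
separate_total = sum(count for name, count in params_m.items() if name != "XDoc")

share = params_m["XDoc"] / separate_total
print(f"{separate_total}M separately vs {params_m['XDoc']}M for XDoc: {share:.1%}")
```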



	## Citation

	If you find XDoc helpful, please cite us:
	```
	@article{chen2022xdoc,
	title={XDoc: Unified Pre-training for Cross-Format Document Understanding},
	author={Chen, Jingye and Lv, Tengchao and Cui, Lei and Zhang, Cha and Wei, Furu},
	journal={arXiv preprint arXiv:2210.02849},
	year={2022}
	}
	```


	## License

	This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
	Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.
	[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

	## Contact

	For help or issues using XDoc, please submit a GitHub issue.

	For other communications, please contact Lei Cui (`lecu@microsoft.com`) or Furu Wei (`fuwei@microsoft.com`).