---
license: mit
---

# [Unifying Vision, Text, and Layout for Universal Document Processing (CVPR 2023 Highlight)](https://arxiv.org/pdf/2212.02623)

[Zineng Tang](https://zinengtang.github.io/),
[Ziyi Yang](https://ziyi-yang.github.io/),
[Guoxin Wang](https://www.guoxwang.com/),
[Yuwei Fang](https://www.microsoft.com/en-us/research/people/yuwfan/),
[Yang Liu](https://nlp-yang.github.io/),
[Chenguang Zhu](https://cs.stanford.edu/people/cgzhu/),
[Michael Zeng](https://www.microsoft.com/en-us/research/people/nzeng/),
[Cha Zhang](https://www.microsoft.com/en-us/research/people/chazhang/),
[Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

Open-source checklist:

- [x] Release model (encoder + text decoder)
- [x] Release most scripts
- [ ] Vision decoder / weights (due to ethical considerations around fake document generation, we plan to release this functionality as an Azure API)
- [x] Demo

## Introduction

UDOP unifies vision, text, and layout through a Vision-Text-Layout Transformer and unified generative pretraining tasks spanning vision, text, layout, and mixed modalities. The figure below shows the task prompts (left) and task targets (right) for all self-supervised objectives (joint text-layout reconstruction, visual text recognition, layout modeling, and masked autoencoding) and two example supervised objectives (question answering and layout analysis).

<p align="center">
<img align="middle" width="800" src="assets/udop.png"/>
</p>
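
Layout is fed to the model by discretizing bounding-box coordinates into location tokens that live in the text vocabulary. A minimal illustrative sketch of that idea — the bin count and token format here are assumptions, not the repo's exact tokenizer:

```python
def bbox_to_layout_tokens(bbox, bins=500):
    """Quantize a normalized (x0, y0, x1, y1) bounding box into discrete
    location tokens, in the spirit of UDOP's layout modeling objective.
    Coordinates are assumed to be normalized to [0, 1]."""
    return ["<loc_{}>".format(min(bins - 1, round(c * bins))) for c in bbox]

# A word box in the top-left region of a page:
print(bbox_to_layout_tokens((0.1, 0.2, 0.3, 0.25)))
# → ['<loc_50>', '<loc_100>', '<loc_150>', '<loc_125>']
```

These tokens can then be interleaved with text tokens in prompts and targets, so layout prediction becomes ordinary sequence generation.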
34
+
35
+ ## Install
36
+ ### Setup `python` environment
37
+ ```
38
+ conda create -n UDOP python=3.8 # You can also use other environment.
39
+ ```
40
+ ### Install other dependencies
41
+ ```
42
+ pip install -r requirements.txt
43
+ ```
44
+
45
+ ## Run Scripts
46
+
47
+ Switch model type by:
48
+
49
+ --model_type "UdopDual"
50
+
51
+ --model_type "UdopUnimodel"
52
+
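For illustration, a flag like this is typically resolved to a model variant at startup. The sketch below is hypothetical — the registry and descriptions are not the repo's actual code, only the two `--model_type` values come from the scripts:

```python
import argparse

# Hypothetical registry; the real repo maps these names to its model classes.
MODEL_TYPES = {
    "UdopDual": "dual vision-text-layout encoder",
    "UdopUnimodel": "unified vision-text-layout encoder",
}

parser = argparse.ArgumentParser()
parser.add_argument("--model_type", choices=sorted(MODEL_TYPES), default="UdopUnimodel")
args = parser.parse_args(["--model_type", "UdopDual"])
print(MODEL_TYPES[args.model_type])  # → dual vision-text-layout encoder
```

Using `choices` makes an unknown model type fail fast with a clear error instead of silently falling through later in training.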

### Finetuning on RVL-CDIP

Download RVL-CDIP first and update the dataset path in the script. For OCR, you may need to customize the code.

```
bash scripts/finetune_rvlcdip.sh  # finetuning on RVL-CDIP
```

### Finetuning on the DUE Benchmark

Download the [DUE Benchmark baselines](https://github.com/due-benchmark/baselines) and follow their procedure to preprocess the data.

Run the training code (adapted from the benchmarker to our framework) with:

```
bash scripts/finetune_duebenchmark.sh  # finetuning on the DUE Benchmark; switch tasks by changing the dataset path
```

The generated outputs can be evaluated with [due_evaluator](https://github.com/due-benchmark/evaluator).

### Model Checkpoints

The model checkpoints are hosted on the [Hugging Face Hub](https://huggingface.co/ZinengTang/Udop).

## Citation

```
@article{tang2022unifying,
  title={Unifying Vision, Text, and Layout for Universal Document Processing},
  author={Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit},
  journal={arXiv preprint arXiv:2212.02623},
  year={2022}
}
```

## Contact

Zineng Tang (zn.tang.terran@gmail.com)