DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba, released with an Apache 2.0 license 😍📝 Time to dive in and learn more 🧶 

![image_1](image_1.jpeg)

The model consists of a ViT-based visual encoder that takes in crops of the image along with the original image itself. The encoder outputs then go through a convolution-based module, after which they are merged with the text and fed to the LLM. 

![image_2](image_2.jpeg)
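
To make the data flow above concrete, here is a minimal shape-level sketch in plain Python. It is not the authors' implementation: the merge width of 4, the use of averaging in place of a learned convolution, the visual-then-text ordering, and every function name are assumptions for illustration only.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # rows x cols of patch feature vectors


def merge_horizontal(grid: Grid, width: int = 4) -> Grid:
    """Average every `width` horizontally adjacent feature vectors.

    Stand-in for a learned convolution (assumption): it only reproduces
    the sequence-length reduction, not trained behaviour.
    """
    merged: Grid = []
    for row in grid:
        if len(row) % width != 0:
            raise ValueError("row length must be divisible by width")
        new_row = []
        for start in range(0, len(row), width):
            group = row[start:start + width]
            dim = len(group[0])
            new_row.append([sum(v[d] for v in group) / width for d in range(dim)])
        merged.append(new_row)
    return merged


def flatten(grid: Grid) -> List[Vector]:
    """Turn a 2D grid of features into a 1D token sequence (row-major)."""
    return [vec for row in grid for vec in row]


def build_llm_input(image_grids: List[Grid], text_tokens: List[Vector],
                    width: int = 4) -> List[Vector]:
    """Reduce each encoded image (crops plus the full image), then append text.

    The visual-then-text ordering is an assumption of this sketch.
    """
    visual_tokens: List[Vector] = []
    for grid in image_grids:
        visual_tokens.extend(flatten(merge_horizontal(grid, width)))
    return visual_tokens + text_tokens
```

With two encoded images of 2x8 patches each and one text token, `build_llm_input` returns 2 * (2 * 2) + 1 = 9 tokens, which shows how the horizontal merge shortens the visual sequence before it reaches the LLM.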

Initially, the authors train only the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen. Then, for fine-tuning (on image captioning, VQA, etc.), they freeze the vision encoder and train the H-Reducer and the LLM. 

![image_3](image_3.jpeg)
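
The two-stage schedule above boils down to a small table of which parts are trainable at each stage. Below is a minimal sketch of that bookkeeping; the stage names, the `Module` class, and `configure_stage` are hypothetical and only mirror the description in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Module:
    """Placeholder for a model component (hypothetical, not a real API)."""
    name: str
    trainable: bool = True


# Which components receive gradient updates in each stage, as described above.
STAGE_TRAINABLE = {
    "initial": {"vision_encoder", "h_reducer"},   # LLM frozen
    "fine_tuning": {"h_reducer", "llm"},          # vision encoder frozen
}


def configure_stage(modules: List[Module], stage: str) -> List[str]:
    """Set the trainable flag per module and return the names that are frozen."""
    trainable = STAGE_TRAINABLE[stage]
    for module in modules:
        module.trainable = module.name in trainable
    return [m.name for m in modules if not m.trainable]
```

In a PyTorch training script, the same idea would typically be expressed by calling `requires_grad_(False)` on the frozen module before building the optimizer.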

They also use a simple linear projection on text and documents. You can see below how they model the text prompts and outputs 🤓 

![image_4](image_4.jpeg)

They train the model on various downstream tasks, including:  
- document understanding (DUE benchmark and more)  
- table parsing (TURL, PubTabNet)  
- chart parsing (PlotQA and more)  
- image parsing (OCR-CC)  
- text localization (DocVQA and more) 

![image_5](image_5.jpeg)

They contribute a new model called DocOwl 1.5-Chat by:  
1. creating a new document-chat dataset with questions from document VQA datasets  
2. feeding them to ChatGPT to get long answers  
3. fine-tuning the base model with it (which IMO works very well!) 

![image_6](image_6.jpeg)
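
The dataset recipe above can be sketched as a small function that takes existing document VQA samples and swaps each short answer for a long one. This is only an illustration: the field names are made up, and `expand_answer` is a hypothetical callable standing in for the ChatGPT call, whose actual prompt is not described here.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]


def build_chat_dataset(
    vqa_samples: List[Sample],
    expand_answer: Callable[[Sample], str],
) -> List[Sample]:
    """Create chat-style samples from document VQA samples.

    `expand_answer` is a placeholder for whatever produces the long answer
    (an assumption of this sketch); it is injected so the pipeline stays
    testable without calling any external service.
    """
    chat_samples: List[Sample] = []
    for sample in vqa_samples:
        chat_samples.append({
            "image": sample["image"],
            "question": sample["question"],
            "answer": expand_answer(sample),  # long answer replaces the short one
        })
    return chat_samples
```

The resulting list is what the fine-tuning step would consume; keeping the answer generator as a parameter makes it easy to swap in a different model or a cached set of responses.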

The resulting generalist model and the chat model are pretty much state-of-the-art 😍 Below you can see how they compare to fine-tuned models. 

![image_7](image_7.jpeg)

Very good paper, read it [here](https://t.co/T23JOAPkv1).  
All the models and the datasets (also some eval datasets on above tasks!) are in this [organization](https://t.co/sJdTw1jWTR).  
The [Space](https://t.co/57E9DbNZXf). 

Thanks a lot for reading! 

![image_8](image_8.jpeg)

> [!TIP]
Resources:  
[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895) 
by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024) 
[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)  

> [!NOTE]
[Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024)