DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba, released under an Apache 2.0 license 😍📝 time to dive in and learn more 🧶

image_1

This model consists of a ViT-based visual encoder that takes in crops of the image along with the original image itself. The encoder outputs then go through a convolution-based module, get merged with the text, and are fed to the LLM.
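To make that dataflow concrete, here is a rough PyTorch sketch of an H-Reducer-style module; the module names, kernel size, and dimensions are my own illustrative assumptions, not the actual mPLUG-DocOwl code:

```python
import torch
import torch.nn as nn

class HReducer(nn.Module):
    """Convolution-based reducer: shrinks the visual token grid horizontally
    and projects it into the LLM embedding space (all sizes illustrative)."""
    def __init__(self, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.conv = nn.Conv2d(vit_dim, vit_dim, kernel_size=(1, 4), stride=(1, 4))
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, feats):                 # feats: (B, vit_dim, H, W) ViT feature grid
        x = self.conv(feats)                  # (B, vit_dim, H, W/4): 4x fewer tokens per row
        x = x.flatten(2).transpose(1, 2)      # (B, H*W/4, vit_dim): token sequence
        return self.proj(x)                   # (B, H*W/4, llm_dim): ready for the LLM

# Dataflow: each crop (and the resized full image) is encoded by the ViT,
# reduced by the H-Reducer, then the visual tokens are concatenated with the
# text embeddings and passed to the LLM.
crop_feats = torch.randn(1, 1024, 32, 32)     # dummy ViT features for one crop
text_embeds = torch.randn(1, 16, 4096)        # dummy prompt embeddings
visual_tokens = HReducer()(crop_feats)        # (1, 256, 4096)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # what the LLM sees
```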

image_2

Initially, the authors only train the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen. Then, for fine-tuning (on image captioning, VQA, etc.), they freeze the vision encoder and train the H-Reducer and the LLM.
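A minimal sketch of that two-stage freezing recipe, assuming a model object with hypothetical vision_encoder / h_reducer / llm attributes:

```python
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: train vision encoder + H-Reducer, keep the LLM frozen
def configure_stage1(model):
    set_trainable(model.vision_encoder, True)
    set_trainable(model.h_reducer, True)
    set_trainable(model.llm, False)

# Stage 2 (fine-tuning on captioning, VQA, ...): freeze the vision encoder,
# train H-Reducer + LLM
def configure_stage2(model):
    set_trainable(model.vision_encoder, False)
    set_trainable(model.h_reducer, True)
    set_trainable(model.llm, True)
```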

image_3

They also use a simple linear projection on the text and the documents. You can see below how they model the text prompts and outputs 🤓

image_4

They train the model on various downstream tasks, including:

  • document understanding (DUE benchmark and more)
  • table parsing (TURL, PubTabNet)
  • chart parsing (PlotQA and more)
  • image parsing (OCR-CC)
  • text localization (DocVQA and more)

image_5

They contribute a new model called DocOwl 1.5-Chat by:

  1. creating a new document-chat dataset with questions from document VQA datasets
  2. feeding them to ChatGPT to get long answers
  3. fine-tuning the base model on it (which IMO works very well! see the sketch below)
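A minimal sketch of that data pipeline, assuming an OpenAI-style chat client; the prompt wording, model choice, and field names are illustrative, not the authors' actual setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_answer(question: str, short_answer: str) -> str:
    """Ask ChatGPT to turn a short VQA answer into a long, detailed one."""
    prompt = (
        "Given a question about a document and its short ground-truth answer, "
        "write a detailed, well-explained answer.\n"
        f"Question: {question}\nShort answer: {short_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Toy document-VQA records (hypothetical) to turn into a chat-style dataset
doc_vqa_samples = [
    {"image": "invoice_001.png",
     "question": "What is the total amount due?",
     "answer": "$1,250.00"},
]

chat_dataset = [
    {"image": ex["image"], "question": ex["question"],
     "answer": expand_answer(ex["question"], ex["answer"])}
    for ex in doc_vqa_samples
]
```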

image_6

The resulting generalist model and the chat model are pretty much state-of-the-art 😍 Below you can see how they compare to fine-tuned models

image_7

Very good paper, read it here.
All the models and the datasets (also some eval datasets on above tasks!) are in this organization.
The Space.

Thanks a lot for reading!

image_8

Resources:
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024) GitHub

Original tweet (April 22, 2024)