merve
posted an update Mar 11
New foundation model for document understanding and generation in transformers 🤩
UDOP by MSFT is a bleeding-edge model capable of many tasks, including question answering, document editing and more! 🤯
Demo 👉 merve/UDOP
It is a model that combines vision, text and layout. 📝
This model is very interesting because the input representation truly captures the nature of the document modality: the text, where the text is located, and the overall layout of the document all matter!
If you know T5, UDOP resembles it: it's pre-trained on both self-supervised and supervised objectives over text, image and layout.
To switch between tasks, one simply changes the task-specific prompt at the beginning, e.g. for QA, one prepends "Question answering." See the sketch below.
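Here's a minimal sketch of what that looks like with the transformers integration (UdopProcessor / UdopForConditionalGeneration); the checkpoint name follows the collection linked below, while the image file, OCR words, boxes and question are hypothetical placeholders:

```python
# Minimal sketch, assuming you've already run OCR on the document yourself
# (hence apply_ocr=False); file name, words, boxes and question are made up.
from PIL import Image
from transformers import UdopProcessor, UdopForConditionalGeneration

processor = UdopProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

image = Image.open("invoice.png").convert("RGB")    # hypothetical document scan
words = ["Invoice", "date:", "11/03/2024"]          # hypothetical OCR output
boxes = [[50, 40, 160, 70], [170, 40, 240, 70], [250, 40, 400, 70]]  # 0-1000 normalized, one per word

# The task-specific prefix at the start of the text selects the task;
# swap it out to switch tasks.
prompt = "Question answering. What is the invoice date?"

encoding = processor(image, prompt, words, boxes=boxes, return_tensors="pt")
outputs = model.generate(**encoding, max_new_tokens=20)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```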
As for the architecture, it's like T5, except it has a single encoder that takes in text, image and layout, and two decoders (a text-layout decoder and a vision decoder) combined into a single model.
The vision decoder is a masked autoencoder (hence the document editing capabilities).
For me, the most interesting capabilities are document reconstruction, document editing and layout re-arrangement. This decoder isn't released, though, as it could be used maliciously to forge documents.
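If you're curious, you can inspect what did ship: a quick sketch (same assumed checkpoint as above) printing the T5-style sub-modules transformers exposes:

```python
# Sketch: peek at the released structure, assuming the same checkpoint as above.
from transformers import UdopForConditionalGeneration

model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")
print(type(model.get_encoder()).__name__)  # the single unified encoder
print(type(model.get_decoder()).__name__)  # the text-layout decoder
```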
Overall, the model performs very well on the document understanding benchmark (DUE), as well as on information extraction (FUNSD, CORD) and document classification (RVL-CDIP), across the vision, text and layout modalities.
You can learn more about the model from the resources below (h/t to @nielsr). Thanks a lot for reading 🤗
Docs: https://huggingface.co/docs/transformers/main/en/model_doc/udop 📚
Checkpoints: microsoft/udop-65e625124aee97415b88b513
Demo notebooks: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP 📕