---
license: mit
tags:
- vision
inference: false
pipeline_tag: image-text-to-text
---
# UDOP model
The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.
## Model description
UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks like document image classification, document parsing and document visual question answering.
## Intended uses & limitations
You can use the model for document image classification, document parsing and document visual question answering (DocVQA).
### How to use
Here's how to use the model for document visual question answering (DocVQA):
```python
from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset
# load model and processor
# in this case, we have already performed OCR ourselves
# so we initialize the processor with `apply_ocr=False`
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")
# load an example image, along with the words and coordinates
# which were extracted using an OCR engine
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
question = "Question answering. What is the date on the form?"
# prepare everything for the model
encoding = processor(images=image, text=question, text_pair=words, boxes=boxes, return_tensors="pt")
# autoregressive generation
predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
# 9/30/92
```
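The example above relies on words and boxes that were already extracted and prepared in the dataset. If you run your own OCR engine with `apply_ocr=False`, the boxes will typically come back in pixel coordinates. The sketch below shows one way to rescale them, assuming the processor expects `(x0, y0, x1, y1)` boxes on a 0-1000 scale as in the LayoutLM family of models. The helper name `normalize_box` and the sample page size are hypothetical and only serve as illustration.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: int, height: int, scale: int = 1000) -> List[int]:
    """Rescale a pixel-space (x0, y0, x1, y1) box to the range [0, scale].

    Hypothetical helper: the 0-1000 target scale is an assumption based on
    the convention used by LayoutLM-style models.
    """
    x0, y0, x1, y1 = box

    def clamp(value: int) -> int:
        # keep coordinates inside the valid range even if OCR overshoots the page
        return max(0, min(scale, value))

    return [
        clamp(int(scale * x0 / width)),
        clamp(int(scale * y0 / height)),
        clamp(int(scale * x1 / width)),
        clamp(int(scale * y1 / height)),
    ]


# made-up pixel boxes on an 800x1000 page, for illustration only
pixel_boxes = [(50, 100, 250, 140), (0, 0, 800, 1000)]
normalized = [normalize_box(b, width=800, height=1000) for b in pixel_boxes]
print(normalized)
```

The resulting list can then be passed as `boxes` alongside the matching list of words.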
Refer to the [demo notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP) for fine-tuning/inference.
### BibTeX entry and citation info
```bibtex
@misc{tang2023unifying,
title={Unifying Vision, Text, and Layout for Universal Document Processing},
author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
year={2023},
eprint={2212.02623},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```