---
license: mit
tags:
- vision
inference: false
---

# UDOP model

The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.

## Model description

UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks like document image classification, document parsing and document visual question answering.

## Intended uses & limitations

You can use the model for document image classification, document parsing and document visual question answering (DocVQA).

### How to use

Here's how to use the model for document visual question answering:

```python
from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset

# load model and processor
# OCR has already been performed on the example below,
# so we instantiate the processor with apply_ocr=False
processor = AutoProcessor.from_pretrained("microsoft/udop-large-512", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large-512")

# load an example image, along with the words and bounding boxes
# extracted by an OCR engine
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
question = "Question answering. What is the date on the form?"

# prepare everything for the model
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")

# autoregressive generation
predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
# 9/30/92
```

As an alternative to supplying your own OCR results, the processor can run OCR itself; see the sketch at the end of this card.

### BibTeX entry and citation info

```bibtex
@misc{tang2023unifying,
      title={Unifying Vision, Text, and Layout for Universal Document Processing},
      author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
      year={2023},
      eprint={2212.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
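
### Running the processor's built-in OCR

If you don't have words and bounding boxes at hand, the processor can extract them itself when left at its default `apply_ocr=True` (this relies on the Tesseract engine via `pytesseract`). A minimal sketch, assuming the UDOP processor follows the same call convention as the LayoutLMv3 processor, with the question passed as the text and the OCR'd words filled in automatically:

```python
from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset

# default apply_ocr=True: the processor runs Tesseract on the image itself,
# so no words/boxes need to be passed in (requires pytesseract to be installed)
processor = AutoProcessor.from_pretrained("microsoft/udop-large-512")
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large-512")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
image = dataset[0]["image"]

# the task prompt/question goes in as the text; the OCR'd words form the second sequence
encoding = processor(image, "Question answering. What is the date on the form?", return_tensors="pt")

predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```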