NER in other types of documents

#1 by marcelovera - opened

Is it possible to do NER on other types of documents, such as legal texts, and in languages such as Spanish? If so, how should it be trained?

Definitely! As long as you have an (image, text) pair dataset, you can fine-tune Donut to generate whatever you like.

However, if you're planning to fine-tune on a language other than English, Chinese, Korean, or Japanese, it makes sense to first pre-train the model on that language, since donut-base is only pre-trained on those four. The authors create synthetic data for pre-training, as explained here: https://github.com/clovaai/donut#synthdog-datasets
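To make that concrete, here is a minimal sketch of a single fine-tuning step, assuming you already have (image, ground-truth sequence) pairs. The entity schema (<s_party>, <s_date>), the <s_legal-ner> task token, and the blank placeholder image are illustrative assumptions, not part of any official training script:

import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Register task-specific tokens; the tag names below are an assumed schema
processor.tokenizer.add_tokens(["<s_legal-ner>", "<s_party>", "</s_party>", "<s_date>", "</s_date>"])
model.decoder.resize_token_embeddings(len(processor.tokenizer))
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids("<s_legal-ner>")

# One (image, target sequence) training pair; the blank image is a stand-in
# for a real document scan
image = Image.new("RGB", (1920, 2560), "white")
target = "<s_legal-ner><s_party>ACME S.A.</s_party><s_date>2023-01-01</s_date></s>"

pixel_values = processor(image, return_tensors="pt").pixel_values
labels = processor.tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids
labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss

# Standard teacher-forced step: the model learns to generate the target
# sequence directly from the image, no OCR involved
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()

The same pattern scales to a full training loop with an optimizer and a DataLoader over your dataset.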

I am using "naver-clova-ix/donut-base-finetuned-docvqa" model and want to print the full content of the result json after it reads the image without invoking any prompts or user input. I just want it to parse the image and give me the full json content. How can I achieve that, please help. I am using below code:

import re
import gradio as gr

import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def process_document(image, question):
    # prepare encoder inputs
    pixel_values = processor(image, return_tensors="pt").pixel_values
    print(pixel_values)

    # prepare decoder inputs
    task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
    prompt = task_prompt.replace("{user_input}", question)
    decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids
    print(decoder_input_ids)

    # generate answer
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # postprocess
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token

    json_content = processor.token2json(sequence)
    print(json_content)  # print the full JSON content

    return json_content
description = "Gradio Demo for Donut, an instance of VisionEncoderDecoderModel fine-tuned on DocVQA (document visual question answering). To use it, simply upload your image and type a question and click 'submit', or click one of the examples to load them. Read more at the links below."
article = "

Donut: OCR-free Document Understanding Transformer | Github Repo

"
demo = gr.Interface(
    fn=process_document,
    inputs=["image", "text"],
    outputs="json",
    title="Demo: Donut 🍩 for DocVQA",
    description=description,
    article=article,
    enable_queue=True,
    examples=[["example_1.png", "When is the coffee break?"], ["example_2.jpeg", "What's the population of Stoddard?"]],
    cache_examples=False,
)

demo.launch()
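The DocVQA checkpoint is trained to answer questions, so it always expects a prompt. If the goal is the full parsed content of the image with no user input, a document-parsing checkpoint such as naver-clova-ix/donut-base-finetuned-cord-v2 is a better fit: its task prompt is a fixed token rather than a question. A rough sketch (the image path is a placeholder):

import re
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("example_1.png")  # placeholder: any document image

pixel_values = processor(image, return_tensors="pt").pixel_values
# Fixed task prompt: no question needed, the model generates the whole parse
decoder_input_ids = processor.tokenizer("<s_cord-v2>", add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
print(processor.token2json(sequence))  # full JSON content of the parsed document

Note that this checkpoint was fine-tuned on receipts (CORD), so the JSON schema it produces reflects that training data; for other document types you would fine-tune on your own (image, text) pairs as described above.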
