
Model Card for Donut Model

This model card provides details about the Donut model fine-tuned for document question answering (DocQA) on a synthetically generated dataset.

Model Details

Model Description

The Donut model is a document question answering model fine-tuned to answer questions about tax forms, specifically 1099-DIV, 1099-INT, W2, and W3 forms. It was trained on a synthetically generated dataset to achieve high accuracy in identifying and extracting information from these forms.

Developed by: CALM.ai
Model type: Question Answering (QA)
Language(s) (NLP): English
License: Apache-2.0
Fine-tuned from model: naver-clova-ix/donut-base


Model Sources

  • Repository: naver-clova-ix/donut-base
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

Direct Use

The model can be directly used for querying tax forms and extracting information from them. Users can then interact with the extracted fields through the Llama 3 LLM, which provides a better understanding of the forms and supports simple mathematical operations on some fields, as in the sketch below.
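A minimal sketch of this interaction (the field names, the prompt, and the Llama 3 checkpoint id below are illustrative assumptions, not part of this model's pipeline):

from transformers import pipeline

# Stand-in for the JSON produced by process_document() later in this card;
# the field names here are hypothetical.
parsed_form = {"form_type": "1099-INT", "interest_income": "1250.00", "federal_tax_withheld": "125.00"}

# Assumes access to a Llama 3 instruct checkpoint; any chat-capable LLM works.
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

prompt = (
    "You are a tax-form assistant. Given these extracted fields:\n"
    f"{parsed_form}\n"
    "What is the interest income net of federal tax withheld?"
)
print(llm(prompt, max_new_tokens=64)[0]["generated_text"])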

General Purpose Use

The model can also be used as a general-purpose document question answering system. It can parse various types of documents, such as textbooks, magazines, articles, and technical papers, providing users with relevant information and insights.

Downstream Use

The model can be further fine-tuned for specific use cases or integrated into larger document processing systems. It can also classify uploaded documents into form documents (1099-DIV, 1099-INT, W2, W3) and non-form documents, as in the sketch below.
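One way this classification could be wired up, assuming the model's parsed JSON exposes the form type under a key such as form_type (an assumption; inspect the real output of process_document, defined later in this card, for the actual schema):

FORM_CLASSES = {"1099-DIV", "1099-INT", "W2", "W3"}

def classify_document(image) -> str:
    """Classify a document as a known form type or 'non-form'."""
    parsed = process_document(image)  # defined in the inference section below
    form_type = parsed.get("form_type", "")  # assumed key; check the real schema
    return form_type if form_type in FORM_CLASSES else "non-form"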

Out-of-Scope Use

The model is optimized for the tax forms listed above; accuracy on other document types is not guaranteed, and it may not perform well on handwritten or poorly scanned forms.

Bias, Risks, and Limitations

The model may exhibit biases based on the synthetic nature of the dataset and may not generalize well to real-world scenarios. It may also struggle with handwritten or poorly scanned forms.

How to Get Started with the Model

To get started with the model, you can use the following code:

Installing required libraries

!pip install -q transformers datasets

Loading the Dataset

from datasets import load_dataset


dataset = load_dataset("calm-ai/Multiple_financial_forms", split="test", use_auth_token=True)  # requires logging in to the Hugging Face Hub (e.g. huggingface-cli login)
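To verify the download, you can inspect the split; the exact column names depend on the dataset schema, but image is the column used for inference later in this card:

# Number of examples and available columns in the test split.
print(len(dataset))
print(dataset.column_names)

# Each record contains a PIL image of a form.
sample = dataset[0]
print(type(sample["image"]))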

Loading the Model


from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("calm-ai/donut-base-finetuned-forms-v1")
model = VisionEncoderDecoderModel.from_pretrained("calm-ai/donut-base-finetuned-forms-v1")

Use the model for inference

import re
import torch

# Run on GPU when available; the same device is used for the model and its inputs.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def process_document(image):
    # prepare encoder inputs
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # prepare decoder inputs
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

    # generate answer
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # postprocess: strip special tokens, drop the task start token, and parse to JSON
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token

    return processor.token2json(sequence)

# You can change the index between 0 and 99 to parse other test documents.
image = dataset[20]["image"]
print(process_document(image))

Training Details

Training Data

The model was trained on a synthetically generated dataset of 4,000 tax forms (1099-DIV, 1099-INT, W2, W3), with field values populated using the Faker library.
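As an illustration of how such records could be synthesized (the field names and value ranges below are assumptions, not the actual generation script):

from faker import Faker

fake = Faker()

def fake_1099_int_record() -> dict:
    """One synthetic 1099-INT record; the fields shown are illustrative."""
    return {
        "payer_name": fake.company(),
        "payer_tin": fake.ssn(),
        "recipient_name": fake.name(),
        "recipient_address": fake.address().replace("\n", ", "),
        "interest_income": f"{fake.pyfloat(min_value=10, max_value=99999, right_digits=2):.2f}",
    }

records = [fake_1099_int_record() for _ in range(5)]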

Training Procedure

Preprocessing

The forms were preprocessed to extract text and annotation information for training.

Training Hyperparameters

  • Training regime: fine-tuning of the Donut base model
  • Optimizer: Adam
  • Learning rate: 5e-5
  • Batch size: 8
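
A minimal PyTorch sketch of one training step under these hyperparameters (the dataloader and label construction are placeholders; this is not the actual fine-tuning script):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # learning rate from above

model.train()
for epoch in range(3):  # 3 epochs, per "Speeds, Sizes, Time" below
    for batch in train_dataloader:  # placeholder: batches of size 8
        pixel_values = batch["pixel_values"].to(device)
        labels = batch["labels"].to(device)  # tokenized target JSON sequences
        loss = model(pixel_values=pixel_values, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()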

Speeds, Sizes, Time

  • Training duration: 3 epochs
  • Speed: 6 s

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on a separate set of tax forms not seen during training.

Factors

The evaluation was disaggregated by form type (1099-DIV, 1099-INT, W2, W3).

Metrics

Val_edit_distance: 0.0434

Edit distance measures the similarity between two strings as the minimum number of operations (insertions, deletions, and substitutions) required to transform one string into the other. In the context of document parsing and generation, edit distance measures the accuracy of the generated output against the ground truth.

Here is why validation edit distance is a suitable metric for this purpose:

Quantifies Accuracy: Edit distance provides a quantitative measure of how similar the generated JSON output is to the ground truth. A lower edit distance indicates a higher degree of accuracy in the generated output.

Handles Variability: Edit distance is robust to variations in the generated output that may still be considered correct. For example, minor differences in formatting or word choice may result in a small edit distance but still be acceptable.

Easy Interpretation: The edit distance value is easy to interpret, with smaller values indicating higher similarity between the generated and ground truth outputs.
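For concreteness, a self-contained sketch of (normalized) Levenshtein edit distance; whether the reported 0.0434 uses exactly this normalization (distance divided by the longer string's length) is an assumption:

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means an exact match."""
    return edit_distance(pred, truth) / max(len(pred), len(truth), 1)

# e.g. comparing a generated JSON string against its ground truth:
print(normalized_edit_distance('{"wages": "52,000"}', '{"wages": "52,100"}'))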

Results

Accuracy: 97%

Summary

To the best of our knowledge, our fine-tuned Donut model is the only open-source model capable of extracting information from tax forms such as 1099-DIV, 1099-INT, W2, and W3, achieving an accuracy of 97%.

Technical Specifications

Compute Infrastructure

GPU memory: 4 GB (minimum)
System RAM: 8 GB (minimum)

Model Card Authors

Abhishek A, Chandan V K, Likhith V, Monish M

Model Card Contact
