---
license: mit
base_model: naver-clova-ix/donut-base
library_name: transformers
tags:
  - donut
  - parser
  - irs
  - tax
  - document AI
  - '1040'
---

# Donut - fine-tuned for US IRS Form 1040 (2023) data parsing and extraction

This Donut model has been fine-tuned to parse and extract data from the US IRS tax form 1040 (year 2023). It performs OCR on the form image and returns the extracted data in JSON format using a zero-shot task prompt.

## Model Details & Description

The base model is `naver-clova-ix/donut-base`, which has been fine-tuned for data parsing and extraction from this form. The `added_tokens.json` file lists all the labels that can be extracted.
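
If you want to see those labels without opening the file, you can also read them from the tokenizer. The snippet below is a minimal sketch that lists the added tokens; the exact token names are whatever was registered during fine-tuning.

```python
from transformers import DonutProcessor

# Load the processor for this model (repository id from the usage example below).
processor = DonutProcessor.from_pretrained("hsarfraz/irs-tax-form-1040-2023-doc-parser")

# get_added_vocab() returns the tokens added on top of the base vocabulary,
# which should include the field labels that appear in the parsed JSON output.
added_tokens = processor.tokenizer.get_added_vocab()
for token in sorted(added_tokens):
    print(token)
```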

For inference, resize the input image to 1536 px (width) by 1536 px (height), matching the image size used for fine-tuning.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch
import re

model_name = 'hsarfraz/irs-tax-form-1040-2023-doc-parser'

processor = DonutProcessor.from_pretrained(model_name)
model = VisionEncoderDecoderModel.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

image_name = 'replace with the path to your Form 1040 (2023) image file'

img = Image.open(image_name)
new_width = 1536
new_height = 1536

# resize the input image to the size used during fine-tuning
img = img.resize((new_width, new_height), Image.LANCZOS)

pixel_values = processor(img.convert("RGB"), return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

# task prompt
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]
decoder_input_ids = decoder_input_ids.to(device)

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=1,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove the first task start token
output_json = processor.token2json(sequence)

print('----------------------------------')
print('--- Parsed data in json format ---')
print('----------------------------------')
print(output_json)
```
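
`token2json` returns a plain Python dict, so the parsed result can be written straight to disk if you want to keep it; the file name below is only an example.

```python
import json

# Save the parsed fields; 'form_1040_parsed.json' is an arbitrary example file name.
with open("form_1040_parsed.json", "w") as f:
    json.dump(output_json, f, indent=4)
```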

## FAKE Synthetic Form 1040 (2023) for illustration purposes only

*(Image of a FAKE 1040 form, shown for illustration purposes.)*

## Example of JSON output (based on the FAKE 1040 form)

```json
{
    "lbl_0_03": "Michael Evans",
    "lbl_0_04": "Caldwell",
    "lbl_0_05": "741-52-5353",
    "lbl_0_06": "None",
    "lbl_0_07": "None",
    "lbl_0_08": "None",
    "lbl_0_09": "289 Blackwell Land Suite 380 New Tiffany, NH 07548",
    "lbl_0_11": "East Amandaport",
    "lbl_0_12": "VI",
    "lbl_0_13": "47832",
    "lbl_0_14": "None",
    "lbl_0_15": "None",
    "lbl_0_16": "25677",
    "lbl_0_55": "385321.36",
    "lbl_0_56": "None",
    "lbl_0_57": "None",
    "lbl_0_58": "None",
    "lbl_0_59": "None",
    "lbl_0_60": "None",
    "lbl_0_61": "None",
    "lbl_0_62": "None",
    "lbl_0_63": "None",
    "lbl_0_67": "None",
    "lbl_0_68": "481161.23",
    "lbl_0_69": "None",
    "lbl_0_70": "None",
    "lbl_0_71": "None",
    "lbl_0_72": "749100.68",
    "lbl_0_73": "418381-6",
    "lbl_0_74": "None",
    "lbl_0_77": "755042.64",
    "lbl_0_78": "None",
    "lbl_0_79": "560928.32",
    "lbl_0_80": "493913.73",
    "lbl_0_81": "None",
    "lbl_0_82": "738597.72",
    "lbl_0_83": "34990.46"
}
```
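
Fields that are empty on the form come back as the string `"None"`. If you only care about populated fields, a small filter over the dict shown above is enough; this is just a convenience sketch, not part of the model.

```python
# Keep only fields that were actually populated on the form;
# empty fields are returned as the string "None" by this model.
populated = {key: value for key, value in output_json.items() if value != "None"}
print(populated)
```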