---
license: mit
base_model: naver-clova-ix/donut-base
library_name: transformers
tags:
- donut
- parser
- irs
- tax
- document AI
- '1040'
---
# Donut - fine-tuned for US IRS Form 1040 (2023) data parsing and extraction
This Donut model has been fine-tuned to parse and extract data from the US IRS tax Form 1040 (tax year 2023). It performs OCR and returns the extracted data in JSON format using a zero-shot task prompt.
## Model Details & Description
The base model is `naver-clova-ix/donut-base`, fine-tuned for data parsing and extraction. The `added_tokens.json` file lists all the labels that can be extracted.

For inference, resize input images to 1536 × 1536 px (width × height), matching the size used during fine-tuning.
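To see exactly which field labels the decoder can emit, you can inspect the tokenizer's added vocabulary. A minimal sketch using the standard Transformers `get_added_vocab()` helper (the tag names such as `<s_lbl_0_03>` are assumed from the example output further below):

```python
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("hsarfraz/irs-tax-form-1040-2023-doc-parser")

# get_added_vocab() returns the tokens added on top of the base vocabulary;
# for this model these should include the extractable field label tags
added_tokens = processor.tokenizer.get_added_vocab()
print(sorted(added_tokens.keys()))
```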
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch
import re

model_name = 'hsarfraz/irs-tax-form-1040-2023-doc-parser'
processor = DonutProcessor.from_pretrained(model_name)
model = VisionEncoderDecoderModel.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

image_name = 'replace with name of the form 1040 (2023) image file'
img = Image.open(image_name)

# resize the input image to the size used during fine-tuning (1536 x 1536 px)
new_width = 1536
new_height = 1536
img = img.resize((new_width, new_height), Image.LANCZOS)

pixel_values = processor(img.convert("RGB"), return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

# task prompt used during fine-tuning
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
)["input_ids"]
decoder_input_ids = decoder_input_ids.to(device)

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=1,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
    # output_scores=True,
)

# decode the generated token ids and strip special tokens
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove the first task start token

# convert the tagged sequence into a Python dict
output_json = processor.token2json(sequence)

print('----------------------------------')
print('--- Parsed data in json format ---')
print('----------------------------------')
print(output_json)
```
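The `token2json` helper converts Donut's XML-style tag sequence (e.g. `<s_lbl_0_03>Michael Evans</s_lbl_0_03>`) into a Python dict keyed by the label names, which is the format shown in the example below.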
## Fake synthetic Form 1040 (2023), for illustration purposes only

Example of JSON output (based on the fake 1040 form):
```json
{
  "lbl_0_03": "Michael Evans",
  "lbl_0_04": "Caldwell",
  "lbl_0_05": "741-52-5353",
  "lbl_0_06": "None",
  "lbl_0_07": "None",
  "lbl_0_08": "None",
  "lbl_0_09": "289 Blackwell Land Suite 380 New Tiffany, NH 07548",
  "lbl_0_11": "East Amandaport",
  "lbl_0_12": "VI",
  "lbl_0_13": "47832",
  "lbl_0_14": "None",
  "lbl_0_15": "None",
  "lbl_0_16": "25677",
  "lbl_0_55": "385321.36",
  "lbl_0_56": "None",
  "lbl_0_57": "None",
  "lbl_0_58": "None",
  "lbl_0_59": "None",
  "lbl_0_60": "None",
  "lbl_0_61": "None",
  "lbl_0_62": "None",
  "lbl_0_63": "None",
  "lbl_0_67": "None",
  "lbl_0_68": "481161.23",
  "lbl_0_69": "None",
  "lbl_0_70": "None",
  "lbl_0_71": "None",
  "lbl_0_72": "749100.68",
  "lbl_0_73": "418381-6",
  "lbl_0_74": "None",
  "lbl_0_77": "755042.64",
  "lbl_0_78": "None",
  "lbl_0_79": "560928.32",
  "lbl_0_80": "493913.73",
  "lbl_0_81": "None",
  "lbl_0_82": "738597.72",
  "lbl_0_83": "34990.46"
}
```
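Note that blank lines on the form come back as the literal string `"None"`. A minimal post-processing sketch (assuming you only want the populated fields) is:

```python
# keep only the fields the model actually populated;
# "None" here is the literal string emitted for blank form lines
populated = {k: v for k, v in output_json.items() if v != "None"}
print(populated)
```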