qwen-receipt-extractor

LoRA fine-tuned Qwen2.5-0.5B-Instruct for structured JSON extraction from noisy OCR receipts and invoices.

GitHub: avatar63/llm-doc-extract


Model description

Fine-tuned on ~2040 noisy OCR receipt/invoice examples using LoRA (rank 16). Takes raw, garbled OCR text and extracts structured JSON with company name, address, date, total amount, and line items.

Trained to run entirely locally, with no API calls at inference time. Documents never leave your machine.
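
The output shape is not spelled out beyond the field list, so as a rough sketch, a consumer might check a parsed result along these lines. The key names follow the extraction prompt in the Usage section; the helper `missing_fields` and the example record are hypothetical and not part of this repository.

```python
# Hypothetical helper: checks a parsed extraction against the field names
# used in the extraction prompt. Not part of the released code.
TOP_LEVEL_FIELDS = ("company_name", "address", "date", "total_amount", "line_items")
LINE_ITEM_FIELDS = ("item_name", "quantity", "price")


def missing_fields(record: dict) -> list:
    """Return dotted paths of expected keys that are absent from `record`.

    A key whose value is None counts as present, because the prompt asks the
    model to use null for fields it cannot determine.
    """
    missing = [name for name in TOP_LEVEL_FIELDS if name not in record]
    items = record.get("line_items")
    if not isinstance(items, list):
        items = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            # A non-object entry cannot carry any of the expected keys.
            missing.append(f"line_items[{index}]")
            continue
        missing.extend(
            f"line_items[{index}].{name}"
            for name in LINE_ITEM_FIELDS
            if name not in item
        )
    return missing


# Made-up record for illustration only; the value types are an assumption.
example = {
    "company_name": "RELIANCE FRESH",
    "address": None,
    "date": "05-11-2024",
    "total_amount": "340.00",
    "line_items": [{"item_name": "Rice", "quantity": 1}],
}
print(missing_fields(example))
```

An empty list means every expected key is present, though not necessarily correct.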


Performance

Evaluated on 204 held-out examples against the base Qwen2.5-0.5B-Instruct model:

| Field | Baseline | Fine-tuned | Δ |
|---|---|---|---|
| JSON valid | 80.9% | 99.5% | +18.6% |
| Company name | 38.2% | 46.6% | +8.3% |
| Date | 15.2% | 83.8% | +68.6% |
| Total amount | 0.0% | 99.0% | +99.0% |
| Line items | 58.5% | 97.0% | +38.5% |
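
The card does not say how the per-field scores were computed. One plausible reading is a JSON validity rate plus exact-match accuracy per field, and the sketch below shows that reading only. `json_valid_rate` and `field_accuracy` are hypothetical names; the actual evaluation may normalize values or score line items differently.

```python
import json


def json_valid_rate(raw_outputs: list) -> float:
    """Fraction of raw model outputs (strings) that parse as JSON."""
    valid = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(raw_outputs)


def field_accuracy(predictions: list, references: list, field: str) -> float:
    """Exact-match accuracy for one field (an assumed metric, see above).

    `predictions` and `references` are parallel lists of dicts; an output
    that failed to parse can be represented by an empty dict.
    """
    hits = sum(
        pred.get(field) == ref.get(field)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Toy data for illustration only, not taken from the evaluation set.
preds = [{"date": "05-11-2024"}, {}]
refs = [{"date": "05-11-2024"}, {"date": "01-02-2024"}]
print(field_accuracy(preds, refs, "date"))
```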

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
import json

BASE_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
ADAPTER = "avatar63/qwen-receipt-extractor"

INSTRUCTION = (
    "Extract the following fields from the OCR text as JSON: "
    "company_name, address, date, total_amount, line_items "
    "(each with item_name, quantity, price). "
    "Use null for any field that cannot be determined."
)

tokenizer = AutoTokenizer.from_pretrained(ADAPTER, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

noisy_text = """
RELI4NCE FR3SH
Sh0p N0 12, 5ect0r 18
D4te: O5-ll-2O24
Net P4y4ble: 34O.OO
"""

messages = [
    {"role": "system", "content": INSTRUCTION},
    {"role": "user", "content": noisy_text}
]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

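# Keep only the newly generated tokens (drop the prompt).
generated = outputs[0][inputs["input_ids"].shape[1]:]
result = tokenizer.decode(generated, skip_special_tokens=True)

# The output is usually valid JSON, but parsing can still fail on some inputs.
try:
    print(json.loads(result))
except json.JSONDecodeError:
    print("Model output was not valid JSON:", result)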
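
Because a small share of generations are not valid JSON, callers may want a more forgiving parser than a bare `json.loads`. A minimal sketch, assuming the model sometimes surrounds the object with extra text (not verified for this model); `parse_model_json` is a hypothetical helper.

```python
import json
from typing import Optional


def parse_model_json(raw: str) -> Optional[dict]:
    """Try to recover a JSON object from raw model output.

    First attempts a direct parse, then falls back to the substring between
    the first '{' and the last '}'. Returns None if neither yields a dict.
    """
    candidates = [raw.strip()]
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start : end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` rather than raising lets a batch job log the failure and move on to the next document.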

Demo Space: https://huggingface.co/spaces/avatar63/receipt-extractor-demo

Training details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Method | LoRA via Hugging Face PEFT + TRL |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Trainable parameters | 8.8M / 502M (1.75%) |
| Training examples | ~2040 |
| Epochs | 3 |
| Learning rate | 2e-4 |
| Hardware | RTX 3060 12GB |
| Training time | ~28 minutes |
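
As a sanity check on the trainable-parameter figure, the count can be reproduced by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `rank * (in_features + out_features)` parameters. The sketch assumes adapters on all seven linear projections in each transformer block and uses layer dimensions from the public Qwen2.5-0.5B config. The card does not list the target modules, so that choice is an assumption.

```python
# Assumed dimensions for Qwen2.5-0.5B (from its public config, not this card).
HIDDEN = 896      # hidden_size
KV_DIM = 128      # num_key_value_heads (2) * head_dim (64)
MLP = 4864        # intermediate_size
NUM_LAYERS = 24   # num_hidden_layers
RANK = 16         # LoRA rank reported in the training details

# (in_features, out_features) per adapted projection. Adapting all seven
# projections is an assumption; the card does not state the target modules.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one adapter: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total comes to 8,798,208, which is consistent with the reported 8.8M.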

Limitations

  • Partial character-level denoising: item names and company suffixes may retain some OCR noise
  • Address hallucination on sparse/ambiguous inputs
  • Net payable vs subtotal ambiguity on some receipts
  • Trained primarily on Malaysian and synthetic English receipts
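
One way to catch the net payable vs subtotal ambiguity downstream is to compare the extracted total with the sum of the line items and flag mismatches for review. A rough sketch, assuming `quantity` and `price` can be coerced to numbers and that `price` is a unit price (the card specifies neither); `total_matches_items` is a hypothetical helper.

```python
from typing import Optional


def _to_float(value) -> Optional[float]:
    """Best-effort numeric coercion; returns None when the value is unusable."""
    try:
        return float(str(value).replace(",", ""))
    except (TypeError, ValueError):
        return None


def total_matches_items(record: dict, tolerance: float = 0.01) -> Optional[bool]:
    """Compare total_amount with sum(quantity * price) over line_items.

    Returns None when the check cannot be run (missing or non-numeric values).
    """
    total = _to_float(record.get("total_amount"))
    items = record.get("line_items")
    if total is None or not isinstance(items, list) or not items:
        return None
    computed = 0.0
    for item in items:
        if not isinstance(item, dict):
            return None
        quantity = _to_float(item.get("quantity"))
        price = _to_float(item.get("price"))
        if quantity is None or price is None:
            return None
        computed += quantity * price
    return abs(total - computed) <= tolerance
```

A `False` result does not prove the extraction is wrong, since taxes, discounts, or rounding can also explain a gap; it only marks the receipt as worth a second look.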
