🏦 Multi-Format Finance Document Parser

A production-grade financial document parser fine-tuned from Qwen2.5-7B-Instruct with QLoRA (4-bit NF4 quantization). Given raw text from a financial document, it outputs structured JSON that is ready for downstream processing, ERP integration, or analytics pipelines.


🚀 Live Demo

👉 Try it on HuggingFace Spaces


📄 Supported Document Types

| Format | Examples |
|---|---|
| Invoice | Vendor invoices, GST bills, service bills |
| SAP Report | ALV exports, FI vendor payment reports |
| Income Statement | P&L statements, quarterly earnings |
| Balance Sheet | Assets, liabilities, equity statements |
| Bank Statement | Transaction records, account summaries |
| Purchase Order | PO documents, procurement records |
| SQL Result | Query outputs from finance databases |
| CSV / Excel | Tabular finance data |

🧠 Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Model size | ~7.6B parameters (displayed as 8B on the Hub) |
| Fine-tuning method | QLoRA (PEFT) |
| Quantization | 4-bit NF4 + double quantization |
| Compute dtype | bfloat16 |
| LoRA rank | r=8, alpha=16 |
| Max sequence length | 512 tokens |
| Training hardware | L40S 48GB GPU (Lightning AI) |
| Training time | ~1 hour |
| License | Apache 2.0 |

📊 Training Data

| Dataset | Samples | Type |
|---|---|---|
| CORD-v2 | 454 | Real receipt images + structured JSON |
| Synthetic invoices | 300 | Generated with realistic Indian/global vendors |
| Synthetic SAP reports | 100 | ALV-style pipe-delimited exports |
| Synthetic income statements | 100 | P&L with revenue, COGS, EBIT, net income |
| Total | 954 | Train: 812 · Eval: 95 · Test: 47 |
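
The split sizes are consistent with holding out 10% of the samples for evaluation and 5% for testing (both rounded down) and training on the remainder. The sketch below reproduces those counts. The fractions, the seed, and the `split_dataset` helper are assumptions for illustration, not the author's actual preprocessing.

```python
import random

def split_dataset(samples, eval_frac=0.10, test_frac=0.05, seed=42):
    """Shuffle and split into (train, eval, test). Fractions and seed are assumed."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_eval = int(len(items) * eval_frac)  # rounds down
    n_test = int(len(items) * test_frac)  # rounds down
    eval_set = items[:n_eval]
    test_set = items[n_eval:n_eval + n_test]
    train_set = items[n_eval + n_test:]   # everything left over
    return train_set, eval_set, test_set

# 454 CORD-v2 + 300 invoices + 100 SAP reports + 100 income statements
total = 454 + 300 + 100 + 100
train, eval_, test = split_dataset(range(total))
print(total, len(train), len(eval_), len(test))  # 954 812 95 47
```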

⚙️ Quantization Techniques

| Technique | Purpose |
|---|---|
| NF4 4-bit quantization | Stores weights in 4-bit NormalFloat format, roughly a 4x size reduction versus 16-bit weights |
| Double quantization | Quantizes the quantization constants, saving a further ~0.4 bits per parameter |
| bfloat16 compute | Weights are dequantized to bfloat16 for computation while storage stays 4-bit |
| LoRA adapters (r=8) | Only ~0.5% of parameters are trained; the remaining ~99.5% stay frozen |
| Paged AdamW 8-bit | Reduces optimizer state memory |
| Gradient checkpointing | Cuts activation memory by ~40% at the cost of recomputation |
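
A back-of-the-envelope check of the storage figures in this table. The block sizes (64 weights per quantization block, 256 constants per second-level block) and the 7.6e9 parameter count are assumptions based on the defaults described for QLoRA and the base model's published size. They are not measurements of this checkpoint.

```python
PARAMS = 7.6e9  # assumed parameter count of the base model
BLOCK = 64      # assumed number of weights sharing one scaling constant
BLOCK2 = 256    # assumed number of constants per second-level block

def bits_per_param(double_quant: bool) -> float:
    """4-bit weights plus the per-parameter overhead of the scaling constants."""
    if double_quant:
        # 8-bit constant per block, plus one fp32 constant per BLOCK2 blocks
        overhead = 8 / BLOCK + 32 / (BLOCK * BLOCK2)
    else:
        # one fp32 constant per block
        overhead = 32 / BLOCK
    return 4 + overhead

def size_gb(bits: float) -> float:
    """Total storage in GB (1e9 bytes) at the given bits per parameter."""
    return PARAMS * bits / 8 / 1e9

plain = bits_per_param(False)
double = bits_per_param(True)
saving = plain - double

print(f"16-bit weights:    {size_gb(16):.1f} GB")
print(f"NF4:               {size_gb(plain):.1f} GB")
print(f"NF4 + double quant: {size_gb(double):.1f} GB")
print(f"double-quant saving: {saving:.3f} bits/param")  # 0.373
```

Under these assumptions the saving from double quantization is about 0.37 bits per parameter, in line with the "~0.4" quoted in the table.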

📤 Output Schema

{
  "document_type": "invoice|balance_sheet|income_stmt|sap_report|sql_result|bank_statement|purchase_order",
  "vendor": "string or null",
  "client": "string or null",
  "date": "YYYY-MM-DD or null",
  "due_date": "YYYY-MM-DD or null",
  "document_id": "string or null",
  "currency": "USD|EUR|INR|GBP|...",
  "subtotal": "float or null",
  "tax_amount": "float or null",
  "tax_rate_pct": "float or null",
  "total_amount": "float or null",
  "line_items": [
    {
      "description": "string",
      "quantity": "float or null",
      "unit_price": "float or null",
      "amount": "float"
    }
  ],
  "payment_terms": "string or null",
  "notes": "string or null",
  "metadata": {}
}
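
Model output is not guaranteed to follow this schema, so it can help to validate it before passing it downstream. The following is a minimal sketch using only the standard library. The `validate_output` helper and its specific checks are illustrative assumptions rather than part of the model's tooling.

```python
import re

REQUIRED_KEYS = {
    "document_type", "vendor", "client", "date", "due_date", "document_id",
    "currency", "subtotal", "tax_amount", "tax_rate_pct", "total_amount",
    "line_items", "payment_terms", "notes", "metadata",
}
DOCUMENT_TYPES = {
    "invoice", "balance_sheet", "income_stmt", "sap_report",
    "sql_result", "bank_statement", "purchase_order",
}
NUMERIC_FIELDS = ("subtotal", "tax_amount", "tax_rate_pct", "total_amount")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_output(doc: dict) -> list:
    """Return a list of schema problems; an empty list means no problems found."""
    errors = []
    missing = REQUIRED_KEYS - set(doc)
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if doc.get("document_type") not in DOCUMENT_TYPES:
        errors.append(f"unexpected document_type: {doc.get('document_type')!r}")
    for field in ("date", "due_date"):
        value = doc.get(field)
        if value is not None and not (isinstance(value, str) and DATE_RE.match(value)):
            errors.append(f"{field} is not YYYY-MM-DD: {value!r}")
    for field in NUMERIC_FIELDS:
        value = doc.get(field)
        if value is not None and not _is_number(value):
            errors.append(f"{field} is not numeric: {value!r}")
    items = doc.get("line_items")
    if not isinstance(items, list):
        errors.append("line_items is not a list")
    else:
        for i, item in enumerate(items):
            if not isinstance(item, dict) or not _is_number(item.get("amount")):
                errors.append(f"line_items[{i}].amount is missing or not numeric")
    return errors

sample = {
    "document_type": "invoice", "vendor": "Tata Consultancy Services Ltd.",
    "client": None, "date": "2024-11-15", "due_date": None,
    "document_id": "TCS-2024-8821", "currency": "INR",
    "subtotal": 42500.0, "tax_amount": 7650.0, "tax_rate_pct": 18.0,
    "total_amount": 50150.0,
    "line_items": [{"description": "Cloud Infrastructure Management",
                    "quantity": 1, "unit_price": 42500.0, "amount": 42500.0}],
    "payment_terms": "Net 30", "notes": None, "metadata": {},
}
print(validate_output(sample))  # []
```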

💻 Usage

Via HuggingFace Inference API

import requests
import json
import re

API_URL = "https://api-inference.huggingface.co/models/ratulsur/multi-format-finance-parser"
HF_TOKEN = "hf_xxxxxxxxxxxx"

SYSTEM_PROMPT = """You are a production financial document parser.
Given raw text from any financial document, output ONLY a single valid JSON object.
Schema: {document_type, vendor, client, date (YYYY-MM-DD), due_date, document_id,
currency, subtotal, tax_amount, tax_rate_pct, total_amount,
line_items:[{description,quantity,unit_price,amount}], payment_terms, notes, metadata}.
All monetary values must be floats. Unknown fields → null. No explanation."""

def parse_document(text: str) -> dict:
    prompt = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nParse this financial document:\n\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    headers = {"Authorization": f"Bearer {HF_TOKEN}"}
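    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 512,
            # Greedy decoding; temperature has no effect when do_sample is False
            "do_sample": False,
            "return_full_text": False,
        }
    }
    resp = requests.post(API_URL, headers=headers, json=payload, timeout=120)
    resp.raise_for_status()  # surface HTTP errors (bad token, model still loading) early
    raw = resp.json()[0]["generated_text"].strip()
    # Strip Markdown code fences the model may wrap around the JSON
    raw = re.sub(r"```json\s*|```\s*", "", raw).strip()
    return json.loads(raw)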

# Example
invoice = """
INVOICE
Vendor: Tata Consultancy Services Ltd.
Invoice No: TCS-2024-8821
Date: 2024-11-15
Service: Cloud Infrastructure Management   INR 42,500.00
GST @ 18%:                                 INR  7,650.00
TOTAL DUE:                                 INR 50,150.00
Payment Terms: Net 30
"""

result = parse_document(invoice)
print(json.dumps(result, indent=2))
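
The regex in `parse_document` only strips Markdown fences, so `json.loads` still fails if the model emits any text before or after the object. A more forgiving variant, sketched below, scans for the first `{` that begins a decodable object using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_first_json` is hypothetical.

```python
import json

def extract_first_json(text: str) -> dict:
    """Return the first JSON object embedded in `text`.

    Raises ValueError if no decodable object is found.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This brace did not start a valid object; try the next one
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")

fence = "`" * 3  # built indirectly so the example contains no literal fence
raw = (
    f"Here is the result:\n{fence}json\n"
    '{"document_type": "invoice", "total_amount": 50150.0}'
    f"\n{fence}\nDone."
)
print(extract_first_json(raw))
```

One caveat of this approach: if the outermost object is malformed but a nested one is valid, the nested object is returned, so it is worth pairing with a schema check.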

Expected output

{
  "document_type": "invoice",
  "vendor": "Tata Consultancy Services Ltd.",
  "client": null,
  "date": "2024-11-15",
  "due_date": null,
  "document_id": "TCS-2024-8821",
  "currency": "INR",
  "subtotal": 42500.0,
  "tax_amount": 7650.0,
  "tax_rate_pct": 18.0,
  "total_amount": 50150.0,
  "line_items": [
    {
      "description": "Cloud Infrastructure Management",
      "quantity": 1,
      "unit_price": 42500.0,
      "amount": 42500.0
    }
  ],
  "payment_terms": "Net 30",
  "notes": null,
  "metadata": {}
}
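
Because several fields are related arithmetically, parsed output can be sanity-checked before it reaches an ERP or analytics pipeline. The sketch below applies three such checks to the values from the expected output. The `check_totals` helper and the 0.01 tolerance are assumptions, and real documents (discounts, shipping, multiple tax rates) may legitimately fail them.

```python
import math

def check_totals(doc: dict, tol: float = 0.01) -> dict:
    """Compare reported totals with values recomputed from the other fields."""
    results = {}
    subtotal = doc.get("subtotal")
    tax = doc.get("tax_amount")
    total = doc.get("total_amount")
    rate = doc.get("tax_rate_pct")
    items = doc.get("line_items") or []
    if items and subtotal is not None:
        results["line_items_sum_to_subtotal"] = math.isclose(
            sum(item["amount"] for item in items), subtotal, abs_tol=tol)
    if subtotal is not None and tax is not None and total is not None:
        results["subtotal_plus_tax_is_total"] = math.isclose(
            subtotal + tax, total, abs_tol=tol)
    if subtotal is not None and tax is not None and rate is not None:
        results["tax_matches_rate"] = math.isclose(
            subtotal * rate / 100, tax, abs_tol=tol)
    return results

expected = {
    "subtotal": 42500.0, "tax_amount": 7650.0, "tax_rate_pct": 18.0,
    "total_amount": 50150.0,
    "line_items": [{"description": "Cloud Infrastructure Management",
                    "quantity": 1, "unit_price": 42500.0, "amount": 42500.0}],
}
for name, ok in check_totals(expected).items():
    print(f"{name}: {ok}")
```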

Load locally with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ratulsur/multi-format-finance-parser",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "ratulsur/multi-format-finance-parser",
    trust_remote_code=True,
)
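
Once the model and tokenizer are loaded, generation might look like the following. This is a sketch, not the author's inference code: `build_messages` and `parse_locally` are hypothetical helpers, and the code assumes the tokenizer ships a chat template (the repository listing includes `chat_template.jinja`). Pass the same `SYSTEM_PROMPT` used in the Inference API example.

```python
import json

def build_messages(system_prompt: str, text: str) -> list:
    """Chat messages mirroring the prompt used in the Inference API example."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Parse this financial document:\n\n{text}"},
    ]

def parse_locally(model, tokenizer, system_prompt: str, text: str,
                  max_new_tokens: int = 512) -> dict:
    """Run greedy generation and parse the completion as JSON."""
    prompt = tokenizer.apply_chat_template(
        build_messages(system_prompt, text),
        tokenize=False,
        add_generation_prompt=True,  # append the assistant turn header
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, dropping the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    completion = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return json.loads(completion)

# Usage, assuming `model`, `tokenizer`, SYSTEM_PROMPT and `invoice` from above:
# result = parse_locally(model, tokenizer, SYSTEM_PROMPT, invoice)
```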

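🏗️ Training Setup

import torch
from peft import LoraConfig, TaskType
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# QLoRA config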
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Training args (effective batch size = 1 x 8 = 8 sequences per optimizer step)
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
    max_length=512,
)
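
The trainable-parameter share implied by this LoRA configuration can be estimated by hand: a rank-r adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below does that arithmetic with layer dimensions assumed from the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128) and an assumed total of 7.6e9 base parameters. Check both against `config.json` before relying on the result.

```python
R = 8                                            # LoRA rank from lora_config
HIDDEN, INTERMEDIATE, LAYERS = 3584, 18944, 28   # assumed Qwen2.5-7B dimensions
KV_DIM = 4 * 128                                 # assumed: 4 KV heads x head_dim 128
BASE_PARAMS = 7.6e9                              # assumed base parameter count

# (d_in, d_out) for every module listed in target_modules
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(rank: int, shapes: dict, layers: int) -> int:
    """Adapter weights: rank * (d_in + d_out) per module, summed over all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

trainable = lora_params(R, SHAPES, LAYERS)
share = 100 * trainable / BASE_PARAMS
print(f"{trainable:,} trainable parameters")  # 20,185,088
print(f"{share:.2f}% of the base model")      # 0.27%
```

Under these assumptions the adapters add roughly 20 million weights, somewhat below the rounded 0.5% quoted in the techniques table. Calling `print_trainable_parameters()` on the PEFT-wrapped model reports the exact figure.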

📁 Repository Structure

ratulsur/multi-format-finance-parser/
├── model.safetensors      # Merged model weights (15.2 GB)
├── config.json            # Model configuration
├── tokenizer.json         # Tokenizer
├── tokenizer_config.json  # Tokenizer configuration
├── chat_template.jinja    # Chat template
└── generation_config.json # Generation configuration

⚠️ Limitations

  • Trained primarily on English financial documents
  • Best performance on structured text (not handwritten documents)
  • OCR quality affects accuracy for scanned documents
  • SAP reports tested on ALV-style exports only
  • Trained on only 954 samples; production use should involve more data

🔗 Links


👤 Author

Ratul Sur


If you find this model useful, please give it a ⭐ like on HuggingFace!
