🏦 Multi-Format Finance Document Parser

A production-grade financial document parser fine-tuned from Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization). Given raw text from a supported financial document, it outputs structured JSON that is ready for downstream processing, ERP integration, or analytics pipelines.

🚀 Live Demo

👉 Try it on HuggingFace Spaces

📄 Supported Document Types

Format	Examples
Invoice	Vendor invoices, GST bills, service bills
SAP Report	ALV exports, FI vendor payment reports
Income Statement	P&L statements, quarterly earnings
Balance Sheet	Assets, liabilities, equity statements
Bank Statement	Transaction records, account summaries
Purchase Order	PO documents, procurement records
SQL Result	Query outputs from finance databases
CSV / Excel	Tabular finance data

🧠 Model Details

Property	Value
Base model	Qwen/Qwen2.5-7B-Instruct
Model size	~7.6B parameters (shown as 8B on the Hub)
Fine-tuning method	QLoRA (PEFT)
Quantization	4-bit NF4 + double quantization
Compute dtype	bfloat16
LoRA rank	r=8, alpha=16
Max sequence length	512 tokens
Training hardware	L40S 48GB GPU (Lightning AI)
Training time	~1 hour
License	Apache 2.0
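
To make the LoRA settings concrete, the sketch below estimates how many adapter parameters r=8 adds when all seven projection matrices of each layer are targeted. The layer dimensions are assumptions taken from the public Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128). They are not stated in this card, and the helper name is illustrative.

```python
# Assumed Qwen2.5-7B dimensions (from the public base-model config, not this card).
HIDDEN = 3584
MLP = 18944
KV_DIM = 4 * 128   # 4 key/value heads x head dimension 128
LAYERS = 28
RANK = 8
ALPHA = 16

# (input_dim, output_dim) of each linear layer that receives an adapter.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * LAYERS

print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters in total:  {total:,}")
print(f"LoRA scaling (alpha / r):     {ALPHA / RANK}")
```

Under these assumptions the adapters hold about 20 million parameters, a fraction of a percent of the base model. The exact percentage a training framework reports can depend on how it counts the quantized base weights.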

📊 Training Data

Dataset	Samples	Type
CORD-v2	454	Real receipt images + structured JSON
Synthetic invoices	300	Generated with realistic Indian/global vendors
Synthetic SAP reports	100	ALV-style pipe-delimited exports
Synthetic income statements	100	P&L with revenue, COGS, EBIT, net income
Total	954	Train: 812 · Eval: 95 · Test: 47
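
The split sizes in the table can be produced with a seeded shuffle followed by slicing. This is a minimal sketch, assuming the 954 samples sit in a single list. The seed, the function name, and the placeholder records are illustrative, and the card does not say how the actual split was drawn.

```python
import random

def split_dataset(samples, n_train=812, n_eval=95, n_test=47, seed=42):
    """Shuffle once with a fixed seed, then slice into train/eval/test."""
    if n_train + n_eval + n_test != len(samples):
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG keeps the split reproducible
    train = shuffled[:n_train]
    eval_ = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return train, eval_, test

# Placeholder records standing in for the 954 real samples.
samples = [{"id": i} for i in range(954)]
train, eval_, test = split_dataset(samples)
print(len(train), len(eval_), len(test))
```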

⚙️ Quantization Techniques

Technique	Purpose
NF4 4-bit quantization	Stores weights in 4-bit NormalFloat format, about 4x smaller than 16-bit weights
Double quantization	Quantizes the quantization constants, saving roughly 0.4 bits per parameter
bfloat16 compute	Weights stay in 4-bit storage and are dequantized to bfloat16 for computation
LoRA adapters (r=8)	Only ~0.5% of parameters are trained; the rest stay frozen
Paged AdamW 8-bit	Keeps optimizer state in 8-bit and pages it to CPU memory under GPU memory pressure
Gradient checkpointing	Recomputes activations during the backward pass, cutting activation memory by roughly 40%
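
The storage figures in the table follow from simple arithmetic. The sketch below assumes the block sizes described in the QLoRA paper: 64 weights per 32-bit scaling constant, with double quantization storing those constants in 8 bits plus one 32-bit constant per 256 blocks. It also assumes a base model of roughly 7.6 billion parameters. None of these numbers is stated in this card.

```python
PARAMS = 7.6e9   # assumed parameter count of the base model
BLOCK = 64       # weights per first-level quantization block (assumed)
BLOCK2 = 256     # first-level constants per second-level block (assumed)

# Overhead of the scaling constants, in bits per weight.
plain_overhead = 32 / BLOCK                           # one fp32 constant per block
double_overhead = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # 8-bit constants + fp32 second level
saving = plain_overhead - double_overhead

def gigabytes(bits_per_param: float) -> float:
    """Weight storage in GB if every parameter used this many bits."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"double quantization saves {saving:.3f} bits/param")
print(f"16-bit weights:     {gigabytes(16):.1f} GB")
print(f"NF4:                {gigabytes(4 + plain_overhead):.1f} GB")
print(f"NF4 + double quant: {gigabytes(4 + double_overhead):.1f} GB")
```

Under these assumptions the 16-bit figure comes to about 15.2 GB, which lines up with the merged weight file listed in the repository structure. The double-quantization saving is about 0.37 bits per parameter. Real footprints are somewhat higher, because layers that are not quantized, such as embeddings, stay in 16-bit.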

📤 Output Schema

{
  "document_type": "invoice|balance_sheet|income_stmt|sap_report|sql_result|bank_statement|purchase_order",
  "vendor": "string or null",
  "client": "string or null",
  "date": "YYYY-MM-DD or null",
  "due_date": "YYYY-MM-DD or null",
  "document_id": "string or null",
  "currency": "USD|EUR|INR|GBP|...",
  "subtotal": "float or null",
  "tax_amount": "float or null",
  "tax_rate_pct": "float or null",
  "total_amount": "float or null",
  "line_items": [
    {
      "description": "string",
      "quantity": "float or null",
      "unit_price": "float or null",
      "amount": "float"
    }
  ],
  "payment_terms": "string or null",
  "notes": "string or null",
  "metadata": {}
}
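
Because the model's output is free-form text until it is parsed, it can help to check each result against the schema before passing it downstream. The following is a minimal sketch of such a check using only the standard library. The function name `validate_parsed` and the choice of which fields to enforce are assumptions, not part of the released code.

```python
from datetime import datetime

DOCUMENT_TYPES = {
    "invoice", "balance_sheet", "income_stmt", "sap_report",
    "sql_result", "bank_statement", "purchase_order",
}
DATE_FIELDS = ("date", "due_date")
NUMBER_FIELDS = ("subtotal", "tax_amount", "tax_rate_pct", "total_amount")

def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_parsed(doc: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the document passed."""
    problems = []
    if doc.get("document_type") not in DOCUMENT_TYPES:
        problems.append(f"unknown document_type: {doc.get('document_type')!r}")
    for field in DATE_FIELDS:
        value = doc.get(field)
        if value is None:
            continue
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"{field} is not YYYY-MM-DD: {value!r}")
    for field in NUMBER_FIELDS:
        value = doc.get(field)
        if value is not None and not _is_number(value):
            problems.append(f"{field} must be a number or null: {value!r}")
    line_items = doc.get("line_items")
    if not isinstance(line_items, list):
        problems.append("line_items must be a list")
        line_items = []
    for i, item in enumerate(line_items):
        if not isinstance(item, dict):
            problems.append(f"line_items[{i}] must be an object")
            continue
        if not isinstance(item.get("description"), str):
            problems.append(f"line_items[{i}].description must be a string")
        if not _is_number(item.get("amount")):
            problems.append(f"line_items[{i}].amount must be a number")
    return problems

# Example: a wrongly formatted date and a total left as a string are both reported.
bad = {"document_type": "invoice", "date": "15/11/2024",
       "total_amount": "50,150.00", "line_items": []}
for problem in validate_parsed(bad):
    print(problem)
```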

💻 Usage

Via HuggingFace Inference API

import requests
import json
import re

API_URL = "https://api-inference.huggingface.co/models/ratulsur/multi-format-finance-parser"
HF_TOKEN = "hf_xxxxxxxxxxxx"

SYSTEM_PROMPT = """You are a production financial document parser.
Given raw text from any financial document, output ONLY a single valid JSON object.
Schema: {document_type, vendor, client, date (YYYY-MM-DD), due_date, document_id,
currency, subtotal, tax_amount, tax_rate_pct, total_amount,
line_items:[{description,quantity,unit_price,amount}], payment_terms, notes, metadata}.
All monetary values must be floats. Unknown fields → null. No explanation."""

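def parse_document(text: str) -> dict:
    """Send raw document text to the hosted model and return the parsed JSON."""
    # Qwen chat format: system prompt, user turn, then an open assistant turn.
    prompt = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nParse this financial document:\n\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    headers = {"Authorization": f"Bearer {HF_TOKEN}"}
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 512,
            "return_full_text": False,
            # Greedy decoding; a temperature setting has no effect when do_sample is False.
            "do_sample": False,
        },
    }
    resp = requests.post(API_URL, headers=headers, json=payload, timeout=120)
    resp.raise_for_status()  # surface HTTP errors instead of failing on the JSON lookup
    raw = resp.json()[0]["generated_text"].strip()
    # Strip Markdown code fences the model may wrap around the JSON.
    raw = re.sub(r"```json\s*|```\s*", "", raw).strip()
    return json.loads(raw)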

# Example
invoice = """
INVOICE
Vendor: Tata Consultancy Services Ltd.
Invoice No: TCS-2024-8821
Date: 2024-11-15
Service: Cloud Infrastructure Management   INR 42,500.00
GST @ 18%:                                 INR  7,650.00
TOTAL DUE:                                 INR 50,150.00
Payment Terms: Net 30
"""

result = parse_document(invoice)
print(json.dumps(result, indent=2))

Expected output

{
  "document_type": "invoice",
  "vendor": "Tata Consultancy Services Ltd.",
  "client": null,
  "date": "2024-11-15",
  "due_date": null,
  "document_id": "TCS-2024-8821",
  "currency": "INR",
  "subtotal": 42500.0,
  "tax_amount": 7650.0,
  "tax_rate_pct": 18.0,
  "total_amount": 50150.0,
  "line_items": [
    {
      "description": "Cloud Infrastructure Management",
      "quantity": 1,
      "unit_price": 42500.0,
      "amount": 42500.0
    }
  ],
  "payment_terms": "Net 30",
  "notes": null,
  "metadata": {}
}
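
The figures in the expected output are internally consistent, which suggests a cheap sanity check on any parsed invoice: the subtotal plus tax should equal the total, the tax rate applied to the subtotal should give the tax amount, and the line items should add up to the subtotal. The sketch below assumes a tolerance of 0.01 currency units. The function name and the tolerance are illustrative.

```python
import math

def check_totals(doc: dict, tol: float = 0.01) -> list[str]:
    """Flag arithmetic inconsistencies between line items, subtotal, tax, and total."""
    issues = []
    subtotal = doc.get("subtotal")
    tax = doc.get("tax_amount")
    rate = doc.get("tax_rate_pct")
    total = doc.get("total_amount")

    # subtotal + tax should match the total.
    if None not in (subtotal, tax, total):
        if not math.isclose(subtotal + tax, total, abs_tol=tol):
            issues.append(f"subtotal + tax = {subtotal + tax}, but total_amount = {total}")

    # The stated rate applied to the subtotal should match the tax amount.
    if None not in (subtotal, tax, rate):
        expected_tax = subtotal * rate / 100
        if not math.isclose(expected_tax, tax, abs_tol=tol):
            issues.append(f"{rate}% of subtotal = {expected_tax}, but tax_amount = {tax}")

    # Line item amounts should add up to the subtotal.
    amounts = [item["amount"] for item in doc.get("line_items", [])]
    if amounts and subtotal is not None:
        if not math.isclose(sum(amounts), subtotal, abs_tol=tol):
            issues.append(f"line items sum to {sum(amounts)}, but subtotal = {subtotal}")
    return issues

parsed = {
    "subtotal": 42500.0,
    "tax_amount": 7650.0,
    "tax_rate_pct": 18.0,
    "total_amount": 50150.0,
    "line_items": [{"description": "Cloud Infrastructure Management", "amount": 42500.0}],
}
print(check_totals(parsed))  # an empty list means no inconsistencies were found
```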

Load locally with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

MODEL_ID = "ratulsur/multi-format-finance-parser"

# 4-bit NF4 weights with double quantization; computation runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The Qwen2 architecture is supported natively by transformers,
# so trust_remote_code is not needed.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s) automatically
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
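
Once the model and tokenizer are loaded, a parse can be run locally along the following lines. This is a sketch rather than the author's inference code: `build_messages`, `extract_json`, and `parse_locally` are hypothetical helper names, and the tokenizer is assumed to ship the Qwen chat template (the repository lists a `chat_template.jinja`). The system prompt is passed in as an argument so the same prompt as in the API example can be reused.

```python
import json

def build_messages(text: str, system_prompt: str) -> list[dict]:
    """Chat messages in the same system/user layout as the API example."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Parse this financial document:\n\n{text}"},
    ]

def extract_json(raw: str) -> dict:
    """Decode the first JSON object in the output, ignoring fences or trailing text."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    return obj

def parse_locally(model, tokenizer, text: str, system_prompt: str,
                  max_new_tokens: int = 512) -> dict:
    """Greedy-decode a response with an already loaded model and tokenizer."""
    prompt = tokenizer.apply_chat_template(
        build_messages(text, system_prompt),
        tokenize=False,
        add_generation_prompt=True,  # append the open assistant turn
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return extract_json(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

With the objects from the loading snippet, `parse_locally(model, tokenizer, invoice, SYSTEM_PROMPT)` would return a dictionary in the output schema. Using `raw_decode` instead of stripping fences with a regular expression also tolerates stray text after the closing brace.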

🏗️ Training Setup

import torch
from peft import LoraConfig, TaskType
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# QLoRA config: 4-bit NF4 base weights, bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on every attention and MLP projection
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Training args
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8 per GPU
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
    max_length=512,
)
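
These settings imply a small effective batch and a short schedule. The arithmetic below is a rough sketch that assumes a single GPU, the 812-sample training split from the data table, and a cosine schedule with no warmup (no warmup setting is listed above). The exact step count depends on how the trainer handles the final partial accumulation window.

```python
import math

train_samples = 812      # training split size from the data table
per_device_batch = 1
grad_accum_steps = 8
num_gpus = 1             # assumption: a single GPU
epochs = 3
peak_lr = 2e-4

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

def cosine_lr(step: int, total: int = total_steps, peak: float = peak_lr) -> float:
    """Half-cosine decay from the peak learning rate to zero, without warmup."""
    return 0.5 * peak * (1 + math.cos(math.pi * step / total))

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch (approx.): {steps_per_epoch}")
print(f"total optimizer steps (approx.): {total_steps}")
print(f"learning rate at the midpoint: {cosine_lr(total_steps // 2):.2e}")
```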

📁 Repository Structure

ratulsur/multi-format-finance-parser/
├── model.safetensors      # Merged model weights (15.2 GB)
├── config.json            # Model configuration
├── tokenizer.json         # Tokenizer
├── tokenizer_config.json  # Tokenizer configuration
├── chat_template.jinja    # Chat template
└── generation_config.json # Generation configuration

⚠️ Limitations

Trained primarily on English financial documents
Best performance on structured text (not handwritten documents)
OCR quality affects accuracy for scanned documents
SAP reports tested on ALV-style exports only
954 training samples — production use should involve more data

🔗 Links

Live Demo: HuggingFace Spaces
Base Model: Qwen2.5-7B-Instruct
Training Dataset: CORD-v2

👤 Author

Ratul Sur

HuggingFace: ratulsur

If you find this model useful, please give it a ⭐ like on HuggingFace!
