Multi-Format Finance Document Parser
A production-grade financial document parser fine-tuned from Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization). Given raw text from any financial document, it outputs structured JSON, ready for downstream processing, ERP integration, or analytics pipelines.
Live Demo
Try it on HuggingFace Spaces
Supported Document Types
| Format | Examples |
|---|---|
| Invoice | Vendor invoices, GST bills, service bills |
| SAP Report | ALV exports, FI vendor payment reports |
| Income Statement | P&L statements, quarterly earnings |
| Balance Sheet | Assets, liabilities, equity statements |
| Bank Statement | Transaction records, account summaries |
| Purchase Order | PO documents, procurement records |
| SQL Result | Query outputs from finance databases |
| CSV / Excel | Tabular finance data |
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Model size | ~7.6B parameters |
| Fine-tuning method | QLoRA (PEFT) |
| Quantization | 4-bit NF4 + double quantization |
| Compute dtype | bfloat16 |
| LoRA rank | r=8, alpha=16 |
| Max sequence length | 512 tokens |
| Training hardware | L40S 48GB GPU (Lightning AI) |
| Training time | ~1 hour |
| License | Apache 2.0 |
Training Data
| Dataset | Samples | Type |
|---|---|---|
| CORD-v2 | 454 | Real receipt images + structured JSON |
| Synthetic invoices | 300 | Generated with realistic Indian/global vendors |
| Synthetic SAP reports | 100 | ALV-style pipe-delimited exports |
| Synthetic income statements | 100 | P&L with revenue, COGS, EBIT, net income |
| Total | 954 | Train: 812 · Eval: 95 · Test: 47 |
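The split sizes in the last row are consistent with a roughly 85/10/5 partition of the 954 samples. The splitting script itself is not shown in this card, so the following is only a sketch: the ratios, the seed, and the `split_dataset` helper are assumptions made for illustration.

```python
import random

def split_dataset(samples, eval_frac=0.10, test_frac=0.05, seed=42):
    """Shuffle and split samples into train/eval/test lists (hypothetical helper)."""
    items = list(samples)
    random.Random(seed).shuffle(items)      # seeded for reproducibility
    n_eval = int(len(items) * eval_frac)    # 954 * 0.10 -> 95
    n_test = int(len(items) * test_frac)    # 954 * 0.05 -> 47
    eval_set = items[:n_eval]
    test_set = items[n_eval:n_eval + n_test]
    train_set = items[n_eval + n_test:]     # remainder -> 812
    return train_set, eval_set, test_set

# Integers stand in for the 954 real samples.
train, eval_, test = split_dataset(range(954))
print(len(train), len(eval_), len(test))  # 812 95 47
```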
Quantization Techniques
| Technique | Purpose |
|---|---|
| NF4 4-bit quantization | Stores weights in 4-bit NormalFloat format, roughly a 4x size reduction versus 16-bit weights |
| Double quantization | Quantizes the quantization constants themselves, saving roughly a further 0.4 bits/param |
| bfloat16 compute | Weights are dequantized to bfloat16 for matrix multiplications; storage stays 4-bit |
| LoRA adapters (r=8) | Only 0.5% of parameters trained; the remaining 99.5% stay frozen |
| Paged AdamW 8-bit | Optimizer state memory reduction |
| Gradient checkpointing | ~40% activation memory reduction |
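The storage figures in the table can be checked with a little arithmetic. The sketch below assumes the block sizes described for QLoRA (64 weights per first-level block, 256 constants per second-level block, 32-bit and 8-bit constants); these values are assumptions here, not settings read from this model's training run.

```python
def bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """Approximate storage cost per weight for 4-bit block-wise quantization."""
    weight_bits = 4.0
    if not double_quant:
        # One 32-bit scaling constant per block of weights.
        overhead = 32 / block_size
    else:
        # One 8-bit constant per block, plus one 32-bit constant per
        # group of dq_block_size first-level constants.
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    return weight_bits + overhead

plain = bits_per_param(double_quant=False)   # 4.5 bits/param
double = bits_per_param(double_quant=True)   # about 4.127 bits/param

print(f"saving from double quantization: {plain - double:.3f} bits/param")
print(f"size reduction vs 16-bit weights: {16 / double:.2f}x")
```

Under these assumptions the saving comes out slightly below the ~0.4 bits/param quoted in the table, and the overall reduction lands a little under 4x. Layers that are not quantized (for example embeddings and norms) push the real footprint somewhat higher.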
Output Schema
{
  "document_type": "invoice|balance_sheet|income_stmt|sap_report|sql_result|bank_statement|purchase_order",
  "vendor": "string or null",
  "client": "string or null",
  "date": "YYYY-MM-DD or null",
  "due_date": "YYYY-MM-DD or null",
  "document_id": "string or null",
  "currency": "USD|EUR|INR|GBP|...",
  "subtotal": "float or null",
  "tax_amount": "float or null",
  "tax_rate_pct": "float or null",
  "total_amount": "float or null",
  "line_items": [
    {
      "description": "string",
      "quantity": "float or null",
      "unit_price": "float or null",
      "amount": "float"
    }
  ],
  "payment_terms": "string or null",
  "notes": "string or null",
  "metadata": {}
}
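Because the model's output is free-form text, it can help to check a parsed result against the schema before passing it downstream. The following is a minimal sketch, not part of the released model: `validate_parsed` and its error messages are hypothetical, and it checks only the presence of keys and basic numeric types.

```python
from numbers import Real

REQUIRED_KEYS = {
    "document_type", "vendor", "client", "date", "due_date", "document_id",
    "currency", "subtotal", "tax_amount", "tax_rate_pct", "total_amount",
    "line_items", "payment_terms", "notes", "metadata",
}
NUMERIC_KEYS = ("subtotal", "tax_amount", "tax_rate_pct", "total_amount")

def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)

def validate_parsed(doc):
    """Return a list of problems; an empty list means the dict looks well-formed."""
    if not isinstance(doc, dict):
        return ["not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - set(doc)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in NUMERIC_KEYS:
        value = doc.get(key)
        if value is not None and not _is_number(value):
            problems.append(f"{key} should be a number or null")
    line_items = doc.get("line_items")
    if not isinstance(line_items, list):
        problems.append("line_items should be a list")
    else:
        for i, item in enumerate(line_items):
            if not isinstance(item, dict) or not _is_number(item.get("amount")):
                problems.append(f"line_items[{i}].amount should be a number")
    return problems

sample = {key: None for key in REQUIRED_KEYS}
sample.update(document_type="invoice", currency="INR", total_amount=50150.0,
              line_items=[{"description": "Service", "amount": 50150.0}],
              metadata={})
print(validate_parsed(sample))  # []
```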
Usage
Via HuggingFace Inference API
import json
import re

import requests

API_URL = "https://api-inference.huggingface.co/models/ratulsur/multi-format-finance-parser"
HF_TOKEN = "hf_xxxxxxxxxxxx"  # replace with your own access token

SYSTEM_PROMPT = """You are a production financial document parser.
Given raw text from any financial document, output ONLY a single valid JSON object.
Schema: {document_type, vendor, client, date (YYYY-MM-DD), due_date, document_id,
currency, subtotal, tax_amount, tax_rate_pct, total_amount,
line_items:[{description,quantity,unit_price,amount}], payment_terms, notes, metadata}.
All monetary values must be floats. Unknown fields → null. No explanation."""

def parse_document(text: str) -> dict:
    """Send raw document text to the hosted model and return the parsed JSON."""
    # Build the prompt in the ChatML format used by Qwen2.5
    prompt = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nParse this financial document:\n\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    headers = {"Authorization": f"Bearer {HF_TOKEN}"}
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 512,
            "return_full_text": False,
            "do_sample": False,  # greedy decoding, so no temperature is needed
        },
    }
    resp = requests.post(API_URL, headers=headers, json=payload, timeout=120)
    resp.raise_for_status()  # surface HTTP errors before reading the body
    raw = resp.json()[0]["generated_text"].strip()
    # Strip Markdown code fences the model may wrap around the JSON
    raw = re.sub(r"```json\s*|```\s*", "", raw).strip()
    return json.loads(raw)
# Example
invoice = """
INVOICE
Vendor: Tata Consultancy Services Ltd.
Invoice No: TCS-2024-8821
Date: 2024-11-15
Service: Cloud Infrastructure Management INR 42,500.00
GST @ 18%: INR 7,650.00
TOTAL DUE: INR 50,150.00
Payment Terms: Net 30
"""
result = parse_document(invoice)
print(json.dumps(result, indent=2))
Expected output
{
  "document_type": "invoice",
  "vendor": "Tata Consultancy Services Ltd.",
  "client": null,
  "date": "2024-11-15",
  "due_date": null,
  "document_id": "TCS-2024-8821",
  "currency": "INR",
  "subtotal": 42500.0,
  "tax_amount": 7650.0,
  "tax_rate_pct": 18.0,
  "total_amount": 50150.0,
  "line_items": [
    {
      "description": "Cloud Infrastructure Management",
      "quantity": 1,
      "unit_price": 42500.0,
      "amount": 42500.0
    }
  ],
  "payment_terms": "Net 30",
  "notes": null,
  "metadata": {}
}
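The amounts in this output are internally consistent: the subtotal plus tax equals the total, and the tax equals 18% of the subtotal. A check along these lines can flag extraction errors in other documents. This is a sketch only; `check_totals` and its tolerance are hypothetical and not part of the model.

```python
import math

def check_totals(doc, tol=0.01):
    """Return a list of arithmetic inconsistencies in a parsed document."""
    issues = []
    subtotal = doc.get("subtotal")
    tax = doc.get("tax_amount")
    total = doc.get("total_amount")
    rate = doc.get("tax_rate_pct")

    # subtotal + tax should equal the total when all three are present.
    if None not in (subtotal, tax, total):
        if not math.isclose(subtotal + tax, total, abs_tol=tol):
            issues.append("subtotal + tax_amount != total_amount")

    # The tax amount should match the stated percentage of the subtotal.
    if None not in (subtotal, tax, rate):
        if not math.isclose(subtotal * rate / 100, tax, abs_tol=tol):
            issues.append("tax_amount != subtotal * tax_rate_pct / 100")

    # Line item amounts should add up to the subtotal.
    items = doc.get("line_items") or []
    if items and subtotal is not None:
        if not math.isclose(sum(i["amount"] for i in items), subtotal, abs_tol=tol):
            issues.append("sum of line_items amounts != subtotal")
    return issues

example = {
    "subtotal": 42500.0,
    "tax_amount": 7650.0,
    "tax_rate_pct": 18.0,
    "total_amount": 50150.0,
    "line_items": [{"amount": 42500.0}],
}
print(check_totals(example))  # []
```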
Load locally with transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 weights with double quantization, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "ratulsur/multi-format-finance-parser",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "ratulsur/multi-format-finance-parser",
    trust_remote_code=True,
)
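Once `model` and `tokenizer` are loaded, generation could look roughly like the sketch below. It assumes the tokenizer ships the chat template listed in the repository. The helper names (`build_messages`, `extract_json`, `parse_locally`) and the shortened system prompt are illustrative assumptions; in practice the full `SYSTEM_PROMPT` from the API example is the safer choice.

```python
import json

# Abbreviated stand-in for the system prompt shown in the API example.
SYSTEM_PROMPT = (
    "You are a production financial document parser. "
    "Output ONLY a single valid JSON object."
)

def build_messages(text, system_prompt=SYSTEM_PROMPT):
    """Build the chat messages expected by the tokenizer's chat template."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Parse this financial document:\n\n{text}"},
    ]

def extract_json(raw):
    """Decode the first JSON object found in raw, ignoring surrounding text."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    obj, _end = json.JSONDecoder().raw_decode(raw, start)
    return obj

def parse_locally(model, tokenizer, text, max_new_tokens=512):
    """Run greedy generation on a loaded model and return the parsed dict."""
    prompt = tokenizer.apply_chat_template(
        build_messages(text), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return extract_json(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

With the objects loaded above, `parse_locally(model, tokenizer, invoice)` would play the same role as `parse_document(invoice)` in the API example.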
Training Setup
import torch
from peft import LoraConfig, TaskType
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# LoRA adapters on all attention and MLP projection layers
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# Training args
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
    max_length=512,
)
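To make these settings concrete, the effective batch size and the approximate number of optimizer steps follow from the values above and the 812 training samples. This is a back-of-the-envelope sketch that assumes a single GPU; the exact step count can differ slightly depending on how the trainer handles the last partial accumulation window.

```python
import math

train_samples = 812       # from the Training Data table
per_device_batch = 1      # per_device_train_batch_size
grad_accum_steps = 8      # gradient_accumulation_steps
epochs = 3                # num_train_epochs
num_gpus = 1              # assumption: a single L40S

# Samples contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Optimizer updates per pass over the training set, counting a final partial window.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 102 306
```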
Repository Structure
ratulsur/multi-format-finance-parser/
├── model.safetensors        # Merged model weights (15.2 GB)
├── config.json              # Model configuration
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer configuration
├── chat_template.jinja      # Chat template
└── generation_config.json   # Generation configuration
Limitations
- Trained primarily on English financial documents
- Best performance on structured text (not handwritten documents)
- OCR quality affects accuracy for scanned documents
- SAP reports tested on ALV-style exports only
- 954 samples in total (812 used for training); production use should involve more data
Links
- Live Demo: HuggingFace Spaces
- Base Model: Qwen2.5-7B-Instruct
- Training Dataset: CORD-v2
Author
Ratul Sur
- HuggingFace: ratulsur
If you find this model useful, please give it a like on HuggingFace!