
SPARKNET Document Intelligence

A vision-first agentic document understanding platform that goes beyond OCR, supports complex layouts, and produces LLM-ready, visually grounded outputs suitable for RAG and field extraction at scale.

Overview

The Document Intelligence subsystem provides:

  • Vision-First Understanding: Treats documents as visual objects, not just text
  • Semantic Chunking: Classifies regions by type (text, table, figure, chart, form, etc.)
  • Visual Grounding: Every extraction includes evidence (page, bbox, snippet, confidence)
  • Zero-Shot Capability: Works across diverse document formats without training
  • Schema-Driven Extraction: Define fields using JSON Schema or Pydantic models
  • Abstention Policy: Never guesses - abstains when confidence is low
  • Local-First: All processing happens locally for privacy

Quick Start

Basic Parsing

from src.document_intelligence import DocumentParser, ParserConfig

# Configure parser
config = ParserConfig(
    render_dpi=200,
    max_pages=10,
    include_markdown=True,
)

parser = DocumentParser(config=config)
result = parser.parse("document.pdf")

print(f"Parsed {len(result.chunks)} chunks from {result.num_pages} pages")

# Access chunks
for chunk in result.chunks:
    print(f"[Page {chunk.page}] {chunk.chunk_type.value}: {chunk.text[:100]}...")

Field Extraction

from src.document_intelligence import (
    FieldExtractor,
    ExtractionSchema,
    create_invoice_schema,
)

# Use preset schema
schema = create_invoice_schema()

# Or create custom schema
schema = ExtractionSchema(name="CustomSchema")
schema.add_string_field("company_name", "Name of the company", required=True)
schema.add_date_field("document_date", "Date on document")
schema.add_currency_field("total_amount", "Total amount")

# Extract fields (parse_result is the output of DocumentParser.parse)
extractor = FieldExtractor()
extraction = extractor.extract(parse_result, schema)

print("Extracted Data:")
for key, value in extraction.data.items():
    if key in extraction.abstained_fields:
        print(f"  {key}: [ABSTAINED]")
    else:
        print(f"  {key}: {value}")

print(f"Confidence: {extraction.overall_confidence:.2f}")

Visual Grounding

from src.document_intelligence import (
    load_document,
    RenderOptions,
)
from src.document_intelligence.grounding import (
    crop_region,
    create_annotated_image,
    EvidenceBuilder,
)

# Load and render page
loader, renderer = load_document("document.pdf")
page_image = renderer.render_page(1, RenderOptions(dpi=200))

# Create annotated visualization
bboxes = [chunk.bbox for chunk in result.chunks if chunk.page == 1]
labels = [chunk.chunk_type.value for chunk in result.chunks if chunk.page == 1]
annotated = create_annotated_image(page_image, bboxes, labels)

# Crop the region around a specific chunk
chunk = result.chunks[0]
crop = crop_region(page_image, chunk.bbox, padding_percent=0.02)
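
If create_annotated_image returns a standard PIL image (an assumption), the annotated page can be written to disk for review:

# Save the annotated page for manual inspection (assumes a PIL.Image return value)
annotated.save("page1_annotated.png")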

Question Answering

from src.document_intelligence.tools import get_tool

qa_tool = get_tool("answer_question")
result = qa_tool.execute(
    parse_result=parse_result,
    question="What is the total amount due?",
)

if result.success:
    print(f"Answer: {result.data['answer']}")
    print(f"Confidence: {result.data['confidence']:.2f}")

    for ev in result.evidence:
        print(f"  Evidence: Page {ev['page']}, {ev['snippet'][:50]}...")

Architecture

Module Structure

src/document_intelligence/
├── __init__.py           # Main exports
├── chunks/               # Core data models
│   ├── models.py         # BoundingBox, DocumentChunk, TableChunk, etc.
│   └── __init__.py
├── io/                   # Document loading
│   ├── base.py           # Abstract interfaces
│   ├── pdf.py            # PDF loading (PyMuPDF)
│   ├── image.py          # Image loading (PIL)
│   ├── cache.py          # Page caching
│   └── __init__.py
├── models/               # Model interfaces
│   ├── base.py           # BaseModel, BatchableModel
│   ├── ocr.py            # OCRModel interface
│   ├── layout.py         # LayoutModel interface
│   ├── table.py          # TableModel interface
│   ├── chart.py          # ChartModel interface
│   ├── vlm.py            # VisionLanguageModel interface
│   └── __init__.py
├── parsing/              # Document parsing
│   ├── parser.py         # DocumentParser orchestrator
│   ├── chunking.py       # Semantic chunking utilities
│   └── __init__.py
├── grounding/            # Visual evidence
│   ├── evidence.py       # EvidenceBuilder, EvidenceTracker
│   ├── crops.py          # Image cropping utilities
│   └── __init__.py
├── extraction/           # Field extraction
│   ├── schema.py         # ExtractionSchema, FieldSpec
│   ├── extractor.py      # FieldExtractor
│   ├── validator.py      # ExtractionValidator
│   └── __init__.py
├── tools/                # Agent tools
│   ├── document_tools.py # Tool implementations
│   └── __init__.py
├── validation/           # Result validation
│   └── __init__.py
└── agent_adapter.py      # Agent integration

Data Models

BoundingBox

Represents a rectangular region in XYXY format:

from src.document_intelligence.chunks import BoundingBox

# Normalized coordinates (0-1)
bbox = BoundingBox(
    x_min=0.1, y_min=0.2,
    x_max=0.9, y_max=0.3,
    normalized=True
)

# Convert to pixels
pixel_bbox = bbox.to_pixel(width=1000, height=800)

# Calculate IoU between two boxes (bbox1, bbox2 are BoundingBox instances)
overlap = bbox1.iou(bbox2)

# Check containment
is_inside = bbox.contains((0.5, 0.25))
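
To make the XYXY convention and the IoU value concrete, here is a minimal, library-independent sketch of the same computation over two normalized boxes; it is illustrative only and does not use the BoundingBox class:

def iou_xyxy(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max) tuples."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two overlapping normalized boxes -> IoU of 1/3 (0.08 intersection / 0.24 union)
print(iou_xyxy((0.1, 0.1, 0.5, 0.5), (0.3, 0.1, 0.7, 0.5)))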

DocumentChunk

Base semantic chunk:

from src.document_intelligence.chunks import DocumentChunk, ChunkType

chunk = DocumentChunk(
    chunk_id="abc123",
    doc_id="doc001",
    chunk_type=ChunkType.PARAGRAPH,
    text="Content...",
    page=1,
    bbox=bbox,
    confidence=0.95,
    sequence_index=0,
)

TableChunk

Table with cell structure:

from src.document_intelligence.chunks import TableChunk, TableCell

# Access cells (table is a TableChunk produced during parsing)
cell = table.get_cell(row=0, col=1)

# Export formats
csv_data = table.to_csv()
markdown = table.to_markdown()
json_data = table.to_structured_json()

EvidenceRef

Links extractions to visual sources:

from src.document_intelligence.chunks import EvidenceRef

evidence = EvidenceRef(
    chunk_id="chunk_001",
    doc_id="doc_001",
    page=1,
    bbox=bbox,
    source_type="text",
    snippet="The total is $500",
    confidence=0.9,
    cell_id=None,  # For table cells
    crop_path=None,  # Path to cropped image
)

CLI Commands

# Parse document
sparknet docint parse document.pdf -o result.json
sparknet docint parse document.pdf --format markdown

# Extract fields
sparknet docint extract invoice.pdf --preset invoice
sparknet docint extract doc.pdf -f vendor_name -f total_amount
sparknet docint extract doc.pdf --schema my_schema.json

# Ask questions
sparknet docint ask document.pdf "What is the contract value?"

# Classify document
sparknet docint classify document.pdf

# Search content
sparknet docint search document.pdf -q "payment terms"
sparknet docint search document.pdf --type table

# Visualize regions
sparknet docint visualize document.pdf --page 1 --annotate

Configuration

Parser Configuration

from src.document_intelligence import ParserConfig

config = ParserConfig(
    # Rendering
    render_dpi=200,          # DPI for page rasterization
    max_pages=None,          # Limit pages (None = all)

    # OCR
    ocr_enabled=True,
    ocr_languages=["en"],
    ocr_min_confidence=0.5,

    # Layout
    layout_enabled=True,
    reading_order_enabled=True,

    # Specialized extraction
    table_extraction_enabled=True,
    chart_extraction_enabled=True,

    # Chunking
    merge_adjacent_text=True,
    min_chunk_chars=10,
    max_chunk_chars=4000,

    # Output
    include_markdown=True,
    cache_enabled=True,
)

Extraction Configuration

from src.document_intelligence import ExtractionConfig

config = ExtractionConfig(
    # Confidence
    min_field_confidence=0.5,
    min_overall_confidence=0.5,

    # Abstention
    abstain_on_low_confidence=True,
    abstain_threshold=0.3,

    # Search
    search_all_chunks=True,
    prefer_structured_sources=True,

    # Validation
    validate_extracted_values=True,
    normalize_values=True,
)

Preset Schemas

Invoice

from src.document_intelligence import create_invoice_schema

schema = create_invoice_schema()
# Fields: invoice_number, invoice_date, due_date, vendor_name, vendor_address,
#         customer_name, customer_address, subtotal, tax_amount, total_amount,
#         currency, payment_terms

Receipt

from src.document_intelligence import create_receipt_schema

schema = create_receipt_schema()
# Fields: merchant_name, merchant_address, transaction_date, transaction_time,
#         subtotal, tax_amount, total_amount, payment_method, last_four_digits

Contract

from src.document_intelligence import create_contract_schema

schema = create_contract_schema()
# Fields: contract_title, effective_date, expiration_date, party_a_name,
#         party_b_name, contract_value, governing_law, termination_clause
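
Assuming the presets return a regular, mutable ExtractionSchema, they can be extended with the same add_*_field helpers shown under Field Extraction; the purchase_order_number field below is purely illustrative:

from src.document_intelligence import create_invoice_schema

# Start from the invoice preset and add a custom field
# (assumes presets return a mutable ExtractionSchema)
schema = create_invoice_schema()
schema.add_string_field("purchase_order_number", "Purchase order number referenced on the invoice")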

Agent Integration

from src.document_intelligence.agent_adapter import (
    DocumentIntelligenceAdapter,
    EnhancedDocumentAgent,
    AgentConfig,
)

# Create adapter
config = AgentConfig(
    render_dpi=200,
    min_confidence=0.5,
    max_iterations=10,
)

# With an existing LLM client (e.g. an Ollama client instance)
agent = EnhancedDocumentAgent(
    llm_client=ollama_client,
    config=config,
)

# Load document (agent methods are async and must run inside an event loop)
await agent.load_document("document.pdf")

# Extract with schema
result = await agent.extract_fields(schema)

# Answer questions
answer, evidence = await agent.answer_question("What is the total?")

# Classify
classification = await agent.classify()
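
Because the agent methods are coroutines, they need an event loop to run; a minimal sketch, assuming ollama_client is an already-configured LLM client and config is the AgentConfig from above:

import asyncio

async def main():
    agent = EnhancedDocumentAgent(llm_client=ollama_client, config=config)
    await agent.load_document("document.pdf")
    answer, evidence = await agent.answer_question("What is the total?")
    print(answer)

asyncio.run(main())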

Available Tools

  • parse_document: Parse document into semantic chunks
  • extract_fields: Schema-driven field extraction
  • search_chunks: Search document content
  • get_chunk_details: Get detailed chunk information
  • get_table_data: Extract structured table data
  • answer_question: Document Q&A
  • crop_region: Extract visual regions
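
Each tool is obtained through get_tool and invoked with execute, as in the Question Answering example. The sketch below assumes search_chunks takes a query argument mirroring the CLI's -q option; the parameter name and result layout are assumptions:

from src.document_intelligence.tools import get_tool

# Hypothetical parameter: "query" mirrors the CLI's -q/--query option
search_tool = get_tool("search_chunks")
result = search_tool.execute(
    parse_result=parse_result,
    query="payment terms",
)

if result.success:
    print(result.data)  # exact result structure depends on the tool implementation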

Best Practices

1. Always Check Confidence

if extraction.overall_confidence < 0.7:
    print("Low confidence - manual review recommended")

for field, value in extraction.data.items():
    if field in extraction.abstained_fields:
        print(f"{field}: Needs manual verification")

2. Use Evidence for Verification

for evidence in extraction.evidence:
    print(f"Found on page {evidence.page}")
    print(f"Location: {evidence.bbox.xyxy}")
    print(f"Source text: {evidence.snippet}")

3. Handle Abstention Gracefully

result = extractor.extract(parse_result, schema)

for field in schema.get_required_fields():
    if field.name in result.abstained_fields:
        # Request human review
        flag_for_review(field.name, parse_result.doc_id)

4. Validate Before Use

from src.document_intelligence import ExtractionValidator

validator = ExtractionValidator(min_confidence=0.7)
validation = validator.validate(result, schema)

if not validation.is_valid:
    for issue in validation.issues:
        print(f"[{issue.severity}] {issue.field_name}: {issue.message}")

Dependencies

  • pymupdf - PDF loading and rendering
  • pillow - Image processing
  • numpy - Array operations
  • pydantic - Data validation

Optional:

  • paddleocr - OCR engine
  • tesseract - Alternative OCR
  • chromadb - Vector storage for RAG

License

MIT License - see LICENSE file for details.