Spaces:
Sleeping
Sleeping
A newer version of the Gradio SDK is available:
6.1.0
metadata
title: PDF OCR Extractor
emoji: ππ
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
π Insurance Claim PDF Extractor + LLM Q&A
This project extracts key information from uploaded insurance claim PDF forms using PyMuPDF and pytesseract, then allows users to query the extracted data using OpenAI's GPT-3.5 or GPT-4 via natural language questions.
π Features
- β PDF text extraction with fallback OCR (no Poppler dependency!)
- β Configurable field extraction using regular expressions
- β Gradio-based UI for uploading, previewing data, and querying via LLM
- β
Uses latest OpenAI Python SDK (
>=1.x)
π§ How It Works
- Upload a scanned or digital insurance claim form in PDF format.
- The app extracts relevant fields like
policy_number,claimant_name,claim_amount, etc. - You can then ask questions about the extracted data using natural language (e.g., "What is the claim amount?").
- OpenAI GPT answers based on the extracted JSON context.
π οΈ Tech Stack
gradiofor UIPyMuPDF (fitz)for PDF parsingpytesseractfor OCRopenai>=1.3.9for language understanding- No use of
pdf2imageorpoppler
π Project Structure
βββ app.py # Gradio app logic βββ utils.py # PDF text + OCR extractor βββ extraction_service.py # Field extractor using regex config βββ fields_config.json # Custom field patterns βββ requirements.txt # Hugging Face Space dependencies βββ README.md