Spaces:

Gangadhar123
/

pdf_ocr_extraction_1

Sleeping

App Files Files Community

pdf_ocr_extraction_1 / README.md

Gangadhar123

Update README.md

15c2ec2 verified 6 months ago

preview code

raw

history blame contribute delete

1.58 kB

A newer version of the Gradio SDK is available: 6.1.0

Upgrade

metadata

title: PDF OCR Extractor
emoji: 📄🔍
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false

📄 Insurance Claim PDF Extractor + LLM Q&A

This project extracts key information from uploaded insurance claim PDF forms using PyMuPDF and pytesseract, then allows users to query the extracted data using OpenAI's GPT-3.5 or GPT-4 via natural language questions.

🚀 Features

✅ PDF text extraction with fallback OCR (no Poppler dependency!)
✅ Configurable field extraction using regular expressions
✅ Gradio-based UI for uploading, previewing data, and querying via LLM
✅ Uses latest OpenAI Python SDK (>=1.x)

🔧 How It Works

Upload a scanned or digital insurance claim form in PDF format.
The app extracts relevant fields like policy_number, claimant_name, claim_amount, etc.
You can then ask questions about the extracted data using natural language (e.g., "What is the claim amount?").
OpenAI GPT answers based on the extracted JSON context.

🛠️ Tech Stack

gradio for UI
PyMuPDF (fitz) for PDF parsing
pytesseract for OCR
openai>=1.3.9 for language understanding
No use of pdf2image or poppler

📁 Project Structure

├── app.py # Gradio app logic ├── utils.py # PDF text + OCR extractor ├── extraction_service.py # Field extractor using regex config ├── fields_config.json # Custom field patterns ├── requirements.txt # Hugging Face Space dependencies └── README.md