Gangadhar123's picture
Update README.md
15c2ec2 verified

A newer version of the Gradio SDK is available: 6.1.0

Upgrade
metadata
title: PDF OCR Extractor
emoji: πŸ“„πŸ”
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false

πŸ“„ Insurance Claim PDF Extractor + LLM Q&A

This project extracts key information from uploaded insurance claim PDF forms using PyMuPDF and pytesseract, then allows users to query the extracted data using OpenAI's GPT-3.5 or GPT-4 via natural language questions.


πŸš€ Features

  • βœ… PDF text extraction with fallback OCR (no Poppler dependency!)
  • βœ… Configurable field extraction using regular expressions
  • βœ… Gradio-based UI for uploading, previewing data, and querying via LLM
  • βœ… Uses latest OpenAI Python SDK (>=1.x)

πŸ”§ How It Works

  1. Upload a scanned or digital insurance claim form in PDF format.
  2. The app extracts relevant fields like policy_number, claimant_name, claim_amount, etc.
  3. You can then ask questions about the extracted data using natural language (e.g., "What is the claim amount?").
  4. OpenAI GPT answers based on the extracted JSON context.

πŸ› οΈ Tech Stack

  • gradio for UI
  • PyMuPDF (fitz) for PDF parsing
  • pytesseract for OCR
  • openai>=1.3.9 for language understanding
  • No use of pdf2image or poppler

πŸ“ Project Structure

β”œβ”€β”€ app.py # Gradio app logic β”œβ”€β”€ utils.py # PDF text + OCR extractor β”œβ”€β”€ extraction_service.py # Field extractor using regex config β”œβ”€β”€ fields_config.json # Custom field patterns β”œβ”€β”€ requirements.txt # Hugging Face Space dependencies └── README.md