setu / docs /pdf_processing.md
khagu's picture
chore: finally untrack large database files
3998131

PDF Processing Module for Bias Detection

Overview

The PDF Processing module (utility/pdf_processor.py) provides a complete pipeline for extracting text from Nepali PDFs and preparing sentences for bias detection analysis.

Key Features:

  • ✓ PDF text extraction using PyMuPDF (fitz)
  • ✓ Intelligent Nepali sentence segmentation
  • ✓ LLM-based sentence refinement using Mistral
  • ✓ Integration with bias detection API
  • ✓ File upload support via API endpoints
  • ✓ Error handling and logging

Architecture

User Upload (PDF)
       ↓
PDFProcessor.process_pdf_from_bytes()
       ↓
[Extract Text] → [Clean Text] → [Split Sentences] → [Refine with LLM]
       ↓
List of Refined Sentences
       ↓
Bias Detection Model
       ↓
Bias Analysis Results

Installation

Required Dependencies

# PyMuPDF for PDF text extraction
pip install pymupdf

# Already included in module_a
# mistralai - Mistral LLM client

Setup

  1. Ensure mistralai is installed in your environment
  2. Set MISTRAL_API_KEY environment variable
  3. Module uses existing MistralClient from module_a/llm_client.py

Usage

1. Basic Python Usage

from utility.pdf_processor import PDFProcessor

# Initialize processor
processor = PDFProcessor()

# Process PDF from file path
result = processor.process_pdf(
    pdf_path="path/to/document.pdf",
    refine_with_llm=True
)

if result["success"]:
    sentences = result["sentences"]
    print(f"Extracted {result['total_sentences']} sentences")
    for sentence in sentences:
        print(f"- {sentence}")

2. Process from Bytes (File Uploads)

# For API file uploads
processor = PDFProcessor()

pdf_bytes = await request.file.read()
result = processor.process_pdf_from_bytes(
    pdf_bytes=pdf_bytes,
    refine_with_llm=True
)

3. API Endpoints

A. Extract Sentences Only

Endpoint: POST /api/v1/process-pdf

Request:

curl -X POST "http://localhost:8000/api/v1/process-pdf" \
  -F "file=@nepali_document.pdf" \
  -F "refine_with_llm=true"

Response:

{
  "success": true,
  "sentences": [
    "पहिलो वाक्य यहाँ छ।",
    "दोस्रो वाक्य यहाँ छ।",
    "तेस्रो वाक्य यहाँ छ।"
  ],
  "total_sentences": 3,
  "filename": "nepali_document.pdf",
  "raw_text": "पहिलो वाक्य यहाँ छ। दोस्रो वाक्य यहाँ छ। तेस्रो वाक्य यहाँ छ।"
}

B. Extract Sentences + Bias Detection

Endpoint: POST /api/v1/process-pdf-to-bias

Request:

curl -X POST "http://localhost:8000/api/v1/process-pdf-to-bias" \
  -F "file=@nepali_document.pdf" \
  -F "refine_with_llm=true" \
  -F "confidence_threshold=0.7"

Response:

{
  "success": true,
  "total_sentences": 3,
  "biased_count": 1,
  "neutral_count": 2,
  "results": [
    {
      "sentence": "पहिलो वाक्य यहाँ छ।",
      "category": "neutral",
      "confidence": 0.95,
      "is_biased": false
    },
    {
      "sentence": "दोस्रो वाक्य यहाँ छ।",
      "category": "gender",
      "confidence": 0.82,
      "is_biased": true
    },
    {
      "sentence": "तेस्रो वाक्य यहाँ छ।",
      "category": "neutral",
      "confidence": 0.91,
      "is_biased": false
    }
  ],
  "filename": "nepali_document.pdf"
}

C. Service Health Check

Endpoint: GET /api/v1/pdf-health

Response:

{
  "status": "healthy",
  "pdf_processor": "ready",
  "mistral_client": "connected",
  "features": {
    "pdf_extraction": true,
    "sentence_segmentation": true,
    "llm_refinement": true
  }
}

API Schemas

PDFProcessingResponse

{
    "success": bool,
    "sentences": List[str],
    "total_sentences": int,
    "raw_text": Optional[str],
    "error": Optional[str],
    "filename": Optional[str]
}

PDFToBiasDetectionResponse

{
    "success": bool,
    "total_sentences": int,
    "biased_count": int,
    "neutral_count": int,
    "results": List[BiasResult],
    "error": Optional[str],
    "filename": Optional[str]
}

Where BiasResult:

{
    "sentence": str,
    "category": str,
    "confidence": float,
    "is_biased": bool
}

Processing Pipeline

Step 1: Text Extraction

  • Uses PyMuPDF (fitz) to extract text from PDF
  • Handles multi-page documents
  • Detects image-based PDFs (requires OCR)

Step 2: Text Cleaning

  • Removes extra whitespace
  • Normalizes newlines
  • Fixes formatting issues

Step 3: Sentence Segmentation

  • Uses regex patterns for Nepali sentence boundaries
  • Recognizes: । (danda), . , ! , ?
  • Filters out short fragments (< 5 characters)

Step 4: LLM Refinement (Optional)

  • Sends sentences to Mistral LLM
  • Corrects mis-segmented sentences
  • Removes duplicates
  • Returns properly formatted JSON array

Configuration

Environment Variables

# Required for LLM refinement
export MISTRAL_API_KEY="your-api-key"

# Optional
export MISTRAL_MODEL="mistral-small"  # Default: mistral-small
export LOG_LEVEL="INFO"

Processing Options

# With LLM refinement (more accurate, slower)
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True  # Uses Mistral LLM
)

# Without LLM refinement (faster, regex-based)
result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=False  # Regex-based segmentation only
)

Error Handling

The module handles various error scenarios:

result = processor.process_pdf(pdf_path="file.pdf")

if not result["success"]:
    error = result["error"]
    # Possible errors:
    # - "No text could be extracted from the PDF"
    # - "Could not segment sentences from extracted text"
    # - "PDF might be image-based (requires OCR)"
    # - "File not found: path/to/file.pdf"

Performance Considerations

Execution Time Estimates

Operation Time Notes
PDF Text Extraction ~100-500ms Depends on PDF size
Sentence Segmentation ~50-200ms Regex-based
LLM Refinement ~2-5s API call to Mistral
Total (with LLM) ~3-6s Per document
Total (without LLM) ~150-700ms Regex only

Optimization Tips

  1. Disable LLM refinement for faster processing when accuracy is less critical
  2. Batch multiple PDFs to amortize API overhead
  3. Cache results if processing same PDFs repeatedly

Integration with Bias Detection

Workflow

1. User uploads PDF
   ↓
2. Extract sentences using PDFProcessor
   ↓
3. Send sentences to Bias Detection model
   ↓
4. Classify each sentence (neutral/gender/caste/religion/etc.)
   ↓
5. Return analysis results to user

Code Example

from utility.pdf_processor import PDFProcessor
from api.routes.bias_detection import run_bias_detection

processor = PDFProcessor()

# Process PDF
pdf_result = processor.process_pdf(
    pdf_path="document.pdf",
    refine_with_llm=True
)

if pdf_result["success"]:
    sentences = pdf_result["sentences"]
    combined_text = " ".join(sentences)
    
    # Run bias detection
    bias_result = run_bias_detection(
        text=combined_text,
        confidence_threshold=0.7
    )
    
    print(f"Biased sentences: {bias_result.biased_count}")
    print(f"Neutral sentences: {bias_result.neutral_count}")

Nepali Language Support

Character Range Supported

The module recognizes Nepali character ranges:

  • Consonants: अ-ह
  • Vowels: ा-ौ
  • Special characters: ँ-ॿ

Sentence Boundaries

Recognized punctuation:

  • Danda (primary Nepali punctuation)
  • . Period
  • ! Exclamation mark
  • ? Question mark

Logging

Enable debug logging to track processing:

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("utility.pdf_processor")

# Now see detailed logs
processor = PDFProcessor()
result = processor.process_pdf("document.pdf")

Files Structure

utility/
├── __init__.py                 # Module initialization
├── pdf_processor.py            # Main PDFProcessor class
├── pdf_processor_examples.py   # Usage examples
└── docs/
    └── pdf_processing.md       # This file

api/
├── routes/
│   └── pdf_processing.py       # API endpoints
└── schemas.py                  # Pydantic models

Troubleshooting

Issue: "No text could be extracted from the PDF"

Cause: PDF is image-based (scanned document) Solution: Requires OCR support (future enhancement)

Issue: "LLM refinement failed"

Cause: Mistral API key missing or network error Solution: Check MISTRAL_API_KEY environment variable

Issue: Sentences are too short or fragmented

Solution: Sentences shorter than 5 characters are filtered. Adjust threshold in code if needed.

Issue: Slow processing with LLM

Solution:

  • Disable LLM refinement (refine_with_llm=False) for speed
  • Use smaller batch sizes
  • Check network latency to Mistral API

Future Enhancements

  • OCR support for scanned PDFs
  • Language detection and auto-switching
  • Caching layer for repeated PDFs
  • Batch processing optimization
  • Support for other document formats (DOCX, TXT)
  • Custom Nepali dictionary for better segmentation

License

Part of Nepal Justice Weaver project