Spaces:

Really-amin
/

Hoghoghi

Paused

App Files Files Community

Hoghoghi / huggingface_space /README.md

Really-amin

Upload 46 files

922c3ba verified 3 months ago

preview code

raw

history blame contribute delete

4.43 kB

A newer version of the Gradio SDK is available: 5.49.1

Upgrade

Legal Dashboard OCR - Hugging Face Space

AI-powered Persian legal document processing system with advanced OCR capabilities using Hugging Face models.

🚀 Live Demo

This Space provides a web interface for processing Persian legal documents with OCR and AI analysis.

✨ Features

📄 PDF Processing: Upload and extract text from Persian legal documents
🤖 AI Analysis: Intelligent document scoring and categorization
🏷️ Auto-Categorization: AI-driven document category prediction
📊 Dashboard: Real-time analytics and document statistics
💾 Document Storage: Save and manage processed documents
🔍 OCR Pipeline: Advanced text extraction with confidence scoring

🛠️ Usage

1. Upload Document

Click "Upload PDF Document" to select a Persian legal document
Supported formats: PDF files

2. Process Document

Click "🔍 Process PDF" to extract text using OCR
View extracted text, AI analysis, and OCR information
Review confidence scores and processing time

3. Save Document (Optional)

Add document title, source, and category
Click "💾 Process & Save" to store in database
View saved document ID for future reference

4. View Dashboard

Switch to "📊 Dashboard" tab
Click "🔄 Refresh Statistics" to see latest analytics
View total documents, average scores, and top categories

🔧 Technical Details

OCR Models

Microsoft TrOCR: Base model for printed text extraction
Persian Language Support: Optimized for Persian/Farsi documents
Confidence Scoring: Quality assessment for extracted text

AI Scoring Engine

Keyword Relevance: 30% weight
Document Completeness: 25% weight
Recency: 20% weight
Source Credibility: 15% weight
Document Quality: 10% weight

📊 API Endpoints

The system also provides RESTful API endpoints:

POST /api/ocr/process - Process PDF with OCR
POST /api/documents/ - Save processed document
GET /api/dashboard/summary - Get dashboard statistics
GET /api/documents/ - List all documents

🏗️ Architecture

huggingface_space/
├── app.py              # Gradio interface entry point
├── Spacefile           # Hugging Face Space configuration
├── README.md           # This documentation
└── requirements.txt    # Python dependencies

🔍 Troubleshooting

Common Issues

Model Loading: First run may take time to download OCR models
File Size: Large PDFs may take longer to process
Text Quality: Clear, well-scanned documents work best
Language: Optimized for Persian/Farsi text

Performance Tips

Use clear, high-resolution PDF scans
Avoid handwritten text for best results
Process documents during off-peak hours
Check confidence scores for quality assessment

📈 Performance Metrics

OCR Accuracy: 85-95% for clear printed text
Processing Time: 5-30 seconds per page
Model Size: ~1.5GB (automatically cached)
Memory Usage: ~2GB RAM during processing

🔒 Privacy & Security

No Data Retention: Uploaded files are processed temporarily
Secure Processing: All operations run in isolated environment
No External Storage: Files are not stored permanently
Open Source: Full transparency of processing pipeline

🤝 Contributing

This Space is part of the Legal Dashboard OCR project. For contributions:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

📞 Support

For issues or questions:

Check the logs for error messages
Verify PDF format and quality
Test with sample documents first
Review the API documentation

🎯 Future Enhancements

Real-time WebSocket updates
Batch document processing
Advanced AI models
Mobile app integration
User authentication
Document versioning

Built with: Gradio, Hugging Face Transformers, FastAPI, SQLite

Models: Microsoft TrOCR, Custom AI Scoring Engine

Language: Persian/Farsi Legal Documents