
title: DocClassify
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false

Document Classifier

A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and the app returns the predicted document type with confidence scores.

Features

📄 PDF file upload and processing
🤖 BERT-tiny model for document classification
🎯 Classifies 20+ document types including:
- Invoice, Receipt, Contract, Resume
- Letter, Report, Memo, Email
- Form, Certificate, License, Passport
- Medical records, Bank statements, Tax documents
- Legal documents, Academic papers, and more
💾 Model is downloaded and cached locally on first use
🎨 Modern, user-friendly interface

How It Works

The app uses the prajjwal1/bert-tiny model from Hugging Face
On first run, the model is automatically downloaded to the models/ directory
PDF text is extracted using PyPDF2
Document embeddings are computed using BERT-tiny
Similarity scores are calculated against pre-computed document type embeddings
The document is assigned the best-matching type, and the top predictions are returned with confidence scores
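
The steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names `embed_text`, `cosine_similarity`, and `rank_document_types` are hypothetical, and both the mean-pooling strategy and the use of raw cosine similarity as the score are assumptions. The README does not say how the app pools token vectors or turns similarities into confidence scores.

```python
import math
from typing import Dict, List, Sequence, Tuple


def embed_text(text: str, cache_dir: str = "models") -> List[float]:
    """Embed text with prajjwal1/bert-tiny using mean pooling (an assumed strategy)."""
    # Heavy dependencies are imported lazily so the ranking helpers below
    # can be used without PyTorch or Transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # cache_dir stores the downloaded files locally, so later runs reuse them.
    tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny", cache_dir=cache_dir)
    model = AutoModel.from_pretrained("prajjwal1/bert-tiny", cache_dir=cache_dir)
    model.eval()

    # BERT accepts at most 512 tokens, so longer documents are truncated here.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, 128)

    # Average the token vectors, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled[0].tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_document_types(
    doc_embedding: Sequence[float],
    type_embeddings: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (type name, similarity) pairs, best match first."""
    scores = [
        (name, cosine_similarity(doc_embedding, embedding))
        for name, embedding in type_embeddings.items()
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]
```

In this sketch the type embeddings would be computed once, for example by calling `embed_text` on a short description of each document type, and then reused for every uploaded PDF.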

Setup

Local Development

Backend Setup:

```
cd backend
pip install -r requirements.txt
```

Frontend Setup:
```
cd frontend
npm install
```

Run Backend:

```
cd backend
uvicorn app.main:app --reload --port 8000
```

Run Frontend:
```
cd frontend
npm run dev
```

Then open http://localhost:5173 in your browser.

Docker Deployment

```
docker build -t docclassify .
docker run -p 7860:7860 docclassify
```

Usage

Click "Select PDF File" to choose a PDF document
Click "Classify Document" to process the file
View the classification result with confidence scores
See top 5 document type predictions

Model Information

Model: prajjwal1/bert-tiny
Size: ~4.4M parameters
Architecture: BERT (L=2, H=128)
Source: Hugging Face Model Card
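
As a rough consistency check, the parameter count above gives an estimate of the weight file size. The calculation below assumes the weights are stored as 32-bit floats (4 bytes per parameter); the stored precision is an assumption, not something this README states.

```python
# Back-of-the-envelope estimate of the weight file size.
# Assumes 32-bit floats (4 bytes per parameter); this precision is an assumption.
PARAMETERS = 4_400_000
BYTES_PER_PARAMETER = 4

size_mb = PARAMETERS * BYTES_PER_PARAMETER / 1_000_000
print(f"Estimated weight size: {size_mb:.1f} MB")  # prints "Estimated weight size: 17.6 MB"
```

That estimate is in line with the roughly 17 MB download mentioned in the Notes.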

Technical Stack

Backend: FastAPI, PyTorch, Transformers, PyPDF2
Frontend: React, Vite
Model: BERT-tiny (prajjwal1/bert-tiny)

Notes

The model will be automatically downloaded on first use (~17MB)
Classification works best with text-based PDFs
Image-based (scanned) PDFs may not work: PyPDF2 reads only text embedded in the file and does not perform OCR
Processing time depends on document size and system resources
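
One way to detect PDFs without extractable text before classifying them is sketched below. The helper names `extract_page_texts` and `has_extractable_text` and the `min_chars` threshold are hypothetical, and the README does not say whether the app performs such a check. The sketch uses the `PdfReader` class available in PyPDF2 2.x and later.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the extracted text of each page (empty string where none is found)."""
    # Imported lazily so has_extractable_text works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Guard against pages that yield no text at all.
    return [(page.extract_text() or "") for page in reader.pages]


def has_extractable_text(page_texts: List[str], min_chars: int = 50) -> bool:
    """Heuristic: treat the PDF as text-based if it yields at least min_chars characters."""
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars
```

A caller could run `has_extractable_text(extract_page_texts(path))` and return a clear error message for scanned documents instead of classifying an empty string.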