metadata
title: DocClassify
emoji: π
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
Document Classifier
A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.
Features
- PDF file upload and processing
- BERT-tiny model for document classification
- Classifies 20+ document types including:
  - Invoice, Receipt, Contract, Resume
  - Letter, Report, Memo, Email
  - Form, Certificate, License, Passport
  - Medical records, Bank statements, Tax documents
  - Legal documents, Academic papers, and more
- Model is downloaded and cached locally on first use
- Modern, user-friendly interface
How It Works
- The app uses the prajjwal1/bert-tiny model from Hugging Face
- On first run, the model is automatically downloaded to the models/ directory
- PDF text is extracted using PyPDF2
- Document embeddings are computed using BERT-tiny
- Similarity scores are calculated against pre-computed document type embeddings
- The document is classified with confidence scores
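The steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names, the mean pooling, the use of cosine similarity, and the idea of embedding a short text description per document type are all assumptions, since the README only states that document embeddings are compared against pre-computed type embeddings.

```python
import math
from functools import lru_cache
from typing import Dict, List, Tuple

MODEL_NAME = "prajjwal1/bert-tiny"


def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page using PyPDF2."""
    from PyPDF2 import PdfReader  # lazy import; needs PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


@lru_cache(maxsize=1)
def load_model(cache_dir: str = "models"):
    """Download the model on first call and reuse the files cached in cache_dir."""
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=cache_dir)
    model = AutoModel.from_pretrained(MODEL_NAME, cache_dir=cache_dir)
    model.eval()
    return tokenizer, model


def embed(text: str, cache_dir: str = "models") -> List[float]:
    """Mean-pool the last hidden state into one vector (pooling choice is an assumption)."""
    import torch

    tokenizer, model = load_model(cache_dir)
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, 128)
    return hidden.mean(dim=1).squeeze(0).tolist()


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_types(doc_vec: List[float], type_vecs: Dict[str, List[float]],
               top_k: int = 5) -> List[Tuple[str, float]]:
    """Score the document against every type embedding, best match first."""
    scored = [(label, cosine(doc_vec, vec)) for label, vec in type_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def classify_pdf(pdf_path: str, type_descriptions: Dict[str, str]) -> List[Tuple[str, float]]:
    """Hypothetical glue: embed a description per type, then rank the PDF against them."""
    type_vecs = {label: embed(desc) for label, desc in type_descriptions.items()}
    return rank_types(embed(extract_text(pdf_path)), type_vecs)
```

Heavy dependencies are imported inside the functions that need them, so the similarity and ranking logic can be exercised without loading the model. In a real service the type embeddings would be computed once at startup rather than on every request.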
Setup
Local Development
Backend Setup:
cd backend
pip install -r requirements.txt
Frontend Setup:
cd frontend
npm install
Run Backend:
cd backend
uvicorn app.main:app --reload --port 8000
Run Frontend:
cd frontend
npm run dev
Open http://localhost:5173 in your browser
Docker Deployment
docker build -t docclassify .
docker run -p 7860:7860 docclassify
Usage
- Click "Select PDF File" to choose a PDF document
- Click "Classify Document" to process the file
- View the classification result with confidence scores
- See top 5 document type predictions
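One plausible way to turn raw similarity scores into the confidence scores and top 5 predictions shown in the interface is a softmax over the scores. The sketch below assumes that approach; the function name, the temperature value, and the sample scores are illustrative and not taken from the app.

```python
import math
from typing import Dict, List, Tuple


def top_predictions(similarities: Dict[str, float], k: int = 5,
                    temperature: float = 0.1) -> List[Tuple[str, float]]:
    """Softmax the similarity scores and return the k most likely labels."""
    if not similarities:
        return []
    scaled = {label: score / temperature for label, score in similarities.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [(label, value / total) for label, value in ranked[:k]]


# Made-up similarity scores for illustration, not real model output.
scores = {"Invoice": 0.82, "Receipt": 0.79, "Contract": 0.41,
          "Resume": 0.30, "Letter": 0.28, "Memo": 0.25}
for label, confidence in top_predictions(scores):
    print(f"{label}: {confidence:.1%}")
```

A lower temperature makes the top prediction stand out more; a higher one spreads confidence more evenly across similar types.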
Model Information
- Model: prajjwal1/bert-tiny
- Size: ~4.4M parameters
- Architecture: BERT (L=2, H=128)
- Source: Hugging Face Model Card
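The parameter count can be sanity-checked from the architecture figures alone. The sketch below assumes the standard BERT configuration (vocabulary of 30,522 tokens, 512 positions, 2 segment types, feed-forward width of 4×H, and a pooler layer); those values are assumptions based on the usual BERT defaults rather than something stated in this README.

```python
def bert_param_count(layers: int, hidden: int, vocab: int = 30522,
                     max_positions: int = 512, type_vocab: int = 2) -> int:
    """Count the parameters of a standard BERT encoder with pooler (no task head)."""
    intermediate = 4 * hidden  # standard BERT feed-forward width
    layer_norm = 2 * hidden    # scale and bias

    # Token, position, and segment embeddings plus one LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V, and output projections (weights + biases) plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (weights + biases) plus one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


params = bert_param_count(layers=2, hidden=128)
print(f"{params:,} parameters")                 # 4,385,920 parameters
print(f"{params * 4 / 1e6:.1f} MB as float32")  # 17.5 MB as float32
```

Under these assumptions the count rounds to ~4.4M, and almost all of it sits in the token embedding table. At 4 bytes per float32 weight this comes to roughly 17.5 MB.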
Technical Stack
- Backend: FastAPI, PyTorch, Transformers, PyPDF2
- Frontend: React, Vite
- Model: BERT-tiny (prajjwal1/bert-tiny)
Notes
- The model will be automatically downloaded on first use (~17MB)
- Classification works best with text-based PDFs
- Image-based (scanned) PDFs may fail to classify if they contain no extractable text
- Processing time depends on document size and system resources
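Since classification depends on extractable text, a quick pre-check can flag image-only PDFs before they reach the model. The following is a minimal sketch, assuming PyPDF2 2.0 or later; the helper names and the 20-character threshold are hypothetical and not part of the app.

```python
from typing import Iterable, Iterator, Optional


def has_extractable_text(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> bool:
    """Return True if the pages hold at least min_chars non-whitespace characters."""
    total = sum(len("".join((text or "").split())) for text in page_texts)
    return total >= min_chars


def pdf_page_texts(pdf_path: str) -> Iterator[str]:
    """Yield the text of each page via PyPDF2; empty results become ""."""
    from PyPDF2 import PdfReader  # lazy import; needs PyPDF2 >= 2.0

    for page in PdfReader(pdf_path).pages:
        yield page.extract_text() or ""
```

A caller might run `has_extractable_text(pdf_page_texts(path))` on an upload and return a clear error message when it is False, instead of classifying an empty string.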