metadata
title: DocClassify
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false

Document Classifier

A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.

Features

  • 📄 PDF file upload and processing
  • 🤖 BERT-tiny model for document classification
  • 🎯 Classifies 20+ document types including:
    • Invoice, Receipt, Contract, Resume
    • Letter, Report, Memo, Email
    • Form, Certificate, License, Passport
    • Medical records, Bank statements, Tax documents
    • Legal documents, Academic papers, and more
  • 💾 Model is downloaded and cached locally on first use
  • 🎨 Modern, user-friendly interface

How It Works

  1. The app uses the prajjwal1/bert-tiny model from Hugging Face
  2. On first run, the model is automatically downloaded to the models/ directory
  3. PDF text is extracted using PyPDF2
  4. Document embeddings are computed using BERT-tiny
  5. Similarity scores are calculated between the document embedding and the pre-computed embedding of each document type
  6. The document is assigned the best-matching type, and the top matches are returned with confidence scores
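
Steps 4 to 6 can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names, the toy 3-dimensional vectors (standing in for 128-dimensional BERT-tiny embeddings), and the softmax-with-temperature conversion from similarities to confidence scores are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_embedding, label_embeddings, top_k=5, temperature=0.1):
    """Rank document types by similarity and return (label, confidence) pairs.

    Assumption: confidences come from a softmax over the similarities.
    The real app may compute its confidence scores differently.
    """
    sims = {
        label: cosine_similarity(doc_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    max_sim = max(sims.values())
    exps = {
        label: math.exp((sim - max_sim) / temperature)
        for label, sim in sims.items()
    }
    total = sum(exps.values())
    ranked = sorted(
        ((label, value / total) for label, value in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical pre-computed embeddings for three document types.
LABEL_EMBEDDINGS = {
    "Invoice": [1.0, 0.1, 0.0],
    "Resume": [0.0, 1.0, 0.1],
    "Contract": [0.1, 0.0, 1.0],
}

predictions = classify([0.9, 0.2, 0.1], LABEL_EMBEDDINGS)
for label, confidence in predictions:
    print(f"{label}: {confidence:.3f}")
```

Because the label embeddings are computed once and reused, classifying a new document only costs one forward pass through the model plus one similarity per document type.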

Setup

Local Development

  1. Backend Setup:

    cd backend
    pip install -r requirements.txt
    
  2. Frontend Setup:

    cd frontend
    npm install
    
  3. Run Backend:

    cd backend
    uvicorn app.main:app --reload --port 8000
    
  4. Run Frontend:

    cd frontend
    npm run dev
    
  5. Open http://localhost:5173 in your browser

Docker Deployment

    docker build -t docclassify .
    docker run -p 7860:7860 docclassify

Usage

  1. Click "Select PDF File" to choose a PDF document
  2. Click "Classify Document" to process the file
  3. View the classification result with confidence scores
  4. See top 5 document type predictions

Model Information

  • Model: prajjwal1/bert-tiny
  • Size: ~4.4M parameters
  • Architecture: BERT with 2 transformer layers (L=2) and a hidden size of 128 (H=128)
  • Source: Hugging Face Model Card
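
The ~4.4M figure can be sanity-checked from the architecture. The sketch below counts the weights of a standard BERT encoder with L=2 and H=128. The vocabulary size (30,522), maximum sequence length (512), feed-forward width (4H = 512), and the inclusion of the pooler layer are assumptions taken from the standard BERT configuration, not from this repository.

```python
def bert_param_count(layers=2, hidden=128, vocab=30522, max_pos=512,
                     type_vocab=2, intermediate=None):
    """Count the parameters of a BERT encoder (embeddings, layers, pooler)."""
    if intermediate is None:
        intermediate = 4 * hidden  # standard BERT feed-forward width

    layer_norm = 2 * hidden  # one scale and one bias per hidden unit

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias vector

    # Word, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    per_layer = (
        4 * linear(hidden, hidden)      # query, key, value, attention output
        + layer_norm
        + linear(hidden, intermediate)  # feed-forward up-projection
        + linear(intermediate, hidden)  # feed-forward down-projection
        + layer_norm
    )
    pooler = linear(hidden, hidden)
    return embeddings + layers * per_layer + pooler


total = bert_param_count()
size_mb = total * 4 / 1e6  # float32 weights take 4 bytes each
print(f"{total:,} parameters, about {size_mb:.1f} MB as float32")
```

Under these assumptions the count is 4,385,920 parameters, most of which sit in the word-embedding table. At 4 bytes per weight that is roughly 17.5 MB, in line with the ~17MB download size.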

Technical Stack

  • Backend: FastAPI, PyTorch, Transformers, PyPDF2
  • Frontend: React, Vite
  • Model: BERT-tiny (prajjwal1/bert-tiny)
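
For orientation, here is a hedged sketch of how the backend pieces might fit together when loading the model and embedding a document. The helper names `load_model` and `embed`, the mean-pooling strategy, and the 512-token truncation are assumptions for illustration; the repository's actual implementation may differ. The `transformers` and `torch` imports are done lazily so the module can be imported without them installed.

```python
from pathlib import Path

MODEL_NAME = "prajjwal1/bert-tiny"
CACHE_DIR = Path("models")  # matches the models/ directory mentioned above


def load_model(cache_dir=CACHE_DIR):
    """Download the model on first call, then reuse the local cache."""
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=str(cache_dir))
    model = AutoModel.from_pretrained(MODEL_NAME, cache_dir=str(cache_dir))
    model.eval()  # inference only, disables dropout
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return one embedding vector per document via masked mean pooling.

    Assumption: text beyond 512 tokens is truncated rather than chunked.
    """
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 128)
    # Average token vectors, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.squeeze(0)
```

A typical call would be `tokenizer, model = load_model()` once at startup, followed by `embed(text, tokenizer, model)` per uploaded document.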

Notes

  • The model will be automatically downloaded on first use (~17MB)
  • Classification works best with text-based PDFs
  • Scanned or image-only PDFs without an extractable text layer may not be classified correctly
  • Processing time depends on document size and system resources
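
To illustrate the note about text-based PDFs, the sketch below extracts text with PyPDF2 and checks whether enough of it came back to classify. The function names and the 20-character threshold are hypothetical, and `PdfReader` assumes a recent PyPDF2 release; this is not necessarily how the app handles the case.

```python
def extract_pdf_text(path):
    """Concatenate the text of every page in a PDF using PyPDF2."""
    from PyPDF2 import PdfReader  # imported lazily so the helper below works alone

    reader = PdfReader(path)
    # Image-only pages typically yield an empty string here.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def has_extractable_text(text, min_chars=20):
    """Return True if text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


# Example with already-extracted strings (no PDF file needed):
print(has_extractable_text("Invoice number 12345 due on receipt"))
print(has_extractable_text("  \n\n "))
```

Rejecting near-empty extractions up front gives the user a clear message instead of a low-confidence guess on what is effectively a blank document.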