Fine-Tuning System Implementation Plan
Overview
Implement an active learning system that collects admin corrections, builds a training dataset, and fine-tunes the BART classification model using LoRA (Low-Rank Adaptation).
Phase 1: Training Data Collection Infrastructure
1.1 Database Schema Extensions
New Model: TrainingExample
- id (Integer, PK)
- submission_id (Integer, FK to Submission)
- message (Text) - snapshot of submission text
- original_category (String, nullable) - AI's initial prediction
- corrected_category (String) - Admin's correction
- contributor_type (String)
- correction_timestamp (DateTime)
- confidence_score (Float, nullable) - original prediction confidence
- used_in_training (Boolean, default=False) - track if used in fine-tuning
- training_run_id (Integer, nullable, FK) - which training run used this
New Model: FineTuningRun
- id (Integer, PK)
- created_at (DateTime)
- status (String) - 'preparing', 'training', 'evaluating', 'completed', 'failed'
- num_training_examples (Integer)
- num_validation_examples (Integer)
- num_test_examples (Integer)
- training_config (JSON) - hyperparameters, LoRA config
- results (JSON) - metrics (accuracy, loss, per-category F1)
- model_path (String, nullable) - path to saved LoRA weights
- is_active_model (Boolean) - currently deployed model
- improvement_over_baseline (Float, nullable)
- completed_at (DateTime, nullable)
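A minimal SQLAlchemy sketch of the two models, assuming the app's existing Flask-SQLAlchemy db instance; the import path, table names, and column lengths are illustrative rather than taken from the current codebase:

from datetime import datetime
from app import db  # assumed Flask-SQLAlchemy instance

class TrainingExample(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    submission_id = db.Column(db.Integer, db.ForeignKey("submission.id"))  # assumed table name
    message = db.Column(db.Text, nullable=False)                  # snapshot of submission text
    original_category = db.Column(db.String(64), nullable=True)   # AI's initial prediction
    corrected_category = db.Column(db.String(64), nullable=False)
    contributor_type = db.Column(db.String(64))
    correction_timestamp = db.Column(db.DateTime, default=datetime.utcnow)
    confidence_score = db.Column(db.Float, nullable=True)
    used_in_training = db.Column(db.Boolean, default=False)
    training_run_id = db.Column(db.Integer, db.ForeignKey("fine_tuning_run.id"), nullable=True)

class FineTuningRun(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    created_at = db.Column(db.DateTime, default=datetime.utcnow)
    status = db.Column(db.String(32), default="preparing")
    num_training_examples = db.Column(db.Integer)
    num_validation_examples = db.Column(db.Integer)
    num_test_examples = db.Column(db.Integer)
    training_config = db.Column(db.JSON)    # hyperparameters, LoRA config
    results = db.Column(db.JSON)            # accuracy, loss, per-category F1
    model_path = db.Column(db.String(256), nullable=True)
    is_active_model = db.Column(db.Boolean, default=False)
    improvement_over_baseline = db.Column(db.Float, nullable=True)
    completed_at = db.Column(db.DateTime, nullable=True)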
1.2 Admin Routes Extension (app/routes/admin.py)
Modify update_category endpoint:
- When admin changes category, create TrainingExample record
- Capture: original prediction, corrected category, confidence score
- Track whether it's a correction (different from AI) or confirmation (same)
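A hedged sketch of the modified handler; the blueprint, route path, request shape, and Submission fields are assumptions about the existing code, not a definitive implementation:

from datetime import datetime
from flask import Blueprint, request, jsonify

from app import db  # assumed Flask-SQLAlchemy instance
from app.models.models import Submission, TrainingExample

admin_bp = Blueprint("admin", __name__)  # assumed to exist already in admin.py

@admin_bp.route("/admin/api/update-category/<int:submission_id>", methods=["POST"])
def update_category(submission_id):
    submission = Submission.query.get_or_404(submission_id)
    new_category = request.json["category"]

    # Snapshot the correction (or confirmation) as a TrainingExample.
    example = TrainingExample(
        submission_id=submission.id,
        message=submission.message,
        original_category=submission.category,        # AI's prediction before the edit
        corrected_category=new_category,
        contributor_type=submission.contributor_type,
        confidence_score=submission.confidence_score,
        correction_timestamp=datetime.utcnow(),
    )
    db.session.add(example)

    submission.category = new_category
    db.session.commit()

    # A correction differs from the AI prediction; a confirmation matches it.
    return jsonify({"status": "ok",
                    "is_correction": example.original_category != new_category})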
New endpoints:
- GET /admin/training-data - View collected training examples
- GET /admin/api/training-stats - Stats on corrections collected
- DELETE /admin/api/training-example/<id> - Remove bad examples
Phase 2: Fine-Tuning Configuration UI
2.1 New Admin Page: Training Dashboard (app/templates/admin/training.html)
Sections:
Training Data Stats
- Total corrections collected
- Per-category distribution
- Corrections vs confirmations ratio
- Data quality indicators (duplicates, conflicts)
Fine-Tuning Controls (enabled when ≥20 examples)
- Configure training parameters:
- Minimum examples threshold (default: 20)
- Train/Val/Test split (e.g., 70/15/15)
- LoRA rank (r=8, 16, 32)
- Learning rate (1e-4 to 5e-4)
- Number of epochs (3-5)
- "Start Fine-Tuning" button (with confirmation)
Training History
- Table of past FineTuningRun records
- Show: date, examples used, accuracy, status
- Actions: View details, Deploy model, Export weights
Active Model Indicator
- Show which model is currently in use
- Option to rollback to base model
2.2 Settings Extension
- fine_tuning_enabled (Boolean) - master switch
- min_training_examples (Integer, default: 20)
- auto_train (Boolean, default: False) - auto-trigger when threshold reached
Phase 3: Fine-Tuning Engine
3.1 New Module: app/fine_tuning/trainer.py
Class: BARTFineTuner
Methods:
prepare_dataset(training_examples)
- Convert TrainingExample records to HuggingFace Dataset
- Create train/val/test splits (stratified by category)
- Tokenize texts for BART
- Return: train_dataset, val_dataset, test_dataset
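A sketch of prepare_dataset(), assuming the TrainingExample objects expose message and corrected_category attributes; the category list, max length, and split ratios are illustrative:

from datasets import Dataset, ClassLabel
from transformers import AutoTokenizer

CATEGORIES = ["bug", "feature", "question", "praise", "complaint", "other"]  # assumed label set

def prepare_dataset(training_examples, tokenizer_name="facebook/bart-large-mnli",
                    val_size=0.15, test_size=0.15, seed=42):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    label2id = {c: i for i, c in enumerate(CATEGORIES)}

    ds = Dataset.from_dict({
        "text": [ex.message for ex in training_examples],
        "label": [label2id[ex.corrected_category] for ex in training_examples],
    })
    # ClassLabel type is required for stratified splitting in `datasets`.
    ds = ds.cast_column("label", ClassLabel(names=CATEGORIES))
    ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                batched=True)

    # Two-step stratified split: carve off the test set, then the validation set.
    first = ds.train_test_split(test_size=test_size, seed=seed, stratify_by_column="label")
    second = first["train"].train_test_split(
        test_size=val_size / (1.0 - test_size), seed=seed, stratify_by_column="label")
    return second["train"], second["test"], first["test"]  # train, val, test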
setup_lora_model(base_model_name, lora_config)
- Load base BART model (facebook/bart-large-mnli)
- Apply PEFT (Parameter-Efficient Fine-Tuning) with LoRA
- LoRA configuration:
{ "r": 16, # rank "lora_alpha": 32, "target_modules": ["q_proj", "v_proj"], # attention layers "lora_dropout": 0.1, "bias": "none" }
train(train_dataset, val_dataset, config)
- Use HuggingFace Trainer with custom loss
- Multi-class cross-entropy loss
- Metrics: accuracy, F1 per category, confusion matrix
- Early stopping on validation loss
- Save checkpoints to /data/models/finetuned/run_{id}/
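A sketch of train() using the HuggingFace Trainer with early stopping on validation loss; the hyperparameter values are illustrative and some TrainingArguments names vary across transformers versions:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import (Trainer, TrainingArguments, EarlyStoppingCallback,
                          DataCollatorWithPadding)

def train(model, tokenizer, train_dataset, val_dataset, run_id, epochs=4, lr=2e-4):
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        if isinstance(logits, tuple):   # BART can return extra tensors alongside the logits
            logits = logits[0]
        preds = np.argmax(logits, axis=-1)
        return {"accuracy": accuracy_score(labels, preds),
                "macro_f1": f1_score(labels, preds, average="macro")}

    args = TrainingArguments(
        output_dir=f"/data/models/finetuned/run_{run_id}/",
        num_train_epochs=epochs,
        learning_rate=lr,
        per_device_train_batch_size=8,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
    trainer.train()
    return trainer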
evaluate(test_dataset, model)
- Run predictions on test set
- Calculate: accuracy, precision, recall, F1 (macro/micro)
- Generate confusion matrix
- Compare to baseline (zero-shot) performance
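A sketch of evaluate() on the held-out test split; the baseline accuracy is assumed to come from a zero-shot run over the same examples:

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(trainer, test_dataset, baseline_accuracy=None):
    output = trainer.predict(test_dataset)
    logits = output.predictions[0] if isinstance(output.predictions, tuple) else output.predictions
    preds = np.argmax(logits, axis=-1)
    labels = output.label_ids

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0)
    per_class_f1 = precision_recall_fscore_support(
        labels, preds, average=None, zero_division=0)[2]

    results = {
        "accuracy": accuracy_score(labels, preds),
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
        "per_category_f1": per_class_f1.tolist(),
        "confusion_matrix": confusion_matrix(labels, preds).tolist(),
    }
    if baseline_accuracy is not None:
        results["improvement_over_baseline"] = results["accuracy"] - baseline_accuracy
    return results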
export_model(run_id, destination_path)
- Save LoRA adapter weights
- Save tokenizer config
- Create model card with metrics
- Package for backup/deployment
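A sketch of export_model(): PEFT's save_pretrained writes the adapter weights and adapter_config.json, with metrics and a model card written alongside them; the run object and its results field are assumptions carried over from Phase 1:

import json
from pathlib import Path

def export_model(peft_model, tokenizer, run, destination_path):
    dest = Path(destination_path)
    dest.mkdir(parents=True, exist_ok=True)

    peft_model.save_pretrained(dest)   # adapter weights + adapter_config.json
    tokenizer.save_pretrained(dest)    # tokenizer config

    (dest / "metrics.json").write_text(json.dumps(run.results, indent=2))
    (dest / "README.md").write_text(
        f"# Fine-tuned BART classifier (run {run.id})\n\n"
        f"Base model: facebook/bart-large-mnli (LoRA adapter)\n"
        f"Test accuracy: {run.results.get('accuracy')}\n"
    )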
Alternative Approach: Output Layer Fine-Tuning
- Option to only train final classification head
- Faster, less prone to overfitting
- Good for small datasets (20-50 examples)
3.2 Background Task Handler (app/fine_tuning/tasks.py)
- Fine-tuning runs in background (avoid blocking Flask)
- Options:
- Simple Threading (for development)
- Celery (for production) - requires Redis/RabbitMQ
- HF Spaces Gradio Jobs (if deploying to HF)
Status Updates:
- Update FineTuningRun.status in real-time
- Store progress in Settings table for UI polling
- Log to file for debugging
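A sketch of the simple-threading option: training runs in a daemon thread and writes status back to FineTuningRun so the UI can poll; the BARTFineTuner.run() orchestration call and the import paths are hypothetical:

import logging
import threading

def start_training_in_background(app, run_id, config):
    def _worker():
        with app.app_context():   # the thread needs its own Flask application context
            from app import db
            from app.models.models import FineTuningRun, TrainingExample
            from app.fine_tuning.trainer import BARTFineTuner

            run = FineTuningRun.query.get(run_id)
            try:
                run.status = "training"
                db.session.commit()

                examples = TrainingExample.query.filter_by(used_in_training=False).all()
                results = BARTFineTuner().run(examples, config)  # hypothetical orchestration method

                run.results = results
                run.status = "completed"
            except Exception:
                logging.exception("Fine-tuning run %s failed", run_id)
                run.status = "failed"
            finally:
                db.session.commit()

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread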
Phase 4: Model Deployment & Versioning
4.1 Model Manager (app/fine_tuning/model_manager.py)
Class: ModelManager
get_active_model()
- Check if fine-tuned model is deployed
- Load LoRA weights if available
- Fallback to base model
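A sketch of get_active_model(): load the LoRA adapter on top of the base model when a run is deployed, otherwise fall back to the zero-shot baseline; the label count and import path are assumptions:

from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE_MODEL = "facebook/bart-large-mnli"

def get_active_model():
    from app.models.models import FineTuningRun  # model defined in Phase 1

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    active_run = FineTuningRun.query.filter_by(is_active_model=True).first()

    if active_run and active_run.model_path:
        base = AutoModelForSequenceClassification.from_pretrained(
            BASE_MODEL, num_labels=6, ignore_mismatched_sizes=True)
        model = PeftModel.from_pretrained(base, active_run.model_path)  # restores adapter + head
        return model, tokenizer, f"finetuned-run-{active_run.id}"

    # Fallback: plain zero-shot BART-MNLI baseline
    return AutoModelForSequenceClassification.from_pretrained(BASE_MODEL), tokenizer, "base"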
deploy_model(run_id)
- Set FineTuningRun.is_active_model = True
- Update Settings: active_model_id
- Reload analyzer with new model
- Create deployment snapshot
rollback_to_baseline()
- Deactivate all fine-tuned models
- Reload base BART model
- Log rollback event
compare_models(run_id_1, run_id_2, test_dataset)
- Side-by-side comparison
- Statistical significance tests
- A/B testing support (future)
4.2 Analyzer Modification (app/analyzer.py)
Update SubmissionAnalyzer.__init__:
- Check for active fine-tuned model
- Load LoRA adapter if available
- Track model version being used
Add method: get_model_info()
- Return: model type (base/finetuned), version, metrics
Store prediction metadata:
- Add confidence scores to all predictions
- Track which model version made prediction
Phase 5: Validation & Quality Assurance
5.1 Cross-Validation
- K-fold cross-validation (k=5) for small datasets
- Stratified splits to ensure category balance
- Report: mean ± std accuracy across folds
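A sketch of the k-fold loop; train_and_score is a hypothetical callback that fine-tunes on one fold and returns validation accuracy:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(texts, labels, train_and_score, k=5, seed=42):
    """train_and_score(train_texts, train_labels, val_texts, val_labels) -> accuracy"""
    texts, labels = np.array(texts), np.array(labels)
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = [
        train_and_score(texts[tr], labels[tr], texts[va], labels[va])
        for tr, va in skf.split(texts, labels)
    ]
    return float(np.mean(scores)), float(np.std(scores))  # report mean ± std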
5.2 Minimum Viable Training Set
Data Requirements:
- At least 3 examples per category (18 total)
- Recommended: 5+ examples per category (30 total)
- Warn if severe class imbalance (>5:1 ratio)
5.3 Quality Checks
- Detect duplicate texts
- Detect conflicting labels (same text, different categories)
- Flag suspiciously short/long texts
- Admin review interface for cleanup
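A sketch of the pre-training quality checks for duplicates, conflicting labels, and outlier lengths; the length thresholds are illustrative:

from collections import defaultdict

def validate_training_data(examples, min_chars=10, max_chars=5000):
    issues = {"duplicates": [], "conflicts": [], "length_outliers": []}
    by_text = defaultdict(list)

    for ex in examples:
        by_text[ex.message.strip().lower()].append(ex)
        if not (min_chars <= len(ex.message) <= max_chars):
            issues["length_outliers"].append(ex.id)

    for group in by_text.values():
        if len(group) > 1:
            issues["duplicates"].extend(ex.id for ex in group[1:])
            # Same text labelled with different categories needs admin review.
            if len({ex.corrected_category for ex in group}) > 1:
                issues["conflicts"].extend(ex.id for ex in group)

    return issues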
5.4 Success Criteria
Model is deployed if:
- Test accuracy > baseline accuracy + 5%
- OR per-category F1 improved for majority of categories
- AND no category has F1 < 0.3 (catch catastrophic forgetting)
If criteria not met:
- Keep base model active
- Suggest: collect more data, adjust hyperparameters
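A sketch of the deployment gate, directly encoding the criteria above; both results and baseline are metric dicts shaped like the evaluate() output:

def should_deploy(results, baseline, min_f1=0.3, min_accuracy_gain=0.05):
    accuracy_gain = results["accuracy"] - baseline["accuracy"]
    improved = sum(f > b for f, b in zip(results["per_category_f1"],
                                         baseline["per_category_f1"]))
    majority_improved = improved > len(results["per_category_f1"]) / 2
    no_collapse = all(f >= min_f1 for f in results["per_category_f1"])  # catch catastrophic forgetting
    return (accuracy_gain > min_accuracy_gain or majority_improved) and no_collapse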
Phase 6: Export & Backup
6.1 Model Export
Format Options:
- HuggingFace Hub - push LoRA adapter to private repo
- Local Files - save to /data/models/exports/
- Download via UI - ZIP file with weights + config
Export Contents:
- LoRA adapter weights (adapter_model.bin)
- Adapter config (adapter_config.json)
- Training metrics (metrics.json)
- Training examples used (training_data.json)
- Model card (README.md)
6.2 Import Pre-trained Model
- Upload ZIP with LoRA weights
- Validate compatibility with base model
- Deploy to production
Technical Implementation Details
Dependencies to Add (requirements.txt)
peft>=0.7.0 # LoRA implementation
datasets>=2.14.0 # HuggingFace datasets
scikit-learn>=1.3.0 # cross-validation, metrics
matplotlib>=3.7.0 # confusion matrix plotting
seaborn>=0.12.0 # visualization
accelerate>=0.24.0 # training optimization
evaluate>=0.4.0 # evaluation metrics
File Structure
app/
├── fine_tuning/
│ ├── __init__.py
│ ├── trainer.py # BARTFineTuner class
│ ├── model_manager.py # Model deployment logic
│ ├── tasks.py # Background job handler
│ ├── metrics.py # Custom evaluation metrics
│ └── data_validator.py # Training data QA
├── models/
│ └── models.py # Add TrainingExample, FineTuningRun
├── routes/
│ └── admin.py # Add training endpoints
├── templates/admin/
│ └── training.html # Training dashboard UI
└── analyzer.py # Update to support LoRA models
/data/models/ # Persistent storage (HF Spaces)
├── finetuned/
│ ├── run_1/
│ ├── run_2/
│ └── ...
└── exports/
API Endpoints Summary
- GET /admin/training - Training dashboard page
- GET /admin/api/training-stats - Get correction stats
- GET /admin/api/training-examples - List training data
- DELETE /admin/api/training-example/<id> - Remove example
- POST /admin/api/start-training - Trigger fine-tuning
- GET /admin/api/training-status/<run_id> - Poll training progress
- POST /admin/api/deploy-model/<run_id> - Deploy fine-tuned model
- POST /admin/api/rollback-model - Revert to base model
- GET /admin/api/export-model/<run_id> - Download model weights
UI Workflow
- Admin corrects categories on Submissions page (already working)
- Navigate to Training tab in admin panel
- View stats: "25 corrections collected (Ready to train!)"
- Click "Start Fine-Tuning" → Configure parameters → Confirm
- Progress bar shows: "Preparing data... Training... Evaluating..."
- Results displayed: "Accuracy: 87% (+12% improvement!)"
- Click "Deploy Model" to activate
- All future predictions use fine-tuned model
Performance Considerations
- Training Time: ~2-5 minutes for 20-50 examples (CPU)
- Memory: LoRA uses ~10% of full fine-tuning memory
- Storage: ~50MB per LoRA checkpoint
- Inference: Minimal overhead vs base model
Risk Mitigation
- Overfitting: Use validation set, early stopping
- Catastrophic Forgetting: Monitor all category metrics
- Bad Training Data: Quality validation before training
- Model Regression: Always compare to baseline, allow rollback
- Resource Limits: LoRA keeps training feasible on HF Spaces
Implementation Phases
- Phase 1 (Foundation): Database models + data collection (2-3 hours)
- Phase 2 (UI): Training dashboard + configuration (2-3 hours)
- Phase 3 (Core ML): Fine-tuning engine + LoRA (4-5 hours)
- Phase 4 (Deployment): Model management + versioning (2-3 hours)
- Phase 5 (QA): Validation + metrics (2-3 hours)
- Phase 6 (Polish): Export/import + documentation (1-2 hours)
Total Estimated Time: 13-19 hours
Questions for Clarification
- Training Infrastructure: Run on HF Spaces (CPU) or local machine (GPU)?
- Background Jobs: Use simple threading or prefer Celery/Redis?
- Model Hosting: Keep models in HF Spaces persistent storage or upload to HF Hub?
- Auto-training: Should system auto-train when threshold reached, or admin-triggered only?
- Notification: Email/webhook when training completes?
- Multi-model: Support multiple fine-tuned models simultaneously (A/B testing)?
Ready to proceed with implementation upon your approval!