ML Models Integration Guide
This document explains how to train and use the ML models for conflict prediction and package similarity.
Overview
The project includes two ML models:
- Conflict Prediction Model: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
- Package Embeddings: Pre-computed semantic embeddings of common Python packages, used for similarity matching
Training the Models
Step 1: Install Training Dependencies
pip install scikit-learn sentence-transformers numpy
Step 2: Train Conflict Prediction Model
cd "code to upload"
python train_conflict_model.py
This will:
- Load the synthetic dataset (synthetic_requirements_dataset.json)
- Extract features from requirements
- Train a Random Forest classifier
- Save the model to models/conflict_predictor.pkl
- Display accuracy and feature importance
Expected Output:
- Model size: ~2-5 MB
- Test accuracy: ~85-95% (depending on dataset)
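The feature-extraction step is not spelled out in this guide. As a rough illustration only, features of this kind could be computed with the standard library. The extract_features helper and the feature names below are hypothetical and may differ from what train_conflict_model.py actually does.

```python
import re

# Hypothetical parser: package name, optional [extras], then the version spec.
_SPEC = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*(.*)$")


def extract_features(requirements_text):
    """Turn requirements.txt content into a flat dict of numeric features."""
    names = []
    pinned = ranged = unversioned = 0
    for raw in requirements_text.splitlines():
        # Drop comments and environment markers (a simplification).
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _SPEC.match(line)
        if not match:
            continue
        name, spec = match.groups()
        names.append(name.lower().replace("_", "-"))
        if "==" in spec:
            pinned += 1
        elif spec.strip():
            ranged += 1
        else:
            unversioned += 1
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unversioned": unversioned,
        "num_duplicates": len(names) - len(set(names)),
    }
```

A dict like this would still need to be converted to a fixed-order numeric vector before being passed to a scikit-learn classifier.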
Step 3: Generate Package Embeddings
python generate_embeddings.py
This will:
- Load a sentence transformer model
- Generate embeddings for common Python packages
- Save embeddings to models/package_embeddings.json
- Save model info to models/embedding_info.json
Expected Output:
- Embeddings file: ~5-10 MB
- Embedding dimension: 384
- Number of packages: ~100+
Model Files Structure
After training, you should have:
code to upload/
├── models/
│   ├── conflict_predictor.pkl    # Classification model
│   ├── package_embeddings.json   # Pre-computed embeddings
│   └── embedding_info.json       # Model metadata
Integration in Main App
The models are automatically loaded when available:
- Conflict Prediction: Runs before detailed analysis to provide early warnings
- Package Similarity: Enhances spell-checking with semantic matching
Features
- Graceful Fallback: If models aren't available, the app works with rule-based methods
- Lazy Loading: Models load only when needed
- Error Handling: ML failures don't break the app
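The three properties above can be combined in one small wrapper. The following is a minimal sketch of the pattern, not the app's actual implementation: LazyPredictor, its loader argument, and its fallback argument are hypothetical names introduced for illustration.

```python
class LazyPredictor:
    """Load a model on first use and fall back to a rule-based function."""

    def __init__(self, loader, fallback):
        self._loader = loader        # callable returning a predict function
        self._fallback = fallback    # rule-based function, always available
        self._predict = None
        self._load_attempted = False

    def __call__(self, requirements_text):
        # Lazy loading: the loader runs once, on the first call.
        if not self._load_attempted:
            self._load_attempted = True
            try:
                self._predict = self._loader()
            except Exception:
                # Graceful fallback: a missing or corrupt model is not fatal.
                self._predict = None
        if self._predict is not None:
            try:
                return self._predict(requirements_text)
            except Exception:
                # Error handling: a failing model falls through to the rules.
                pass
        return self._fallback(requirements_text)
```

Catching Exception broadly is deliberate in this sketch, since the goal is that no ML failure reaches the user. A real app would probably log the error before falling back.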
Usage in Code
Conflict Prediction
from ml_models import ConflictPredictor
predictor = ConflictPredictor()
has_conflict, confidence = predictor.predict(requirements_text)
if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")
Package Similarity
from ml_models import PackageEmbeddings
embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns: [('numpy', 0.95), ('scipy', 0.72), ...]
best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
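For intuition, a lookup over pre-computed embeddings typically ranks stored vectors by cosine similarity to a query vector. The sketch below assumes that approach and uses toy 3-dimensional vectors. The rank_by_similarity function is hypothetical and takes an already-embedded query, whereas PackageEmbeddings.find_similar accepts a package name string.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, embeddings, top_k=3):
    """Return the top_k (name, score) pairs, highest score first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data for illustration only; real embeddings have 384 dimensions.
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
top_two = rank_by_similarity([1.0, 0.1, 0.0], toy_embeddings, top_k=2)
```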
Hugging Face Spaces Deployment
Option 1: Include Models in Repo
- Train models locally
- Commit model files to the repo
- Models load automatically on Spaces
Pros: Simple, no external dependencies
Cons: Larger repo size (~10-15 MB)
Option 2: Upload to Hugging Face Hub
- Train models locally
- Upload to Hugging Face Hub:
from huggingface_hub import upload_file
# upload_file takes keyword-only arguments; path_in_repo is the filename on the Hub
upload_file(path_or_fileobj="models/conflict_predictor.pkl", path_in_repo="conflict_predictor.pkl", repo_id="your-username/conflict-predictor")
- Load from Hub in app:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
Pros: Smaller repo, version control for models
Cons: Requires internet connection at startup
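The two options can also be combined so that a missing network connection does not block startup. One possible sketch prefers a local copy and only then attempts a download. The resolve_model_path helper and its download parameter are hypothetical and not part of the app.

```python
from pathlib import Path


def resolve_model_path(local_path, download):
    """Return a usable model path, or None if neither source works.

    local_path: where a committed or cached model file would live.
    download:   zero-argument callable that fetches the model and
                returns its path (for example a wrapper around
                hf_hub_download).
    """
    local = Path(local_path)
    if local.is_file():
        return str(local)
    try:
        return download()
    except Exception:
        # No network, bad repo id, etc.: the caller can fall back to rules.
        return None
```

In the app, download could be something like lambda: hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl").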
Performance
- Conflict Prediction: <10ms per prediction
- Embedding Lookup: <1ms (pre-computed) or ~50ms (on-the-fly)
- Model Loading: ~1-2 seconds at startup
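These figures depend on hardware and model size, so it is worth measuring them in your own environment. A simple way to do that, sketched here with a hypothetical average_latency_ms helper, is to time repeated calls with time.perf_counter.

```python
import time


def average_latency_ms(fn, *args, repeats=100):
    """Call fn(*args) `repeats` times and return the mean latency in ms."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / repeats
```

For example, average_latency_ms(predictor.predict, requirements_text) would give the mean prediction time for one input, once the model has already been loaded.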
Troubleshooting
Models Not Loading
- Check that the models/ directory exists
- Verify model files are present
- Check file permissions
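The checks above can be scripted. The sketch below assumes the three filenames listed under Model Files Structure. The diagnose_models helper is hypothetical.

```python
import os
from pathlib import Path

EXPECTED_FILES = (
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
)


def diagnose_models(models_dir="models"):
    """Return a list of human-readable problems (empty if all looks fine)."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory {root} does not exist"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"{name} is missing")
        elif not os.access(path, os.R_OK):
            problems.append(f"{name} is not readable")
    return problems
```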
Low Prediction Accuracy
- Retrain with more data
- Adjust feature engineering
- Try different model parameters
Embeddings Not Working
- Ensure sentence-transformers is installed
- Check internet connection (for first-time model download)
- Verify embeddings file format
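To check the embeddings file, a quick script can confirm that every vector has the expected dimension. This guide does not document the file layout, so the sketch below assumes a JSON object mapping package names to lists of numbers. Adjust it if generate_embeddings.py writes a different structure. The validate_embeddings helper is hypothetical.

```python
import json


def validate_embeddings(path, expected_dim=384):
    """Return the number of packages, or raise ValueError on a bad file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumed layout: {"package-name": [float, float, ...], ...}
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of package name -> vector")
    bad = [
        name
        for name, vec in data.items()
        if not isinstance(vec, list) or len(vec) != expected_dim
    ]
    if bad:
        raise ValueError(
            f"{len(bad)} entries are not {expected_dim}-dimensional, "
            f"e.g. {bad[0]!r}"
        )
    return len(data)
```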
Future Improvements
- Train on a larger, real-world dataset
- Add version-specific embeddings
- Implement online learning
- Add confidence intervals
- Support custom model paths