Yash Sakhale
Initial commit: Python Dependency Compatibility Board with ML and LLM features
329b91e

ML Models Integration Guide

This document explains how to train and use the ML models for conflict prediction and package similarity.

Overview

The project includes two ML models:

  1. Conflict Prediction Model: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
  2. Package Embeddings: Pre-computed semantic embeddings for common Python packages for similarity matching

Training the Models

Step 1: Install Training Dependencies

pip install scikit-learn sentence-transformers numpy

Step 2: Train Conflict Prediction Model

cd "code to upload"
python train_conflict_model.py

This will:

  • Load the synthetic dataset (synthetic_requirements_dataset.json)
  • Extract features from requirements
  • Train a Random Forest classifier
  • Save the model to models/conflict_predictor.pkl
  • Display accuracy and feature importance

Expected Output:

  • Model size: ~2-5 MB
  • Test accuracy: ~85-95% (depending on dataset)
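
The feature-extraction step listed above might look roughly like the sketch below. It is an illustration only, not the code in train_conflict_model.py: the function name `extract_features` and the five features it computes (package count, pinned, bounded, unversioned, duplicates) are assumptions.

```python
import re
from typing import List

# Hypothetical feature extractor; the real training script may use different features.
_SPEC = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def extract_features(requirements_text: str) -> List[float]:
    """Turn the text of a requirements file into a fixed-length numeric vector."""
    names = []
    pinned = bounded = unversioned = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        match = _SPEC.match(line)
        if match is None:
            continue
        name, spec = match.group(1).lower(), match.group(2)
        names.append(name)
        if "==" in spec:
            pinned += 1          # exact pin, e.g. numpy==1.24.0
        elif any(op in spec for op in ("<", ">", "~=", "!=")):
            bounded += 1         # range constraint, e.g. pandas>=2.0
        else:
            unversioned += 1     # bare package name
    duplicates = len(names) - len(set(names))
    return [float(len(names)), float(pinned), float(bounded),
            float(unversioned), float(duplicates)]


features = extract_features("numpy==1.24.0\npandas>=2.0\nrequests\nnumpy<1.20  # duplicate\n")
print(features)
```

Vectors of this shape, one per requirements file, are the kind of input a scikit-learn `RandomForestClassifier` accepts in `fit(X, y)`, with `y` holding the conflict labels from the dataset.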

Step 3: Generate Package Embeddings

python generate_embeddings.py

This will:

  • Load a sentence transformer model
  • Generate embeddings for common Python packages
  • Save embeddings to models/package_embeddings.json
  • Save model info to models/embedding_info.json

Expected Output:

  • Embeddings file: ~5-10 MB
  • Embedding dimension: 384
  • Number of packages: 100 or more

Model Files Structure

After training, you should have:

code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata

Integration in Main App

The models are automatically loaded when available:

  1. Conflict Prediction: Runs before detailed analysis to provide early warnings
  2. Package Similarity: Enhances spell-checking with semantic matching

Features

  • Graceful Fallback: If models aren't available, the app works with rule-based methods
  • Lazy Loading: Models load only when needed
  • Error Handling: ML failures don't break the app
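
The three behaviours above can be combined in one small wrapper. The sketch below is one possible shape, not the app's actual code: `LazyModel`, `rule_based_check`, and the assumption that the unpickled model exposes `predict(text)` are all hypothetical.

```python
import pickle
from pathlib import Path


class LazyModel:
    """Load a pickled model on first use and stay usable if loading fails."""

    def __init__(self, path):
        self.path = Path(path)
        self._model = None
        self._attempted = False

    def get(self):
        # Lazy loading: the file is read only on the first call.
        if not self._attempted:
            self._attempted = True
            try:
                with self.path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except Exception as exc:
                # Error handling: report the failure and keep the app running.
                print(f"ML model unavailable ({exc}); using rule-based checks")
        return self._model


def rule_based_check(requirements_text):
    # Placeholder rule (an assumption): flag a package that is listed twice.
    names = [line.split("==")[0].strip().lower()
             for line in requirements_text.splitlines() if line.strip()]
    return len(names) != len(set(names))


def has_conflict(requirements_text, lazy_model):
    model = lazy_model.get()
    if model is None:
        # Graceful fallback: no model available, so use the rules.
        return rule_based_check(requirements_text)
    return bool(model.predict(requirements_text))


missing = LazyModel("models/does_not_exist.pkl")
print(has_conflict("numpy==1.24.0\nnumpy==1.19.0", missing))
```

Because `pickle.load` can execute arbitrary code, a wrapper like this should only be pointed at model files you trust.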

Usage in Code

Conflict Prediction

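from ml_models import ConflictPredictor

# Example input: the raw contents of a requirements.txt file
requirements_text = "pandas==2.0.0\nnumpy==1.21.0"

predictor = ConflictPredictor()
# predict() returns a pair: a conflict flag and a confidence between 0 and 1
has_conflict, confidence = predictor.predict(requirements_text)

if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")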

Package Similarity

from ml_models import PackageEmbeddings

embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns (name, score) pairs, for example: [('numpy', 0.95), ('scipy', 0.72), ...]

best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
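
Lookups over pre-computed vectors are commonly ranked by cosine similarity. The sketch below shows that ranking in isolation. It assumes the embeddings are available as a dict mapping package name to vector and that the query has already been embedded; the internals of `PackageEmbeddings` may differ, and the toy vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query_vector, embeddings, top_k=3):
    """Return the top_k (name, score) pairs, highest score first."""
    scored = [(name, cosine_similarity(query_vector, vec))
              for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors for illustration; the real embeddings have 384 dimensions.
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
result = find_similar(query, toy_embeddings, top_k=2)
print(result)
```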

Hugging Face Spaces Deployment

Option 1: Include Models in Repo

  1. Train models locally
  2. Commit model files to the repo
  3. Models load automatically on Spaces

Pros: Simple, no external dependencies
Cons: Larger repo size (~10-15 MB)

Option 2: Upload to Hugging Face Hub

  1. Train models locally
  2. Upload to Hugging Face Hub:
    from huggingface_hub import upload_file
    upload_file(path_or_fileobj="models/conflict_predictor.pkl", path_in_repo="conflict_predictor.pkl", repo_id="your-username/conflict-predictor")
    
  3. Load from Hub in app:
    from huggingface_hub import hf_hub_download
    model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
    

Pros: Smaller repo, version control for models
Cons: Requires internet connection at startup
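
The two options can also be combined: use a file committed to the repo when it exists and try the Hub otherwise. The helper below is a hypothetical sketch of that idea. `resolve_model_path` is not part of the project, and the downloader is passed in as a callable so the example runs offline.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_model_path(local_path: str,
                       download: Optional[Callable[[], str]] = None) -> Optional[Path]:
    """Prefer a file committed to the repo; otherwise try a download callable."""
    local = Path(local_path)
    if local.is_file():
        return local  # Option 1: the model ships with the repo
    if download is None:
        return None
    try:
        return Path(download())  # Option 2: e.g. a wrapper around hf_hub_download
    except Exception as exc:
        # Offline or Hub error: report it and let the caller fall back
        print(f"Download failed: {exc}")
        return None


# Stub downloader standing in for the Hub call (an assumption for this sketch).
def fake_download() -> str:
    return "cache/conflict_predictor.pkl"


print(resolve_model_path("models/does_not_exist.pkl", download=fake_download))
```

In the app, the callable could be `lambda: hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")`, matching the call shown in Option 2.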

Performance

  • Conflict Prediction: <10ms per prediction
  • Embedding Lookup: <1ms (pre-computed) or ~50ms (on-the-fly)
  • Model Loading: ~1-2 seconds at startup
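
Figures like these depend on hardware and dataset, so it can be worth measuring them locally. The following sketch averages the latency of any callable over repeated calls. `average_latency_ms` and `dummy_predict` are illustrative names, and the dummy function only stands in for `predictor.predict`.

```python
import time


def average_latency_ms(fn, arg, repeats=1000):
    """Average wall-clock time of fn(arg) in milliseconds."""
    fn(arg)  # warm-up call so one-time setup (e.g. lazy loading) is not counted
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) * 1000 / repeats


# Stand-in for predictor.predict; swap in the real callable to measure it.
def dummy_predict(text):
    return len(text.splitlines()) > 1, 0.5


print(f"{average_latency_ms(dummy_predict, 'numpy==1.24.0'):.4f} ms per call")
```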

Troubleshooting

Models Not Loading

  • Check that models/ directory exists
  • Verify model files are present
  • Check file permissions
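
The three checks above can be scripted. This is a minimal sketch that assumes the models/ layout shown under "Model Files Structure"; the function name `diagnose_models` is hypothetical.

```python
import os
from pathlib import Path

# File names taken from the model files structure in this guide.
EXPECTED_FILES = ("conflict_predictor.pkl", "package_embeddings.json", "embedding_info.json")


def diagnose_models(models_dir="models"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"missing directory: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
    return problems


for problem in diagnose_models("models"):
    print(problem)
```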

Low Prediction Accuracy

  • Retrain with more data
  • Adjust feature engineering
  • Try different model parameters

Embeddings Not Working

  • Ensure sentence-transformers is installed
  • Check internet connection (for first-time model download)
  • Verify embeddings file format
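
To verify the embeddings file, a quick structural check can help. The sketch below assumes package_embeddings.json is a JSON object mapping each package name to a list of 384 numbers. That layout is a guess based on the dimension stated earlier, so adjust it to whatever generate_embeddings.py actually writes.

```python
import json


def check_embeddings(path, expected_dim=384):
    """Return a list of format problems found in an embeddings JSON file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = []
    for name, vector in data.items():
        if not isinstance(vector, list) or len(vector) != expected_dim:
            problems.append(f"{name}: expected a list of {expected_dim} numbers")
        elif not all(isinstance(x, (int, float)) and not isinstance(x, bool)
                     for x in vector):
            problems.append(f"{name}: contains non-numeric values")
    return problems


# Usage (assumed path): print(check_embeddings("models/package_embeddings.json"))
```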

Future Improvements

  • Train on a larger, real-world dataset
  • Add version-specific embeddings
  • Implement online learning
  • Add confidence intervals to predictions
  • Support custom model paths