Yash Sakhale

Initial commit: Python Dependency Compatibility Board with ML and LLM features

329b91e 28 days ago
ML Models Integration Guide

This document explains how to train and use the ML models for conflict prediction and package similarity.

Overview

The project includes two ML models:

Conflict Prediction Model: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
Package Embeddings: Pre-computed semantic embeddings for common Python packages for similarity matching

Training the Models

Step 1: Install Training Dependencies

pip install scikit-learn sentence-transformers numpy

Step 2: Train Conflict Prediction Model

cd "code to upload"
python train_conflict_model.py

This will:

Load the synthetic dataset (synthetic_requirements_dataset.json)
Extract features from requirements
Train a Random Forest classifier
Save the model to models/conflict_predictor.pkl
Display accuracy and feature importance

Expected Output:

Model size: ~2-5 MB
Test accuracy: ~85-95% (depending on dataset)
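
The feature extraction step above can be sketched roughly as follows. This is a minimal illustration, not the code in train_conflict_model.py: the function name extract_features and the four features it counts are assumptions chosen for the example, and the parser handles only simple requirements.txt lines.

```python
import re

# Matches a package name, optional "[extras]", then whatever specifier follows.
_REQ_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*(.*)$")


def extract_features(requirements_text: str) -> dict:
    """Turn requirements.txt-style text into a small numeric feature dict.

    Hypothetical feature set: the real training script may use different ones.
    """
    counts = {"num_packages": 0, "num_pinned": 0, "num_ranged": 0, "num_unpinned": 0}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):       # skip blanks and pip options
            continue
        match = _REQ_RE.match(line)
        if match is None:
            continue
        # Ignore environment markers such as '; python_version < "3.9"'.
        spec = match.group(2).split(";", 1)[0].strip()
        counts["num_packages"] += 1
        if spec.startswith("==") and "," not in spec:
            counts["num_pinned"] += 1      # single exact pin
        elif spec:
            counts["num_ranged"] += 1      # any other constraint, e.g. >=1.5,<2.0
        else:
            counts["num_unpinned"] += 1    # no version constraint at all
    return counts
```

A dict like this could be flattened into a fixed-order numeric vector before being passed to a classifier.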

Step 3: Generate Package Embeddings

python generate_embeddings.py

This will:

Load a sentence transformer model
Generate embeddings for common Python packages
Save embeddings to models/package_embeddings.json
Save model info to models/embedding_info.json

Expected Output:

Embeddings file: ~5-10 MB
Embedding dimension: 384
Number of packages: 100+
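
The save step above might look something like the sketch below. The on-disk layout (a JSON object mapping package names to lists of floats), the field names in embedding_info.json, and the save_embeddings helper itself are all assumptions for illustration; generate_embeddings.py may store things differently. The encoder is passed in as a callable so the sketch runs without downloading a model.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Sequence


def save_embeddings(
    packages: Iterable[str],
    encode: Callable[[str], Sequence[float]],
    out_dir: str = "models",
) -> dict:
    """Encode each package name and write the two JSON files.

    Assumed layout: {"package-name": [float, ...], ...}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # float() makes NumPy scalars JSON-serializable as well as plain floats.
    embeddings = {name: [float(x) for x in encode(name)] for name in packages}

    dims = {len(vec) for vec in embeddings.values()}
    if len(dims) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")

    info = {
        "num_packages": len(embeddings),
        "embedding_dim": dims.pop() if dims else 0,
    }
    (out / "package_embeddings.json").write_text(json.dumps(embeddings))
    (out / "embedding_info.json").write_text(json.dumps(info))
    return info
```

With sentence-transformers installed, the encode method of a loaded SentenceTransformer could be passed as the encode argument.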

Model Files Structure

After training, you should have:

code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata

Integration in Main App

The models are automatically loaded when available:

Conflict Prediction: Runs before detailed analysis to provide early warnings
Package Similarity: Enhances spell-checking with semantic matching

Features

Graceful Fallback: If models aren't available, the app works with rule-based methods
Lazy Loading: Models load only when needed
Error Handling: ML failures don't break the app
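
One way these three behaviours could fit together is sketched below. The LazyModel class and predict_with_fallback function are hypothetical names for this example, not the ones in ml_models.py, and the sketch assumes the model is stored as a pickle file.

```python
import pickle
from pathlib import Path
from typing import Any, Callable, Optional


class LazyModel:
    """Loads a pickled model on first use; yields None if that fails."""

    def __init__(self, path: str):
        self._path = Path(path)
        self._model: Optional[Any] = None
        self._tried = False

    def get(self) -> Optional[Any]:
        if not self._tried:          # lazy loading: touch the disk only once
            self._tried = True
            try:
                with self._path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except Exception:        # missing or corrupt file: stay on None
                self._model = None
        return self._model


def predict_with_fallback(
    lazy: LazyModel,
    requirements_text: str,
    ml_predict: Callable[[Any, str], Any],
    rule_based: Callable[[str], Any],
) -> Any:
    """Use the ML model when it loads and runs, else the rule-based path."""
    model = lazy.get()
    if model is not None:
        try:
            return ml_predict(model, requirements_text)
        except Exception:
            pass                     # error handling: ML failure is not fatal
    return rule_based(requirements_text)
```

Only unpickle files from a trusted source, since pickle.load can execute arbitrary code.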

Usage in Code

Conflict Prediction

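from ml_models import ConflictPredictor

# Example input: the contents of a requirements.txt file
requirements_text = "numpy==1.24.0\npandas>=1.5\n"

predictor = ConflictPredictor()
has_conflict, confidence = predictor.predict(requirements_text)

if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")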

Package Similarity

from ml_models import PackageEmbeddings

embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns: [('numpy', 0.95), ('scipy', 0.72), ...]

best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
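
For intuition, a similarity lookup of this kind can be sketched as a cosine-similarity ranking over the pre-computed vectors. This is an assumption about how such a lookup could work, not the actual PackageEmbeddings implementation. The function below takes an already-embedded query vector, so the on-the-fly embedding of the query string is left out.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(
    query_vec: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (package, score) pairs, highest score first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A linear scan like this is usually fine for a table of roughly a hundred packages; larger tables might call for a vectorized or indexed search.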

Hugging Face Spaces Deployment

Option 1: Include Models in Repo

Train models locally
Commit model files to the repo
Models load automatically on Spaces

Pros: Simple, no external dependencies
Cons: Larger repo size (roughly 7-15 MB, given the file sizes listed above)

Option 2: Upload to Hugging Face Hub

Train models locally

Upload to Hugging Face Hub:

from huggingface_hub import upload_file
upload_file(path_or_fileobj="models/conflict_predictor.pkl",
            path_in_repo="conflict_predictor.pkl",
            repo_id="your-username/conflict-predictor")

Load from Hub in app:

from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")

Pros: Smaller repo, version control for models
Cons: Requires an internet connection at startup unless the file is already in the local Hub cache
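
The two options can also be combined: prefer a file committed to the repo and download from the Hub only when it is missing. The helper below is a sketch under that assumption; resolve_model_path is a hypothetical name, and the repo id is the same placeholder used above.

```python
from pathlib import Path


def resolve_model_path(filename: str, repo_id: str, local_dir: str = "models") -> str:
    """Return a local path to the model file, downloading only if needed."""
    local = Path(local_dir) / filename
    if local.is_file():
        return str(local)  # Option 1: file shipped with the repo

    # Option 2: imported lazily so the app still starts without huggingface_hub
    # when the local file exists.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)
```

For example, resolve_model_path("conflict_predictor.pkl", "your-username/conflict-predictor") would return the path under models/ when that file is present.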

Performance

Conflict Prediction: <10ms per prediction
Embedding Lookup: <1ms (pre-computed) or ~50ms (on-the-fly)
Model Loading: ~1-2 seconds at startup
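
Figures like these depend on hardware, so it may be worth measuring them locally. A small timing helper, sketched here with the hypothetical name median_latency_ms, could be used for that:

```python
import statistics
import time
from typing import Any, Callable


def median_latency_ms(fn: Callable[[Any], Any], arg: Any, repeats: int = 100) -> float:
    """Call fn(arg) repeatedly and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Passing predictor.predict and a sample requirements string would give a rough per-prediction latency. The median is used because it is less sensitive than the mean to a slow first call.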

Troubleshooting

Models Not Loading

Check that the models/ directory exists
Verify model files are present
Check file permissions
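
The three checks above can be automated with a short diagnostic. The sketch below assumes the file names from the Model Files Structure section; check_model_files is a hypothetical helper, not part of the app.

```python
import os
from pathlib import Path
from typing import List

EXPECTED_FILES = (
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
)


def check_model_files(models_dir: str = "models") -> List[str]:
    """Return a list of human-readable problems; empty means all checks pass."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"missing directory: {root}"]

    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
    return problems
```

Running it from the directory that contains models/ and printing the returned list shows which check fails.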

Low Prediction Accuracy

Retrain with more data
Adjust feature engineering
Try different model parameters

Embeddings Not Working

Ensure sentence-transformers is installed
Check internet connection (for first-time model download)
Verify embeddings file format
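
To verify the embeddings file format, a check along these lines may help. It assumes the file is a JSON object mapping package names to equal-length lists of numbers, which is a guess at the layout rather than a documented format. The default dimension of 384 comes from the expected output above.

```python
import json
from typing import List


def validate_embeddings(path: str, expected_dim: int = 384) -> List[str]:
    """Return a list of format problems found in the embeddings file."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read {path}: {exc}"]

    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for name, vec in data.items():
        if not isinstance(vec, list) or not all(isinstance(x, (int, float)) for x in vec):
            problems.append(f"{name}: not a list of numbers")
        elif len(vec) != expected_dim:
            problems.append(f"{name}: dimension {len(vec)}, expected {expected_dim}")
    return problems
```

An empty list means every entry matched the assumed layout.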

Future Improvements

Train on a larger, real-world dataset
Add version-specific embeddings
Implement online learning
Add confidence intervals
Support custom model paths