---
title: Hopcroft Skill Classification
emoji: 🧠
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
api_docs_url: /docs
---

Hopcroft Skill Classification

Multi-label skill classification for GitHub issues and pull requests: automatically identify the technical skills required to resolve software issues using machine learning.


Overview

Hopcroft is an ML-enabled system that classifies GitHub issues into 217 technical skill categories, enabling automated developer assignment and optimized resource allocation. It is built following MLOps and software engineering practices.

Key Features

  • 🎯 Multi-label Classification: Predict multiple skills per issue
  • 🚀 REST API: FastAPI with Swagger documentation
  • 🖥️ Web Interface: Streamlit GUI for interactive predictions
  • 📊 Monitoring: Prometheus/Grafana dashboards with drift detection
  • 🔄 CI/CD: GitHub Actions with Docker deployment
  • 📈 Experiment Tracking: MLflow on DagsHub
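
To make "multiple skills per issue" concrete, here is a minimal sketch of how per-skill scores could be turned into a set of predicted labels. The function name `select_skills`, the 0.5 threshold, and the skill names are illustrative assumptions and are not taken from the Hopcroft code base.

```python
from typing import Dict, List


def select_skills(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every skill whose score meets the threshold, highest score first.

    Unlike single-label classification, no argmax is taken: zero, one, or
    many skills can be returned for the same issue.
    """
    chosen = [(skill, score) for skill, score in scores.items() if score >= threshold]
    # Sort by descending score, then by name so ties are deterministic.
    chosen.sort(key=lambda pair: (-pair[1], pair[0]))
    return [skill for skill, _ in chosen]


# Hypothetical scores for one issue (skill names are made up for illustration).
example_scores = {"Security": 0.91, "Networking": 0.62, "UI": 0.08}
print(select_skills(example_scores))  # ['Security', 'Networking']
```

A lower threshold trades precision for recall, which is the usual tuning knob in a multi-label setting.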

Architecture

graph TB
    subgraph "Data Layer"
        A[(SkillScope DB)] --> B[Feature Engineering]
        B --> C[TF-IDF / Embeddings]
    end
    
    subgraph "ML Pipeline"
        C --> D[Model Training]
        D --> E[(MLflow Tracking)]
        D --> F[Random Forest Model]
    end
    
    subgraph "Serving Layer"
        F --> G[FastAPI Service]
        G --> H[predict endpoint]
        G --> I[predictions endpoint]
        G --> J[health endpoint]
    end
    
    subgraph "Frontend"
        G --> K[Streamlit GUI]
    end
    
    subgraph "Monitoring"
        G --> L[Prometheus]
        L --> M[Grafana]
        N[Drift Detection] --> L
    end
    
    subgraph "Deployment"
        O[GitHub Actions] --> P[Docker Build]
        P --> Q[HF Spaces]
    end
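
The diagram shows a drift detection component feeding Prometheus, but this README does not say which statistical test is used. As one possible illustration, the sketch below computes a two-sample Kolmogorov-Smirnov statistic between a reference sample and a current sample of a numeric feature. The names `ks_statistic` and `drift_detected` and the 0.3 threshold are assumptions made for this example only.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of the two samples (0.0 to 1.0)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The gap between two step functions can only change at observed values,
    # so checking every observed value is enough to find the maximum.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def drift_detected(reference: Sequence[float], current: Sequence[float],
                   threshold: float = 0.3) -> bool:
    """Flag drift when the KS statistic exceeds an (assumed) threshold."""
    return ks_statistic(reference, current) > threshold


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value such as this could be exported as a gauge for Prometheus to scrape, although how the project actually wires drift results into monitoring is not described here.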

Documentation

| Document | Description |
|---|---|
| 📋 Milestone Summaries | All 6 project phases documented |
| 📖 User Guide | Setup, API, GUI, testing, monitoring |
| 🏗️ Design Choices | Technical decisions & rationale |
| 🎯 ML Canvas | Requirements engineering framework |
| ✅ Testing & Validation | QA strategy & results |
| 📊 Model Card | Model details & performance |
| 📊 Dataset Card | Dataset details & preprocessing |

Quick Start

Docker (Recommended)

# Clone and configure
git clone https://github.com/se4ai2526-uniba/Hopcroft.git
cd Hopcroft
cp .env.example .env
# Edit .env with your DagsHub credentials

# Start services
docker compose -f docker/docker-compose.yml up -d --build
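
Containers started with `up -d` may need a few seconds before the API answers. The sketch below polls the `/health` endpoint listed under API Endpoints until it returns HTTP 200. The helper names, the retry counts, and the `http://localhost:8080` base URL (borrowed from the curl example in this README) are assumptions; adjust them to your compose configuration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not healthy yet.
        return False


def wait_until_healthy(url: str,
                       probe: Callable[[str], bool] = http_ok,
                       attempts: int = 30,
                       delay: float = 1.0) -> bool:
    """Call the probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_until_healthy("http://localhost:8080/health")` returns `True` as soon as the service responds, or `False` after roughly 30 seconds. The `probe` parameter exists so the retry loop can be exercised without a running service.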

Access (Local):

Local Development

# Setup environment
python -m venv venv && source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt && pip install -e .

# Start API
make api-dev

# Start GUI (new terminal)
streamlit run hopcroft_skill_classification_tool_competition/streamlit_app.py

Project Structure

├── hopcroft_skill_classification_tool_competition/
│   ├── main.py              # FastAPI application
│   ├── streamlit_app.py     # Streamlit GUI
│   ├── features.py          # Feature engineering
│   ├── modeling/            # Training & prediction
│   └── config.py            # Configuration
├── data/                    # DVC-tracked datasets
├── models/                  # DVC-tracked models
├── tests/                   # Pytest test suites
├── monitoring/              # Prometheus, Grafana, Locust
├── docker/                  # Docker configurations
├── docs/                    # Documentation
└── .github/workflows/       # CI/CD pipelines

API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | /predict | Classify single issue |
| POST | /predict/batch | Batch classification |
| GET | /predictions | List recent predictions |
| GET | /predictions/{id} | Get by MLflow run ID |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |

Example:

curl -X POST "http://localhost:8080/predict" \
  -H "Content-Type: application/json" \
  -d '{"issue_text": "Fix OAuth2 authentication bug"}'
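
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above: the `issue_text` field and the `/predict` path come from that example, while the function names and the assumption that the response body is JSON are mine. The response schema is not documented in this README, so the result is returned as parsed without interpreting its fields.

```python
import json
import urllib.request


def build_predict_request(issue_text: str,
                          base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST /predict request carrying the issue text."""
    payload = json.dumps({"issue_text": issue_text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(issue_text: str,
            base_url: str = "http://localhost:8080",
            timeout: float = 10.0):
    """Send the request and return the decoded body (assumed to be JSON)."""
    request = build_predict_request(issue_text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `predict("Fix OAuth2 authentication bug")` sends the same payload as the curl command. Keeping request construction separate from sending makes the payload easy to inspect or unit test offline.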

Live Deployment


Development

# Run tests
make test-all              # All tests
make test-behavioral       # ML behavioral tests
make validate-deepchecks   # Data validation

# Lint & format
make lint                  # Check code style
make format                # Auto-fix issues

# Training
make train-baseline-tfidf  # Train baseline model

License

This project was developed as part of the SE4AI 2025-26 course at the University of Bari.