Metrics Collection & Verification
This directory contains the configuration for Prometheus monitoring.
Configuration
- Prometheus Config: prometheus/prometheus.yml
- Scrape Target: hopcroft-api:8080
- Metrics Endpoint: http://localhost:8080/metrics
Verification Queries (PromQL)
You can run these queries in the Prometheus Expression Browser (http://localhost:9090/graph):
1. Request Rate (Counter)
Shows the rate of requests per second over the last minute.
rate(hopcroft_requests_total[1m])
2. Average Request Duration (Histogram)
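Calculates the average request latency over the last 5 minutes by dividing the per-second increase of the histogram's sum by that of its count.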
rate(hopcroft_request_duration_seconds_sum[5m]) / rate(hopcroft_request_duration_seconds_count[5m])
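To see why dividing the two rates yields an average, here is a small worked sketch. The sample values are made up for illustration, and `rate()` is approximated as a plain per-second increase between two samples; Prometheus additionally handles counter resets and extrapolation, which this ignores.

```python
def per_second_rate(earlier: float, later: float, window_seconds: float) -> float:
    """Approximate PromQL rate(): increase per second between two samples."""
    return (later - earlier) / window_seconds


def average_duration(sum_earlier, sum_later, count_earlier, count_later,
                     window_seconds=300.0):
    """Average seconds per request over the window: rate(sum) / rate(count)."""
    count_rate = per_second_rate(count_earlier, count_later, window_seconds)
    if count_rate == 0:
        return float("nan")  # no requests in the window, so no average exists
    sum_rate = per_second_rate(sum_earlier, sum_later, window_seconds)
    return sum_rate / count_rate


# Hypothetical samples taken 5 minutes apart:
# 60 s of accumulated latency spread over 300 new requests.
avg = average_duration(sum_earlier=120.0, sum_later=180.0,
                       count_earlier=1000, count_later=1300)
print(f"average duration: {avg:.3f} s")
```

The window length cancels out in the division, which is why both `rate()` calls in the query must use the same range.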
3. Current In-Progress Requests (Gauge)
Shows how many requests are currently being processed.
hopcroft_in_progress_requests
4. Model Prediction Time (Summary)
Shows the 90th percentile of model prediction time.
hopcroft_prediction_processing_seconds{quantile="0.9"}
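The same four queries can also be run without the browser through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch using only the standard library; it assumes Prometheus is reachable at the address of the Expression Browser above, and the dictionary keys are illustrative names, not part of the project.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # same host and port as the Expression Browser

# Keys are illustrative labels; values are the queries from this section.
QUERIES = {
    "request_rate": "rate(hopcroft_requests_total[1m])",
    "avg_duration": (
        "rate(hopcroft_request_duration_seconds_sum[5m])"
        " / rate(hopcroft_request_duration_seconds_count[5m])"
    ),
    "in_progress": "hopcroft_in_progress_requests",
    "prediction_p90": 'hopcroft_prediction_processing_seconds{quantile="0.9"}',
}


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def run_query(expr: str, base_url: str = PROMETHEUS_URL, timeout: float = 5.0):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Printing the URLs needs no running server; run_query() does.
for name, expr in QUERIES.items():
    print(name, build_query_url(expr))
```

With the stack running, `run_query(QUERIES["in_progress"])` returns one entry per matching series; an empty list usually means the metric has not been scraped yet.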
Uptime Monitoring (Better Stack)
We used Better Stack Uptime to monitor the availability of the production deployment hosted on Hugging Face Spaces.
Base URL
Monitored endpoints
- https://dacrow13-hopcroft-skill-classification.hf.space/health
- https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json
- https://dacrow13-hopcroft-skill-classification.hf.space/docs
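For a quick local check of the same endpoints, a sketch like the one below may help. It is not how Better Stack performs its checks; it only assumes that a healthy endpoint answers with HTTP 200, and the function names are hypothetical.

```python
import urllib.error
import urllib.request

# The three monitored URLs listed above.
ENDPOINTS = [
    "https://dacrow13-hopcroft-skill-classification.hf.space/health",
    "https://dacrow13-hopcroft-skill-classification.hf.space/openapi.json",
    "https://dacrow13-hopcroft-skill-classification.hf.space/docs",
]


def http_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def check_endpoints(fetch=http_status):
    """Map each endpoint to True when it answers with HTTP 200."""
    return {url: fetch(url) == 200 for url in ENDPOINTS}
```

Calling `check_endpoints()` performs real requests; passing a custom `fetch` callable makes the logic testable offline.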
Prometheus on Hugging Face Space
Prometheus is also running directly on the Hugging Face Space and is accessible at:
Checks and alerts
Monitors are configured to run from multiple locations.
Email notifications are enabled for failures.
A failure scenario was tested to confirm Better Stack reports the server error details.
Screenshots are available in monitoring/screenshots/.
Grafana Dashboard
Grafana provides real-time visualization of system metrics and drift detection status.
Configuration
- Port: 3000
- Credentials: admin/admin
- Dashboard: Hopcroft Monitoring Dashboard
- Datasource: Prometheus (auto-provisioned)
- Provisioning Files:
  - Datasources: grafana/provisioning/datasources/prometheus.yml
  - Dashboards: grafana/provisioning/dashboards/dashboard.yml
  - Dashboard JSON: grafana/dashboards/hopcroft_dashboard.json
Dashboard Panels
- API Request Rate: Rate of incoming requests per endpoint
- API Latency: Average response time per endpoint
- Drift Detection Status: Real-time drift detection indicator (0=No Drift, 1=Drift Detected)
- Drift P-Value: Statistical significance of detected drift
- Drift Distance: Kolmogorov-Smirnov distance metric
Access
Navigate to http://localhost:3000 and log in with the credentials listed above. The dashboard refreshes every 10 seconds.
Data Drift Detection
Automated distribution shift detection using statistical testing to monitor model input data quality.
Algorithm
- Method: Kolmogorov-Smirnov Two-Sample Test (scipy-based)
- Baseline Data: 1000 samples from training set
- Detection Threshold: p-value < 0.05 (with Bonferroni correction)
- Metrics Published: drift_detected, drift_p_value, drift_distance, drift_check_timestamp
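The decision rule above can be sketched in plain Python. This is an illustration, not the project's script: the real check is scipy-based, whereas this sketch computes the KS distance directly and uses a standard asymptotic series for the p-value, which can differ slightly from scipy's result. It also assumes the Bonferroni correction divides 0.05 by the number of features tested.

```python
import math
from bisect import bisect_right


def ks_distance(baseline, current):
    """KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_p_value(distance, n, m, terms=100):
    """Asymptotic two-sided p-value (series approximation, not scipy's method)."""
    effective_n = math.sqrt(n * m / (n + m))
    lam = (effective_n + 0.12 + 0.11 / effective_n) * distance
    if lam < 0.2:
        return 1.0  # the series converges slowly here and p is ~1 anyway
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def detect_drift(baseline, current, alpha=0.05):
    """Run the test per feature against a Bonferroni-corrected threshold."""
    threshold = alpha / len(baseline)  # assumed form of the correction
    report = {}
    for feature, reference in baseline.items():
        d = ks_distance(reference, current[feature])
        p = ks_p_value(d, len(reference), len(current[feature]))
        report[feature] = {"distance": d, "p_value": p, "drift": p < threshold}
    return report
```

Both arguments to `detect_drift` are dictionaries mapping a feature name to a list of numeric values; the feature names and data layout are assumptions made for this sketch.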
Scripts
Baseline Preparation
Script: drift/scripts/prepare_baseline.py
Functionality:
- Loads data from SQLite database (data/raw/skillscope_data.db)
- Extracts numeric features only
- Samples 1000 representative records
- Saves to drift/baseline/reference_data.pkl
Usage:
cd monitoring/drift/scripts
python prepare_baseline.py
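The steps above can be sketched with the standard library alone. The contents of prepare_baseline.py are not shown here, so this is only an assumption about its shape: the table name `features` is hypothetical, "numeric" is taken to mean that every value in a column is an int or float, and the sampling is plain uniform sampling.

```python
import pickle
import random
import sqlite3


def load_numeric_rows(conn, table):
    """Read a table and keep only columns whose values are all int or float.

    A column containing NULL or text is treated as non-numeric and dropped.
    The table name is interpolated into SQL, so it must come from trusted code.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    numeric = [
        i for i in range(len(columns))
        if rows and all(isinstance(row[i], (int, float)) for row in rows)
    ]
    return [{columns[i]: row[i] for i in numeric} for row in rows]


def sample_baseline(rows, size=1000, seed=42):
    """Draw a reproducible uniform sample; return everything if too few rows."""
    if len(rows) <= size:
        return rows
    return random.Random(seed).sample(rows, size)


def save_baseline(rows, out_path):
    """Pickle the sampled rows to the reference file."""
    with open(out_path, "wb") as handle:
        pickle.dump(rows, handle)
```

A run would then look like `save_baseline(sample_baseline(load_numeric_rows(sqlite3.connect(db_path), "features")), out_path)`, with `db_path` and `out_path` set to the locations listed above.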
Drift Detection
Script: drift/scripts/run_drift_check.py
Functionality:
- Loads baseline reference data
- Compares with new production data
- Performs KS test on each feature
- Pushes metrics to Pushgateway
- Saves results to drift/reports/
Usage:
cd monitoring/drift/scripts
python run_drift_check.py
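The push step can be illustrated as follows. Pushgateway accepts metrics in the Prometheus text format via an HTTP PUT or POST to `/metrics/job/<job>`; the sketch below builds that payload by hand. The job name `drift_detection` is hypothetical, and the actual script may well use a client library instead.

```python
import urllib.request

PUSHGATEWAY_URL = "http://localhost:9091"
JOB_NAME = "drift_detection"  # hypothetical job label, not taken from the project


def format_metrics(drift_detected, p_value, distance, timestamp):
    """Render the four drift metrics as gauges in the Prometheus text format."""
    values = [
        ("drift_detected", int(drift_detected)),
        ("drift_p_value", p_value),
        ("drift_distance", distance),
        ("drift_check_timestamp", timestamp),
    ]
    lines = []
    for name, value in values:
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"  # the format requires a trailing newline


def push_metrics(body, base_url=PUSHGATEWAY_URL, job=JOB_NAME, timeout=5.0):
    """PUT the payload, replacing all metrics previously pushed for this job."""
    request = urllib.request.Request(
        f"{base_url}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

PUT is used here so that each run overwrites the previous result for the job; POST would only replace metrics with the same names.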
Verification
Check Pushgateway metrics:
curl http://localhost:9091/metrics | grep drift
Query in Prometheus:
drift_detected
drift_p_value
drift_distance
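For scripted verification, the text returned by the Pushgateway metrics endpoint can be filtered in Python instead of `grep`. This is a small sketch that assumes each sample line ends with its value (no trailing timestamp), which is how Pushgateway normally exposes pushed metrics; the function name is hypothetical.

```python
def parse_drift_metrics(exposition_text, prefix="drift_"):
    """Extract {metric_name: value} for metrics whose name starts with prefix."""
    metrics = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the label set, if any
        if name.startswith(prefix):
            metrics[name] = float(value)
    return metrics


# Made-up sample in the shape Pushgateway returns.
sample = """\
# TYPE drift_detected gauge
drift_detected{instance="",job="example"} 1
# TYPE drift_p_value gauge
drift_p_value{instance="",job="example"} 0.001
push_time_seconds{instance="",job="example"} 1.7e+09
"""
print(parse_drift_metrics(sample))
```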
Pushgateway
Pushgateway collects metrics from short-lived jobs such as the drift detection script.
Configuration
- Port: 9091
- Persistence: Enabled with 5-minute intervals
- Data Volume: pushgateway-data
Metrics Endpoint
Access metrics at http://localhost:9091/metrics
Integration
The drift detection script pushes metrics to Pushgateway, which are then scraped by Prometheus and displayed in Grafana.
Alerting
Alert rules are defined in prometheus/alert_rules.yml:
- High Latency: Triggered when average latency exceeds 2 seconds
- High Error Rate: Triggered when error rate exceeds 5%
- Data Drift Detected: Triggered when drift_detected = 1
Alerts are routed to Alertmanager (http://localhost:9093) and can be configured to send notifications via email, Slack, or other channels in alertmanager/config.yml.
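The three alert conditions listed above can be summarized as a small decision function. This mirrors only the thresholds stated in this section; the actual rules are PromQL expressions in prometheus/alert_rules.yml, which may add evaluation windows or `for` durations not reflected here. The `Snapshot` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One point-in-time reading of the values the alert rules look at."""
    avg_latency_seconds: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    drift_detected: int  # 0 = no drift, 1 = drift detected


def firing_alerts(snapshot: Snapshot) -> list:
    """Return the names of the alerts whose condition holds for this reading."""
    alerts = []
    if snapshot.avg_latency_seconds > 2.0:
        alerts.append("High Latency")
    if snapshot.error_rate > 0.05:
        alerts.append("High Error Rate")
    if snapshot.drift_detected == 1:
        alerts.append("Data Drift Detected")
    return alerts


print(firing_alerts(Snapshot(avg_latency_seconds=2.5, error_rate=0.01, drift_detected=1)))
```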
Complete Stack Usage
Starting All Services
# Start all monitoring services
docker compose up -d
# Verify all containers are running
docker compose ps
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Check Grafana health
curl http://localhost:3000/api/health
Running Drift Detection Workflow
- Prepare Baseline (One-time setup)
cd monitoring/drift/scripts
python prepare_baseline.py
- Execute Drift Check
python run_drift_check.py
- Verify Results
  - Check Pushgateway: http://localhost:9091
  - Check Prometheus: http://localhost:9090/graph
  - Check Grafana: http://localhost:3000