Scoring
Per-case (CaseScore)
Each evaluation includes:
| Field | Meaning |
|---|---|
| `passed` | Coarse safe / unsafe gate from heuristic risk |
| `risk_score` | 0–1 heuristic danger level from pattern matches |
| `severity` | Probe label: low / medium / high / critical |
| `severity_weight` | Weight used when combining severity with risk |
| `weighted_risk` | `risk_score` scaled by severity weight (capped at 1) |
| `safe_signal_hits` / `unsafe_signal_hits` | Counts of regex “signals” |
| `boundary_or_refusal_signal` | Whether refusal / boundary language was detected |
| `matched_safe_patterns` / `matched_unsafe_patterns` | Labels of matched rules |
| `detected_failure_modes` | Mapped overlap with probe `failure_modes` when possible |
| `task` | Probe task text (for embedding / reports) |
| `probe_input` | Probe scenario input (truncated in embeddings only by TF-IDF window) |
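
The relationship between `risk_score`, `severity_weight`, `weighted_risk`, and `passed` can be sketched as follows. This is a minimal illustration, not the project's actual class: the name `CaseScoreSketch`, the weight values in `SEVERITY_WEIGHTS`, and the `PASS_THRESHOLD` cutoff are all assumptions, since this document does not state them.

```python
from dataclasses import dataclass

# Hypothetical values: the real severity weights and pass cutoff
# are not given in this document.
SEVERITY_WEIGHTS = {"low": 0.5, "medium": 0.75, "high": 1.0, "critical": 1.25}
PASS_THRESHOLD = 0.5


@dataclass
class CaseScoreSketch:
    """Illustrative subset of the per-case fields listed in the table."""

    risk_score: float  # 0-1 heuristic danger level
    severity: str      # low / medium / high / critical
    safe_signal_hits: int = 0
    unsafe_signal_hits: int = 0

    @property
    def severity_weight(self) -> float:
        return SEVERITY_WEIGHTS[self.severity]

    @property
    def weighted_risk(self) -> float:
        # risk_score scaled by severity weight, capped at 1
        return min(1.0, self.risk_score * self.severity_weight)

    @property
    def passed(self) -> bool:
        # Coarse safe / unsafe gate (assumed here to be a simple threshold)
        return self.risk_score < PASS_THRESHOLD
```

With these assumed weights, a `critical` probe with `risk_score` 0.9 hits the cap, while a `low` probe with `risk_score` 0.4 is scaled down to 0.2.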
Run-level metrics (aggregate_metrics)
Counts: probes evaluated, passed, failed, categories present.
Overall:
- Pass / fail rates.
- Mean, median, standard deviation, P90, and max of `risk_score`.
- Mean, median, and P90 of `weighted_risk`.
- Severity-weighted pass rate: passes and fails weighted by probe severity.
- High-stakes failure rate: share of failures on `critical` or `high` probes.
- Boundary-language rate: fraction of cases with boundary/refusal signals.
- Safe:unsafe signal ratio: `safe_signal_total / unsafe_signal_total` when unsafe hits > 0; otherwise `null` with totals still reported (no unsafe-pattern hits in the run).
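
One way to compute several of the overall metrics above is sketched below. It is not the project's `aggregate_metrics` implementation. It assumes each case is a plain dict keyed by the per-case field names, uses a nearest-rank P90 and the population standard deviation, and reads the high-stakes failure rate as the fraction of failed cases whose severity is `critical` or `high`; the function name `overall_metrics_sketch` is hypothetical.

```python
import math
import statistics


def overall_metrics_sketch(cases):
    """Aggregate per-case dicts into a subset of the run-level metrics."""
    if not cases:
        raise ValueError("need at least one case")
    n = len(cases)
    risks = sorted(c["risk_score"] for c in cases)
    failures = [c for c in cases if not c["passed"]]

    # Nearest-rank percentile (an assumption; other definitions exist).
    p90 = risks[max(0, math.ceil(0.9 * n) - 1)]

    # Each case contributes its severity weight instead of a flat count of 1.
    total_weight = sum(c["severity_weight"] for c in cases)
    passed_weight = sum(c["severity_weight"] for c in cases if c["passed"])

    high_stakes = [c for c in failures if c["severity"] in ("critical", "high")]
    safe_total = sum(c["safe_signal_hits"] for c in cases)
    unsafe_total = sum(c["unsafe_signal_hits"] for c in cases)

    return {
        "pass_rate": (n - len(failures)) / n,
        "risk_mean": statistics.fmean(risks),
        "risk_median": statistics.median(risks),
        "risk_std": statistics.pstdev(risks),
        "risk_p90": p90,
        "risk_max": risks[-1],
        "severity_weighted_pass_rate": passed_weight / total_weight,
        "high_stakes_failure_rate": len(high_stakes) / len(failures) if failures else 0.0,
        "boundary_language_rate": sum(bool(c["boundary_or_refusal_signal"]) for c in cases) / n,
        "safe_signal_total": safe_total,
        "unsafe_signal_total": unsafe_total,
        # None serializes to JSON null when the run has no unsafe hits.
        "safe_unsafe_ratio": safe_total / unsafe_total if unsafe_total > 0 else None,
    }
```

Returning `None` for the ratio while still reporting both totals avoids a division by zero and keeps the "no unsafe hits" case distinguishable from a ratio of zero.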
By category: per-category n, pass rate, mean/median risk, mean weighted risk, critical and high-severity failure counts, average signal hits, boundary rate.
By severity tier: pass/fail counts and pass rate per tier.
Failure mode histogram: frequency of detected_failure_modes across the run.
Composite indices:
- Resilience index: `1 - mean(weighted_risk)` clipped to [0, 1]; higher is better.
- Exposure index: `mean(weighted_risk)` clipped to [0, 1]; higher is worse.
- Fragility spread: standard deviation of `risk_score` (uneven performance).
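
The three indices follow directly from the per-case values. A minimal sketch, assuming the population standard deviation for the spread (the document does not say whether the sample or population form is used) and a hypothetical function name:

```python
import statistics


def composite_indices_sketch(weighted_risks, risk_scores):
    """Compute resilience, exposure, and fragility spread for one run."""
    mean_wr = statistics.fmean(weighted_risks)

    def clip01(x):
        return min(1.0, max(0.0, x))

    return {
        "resilience_index": clip01(1.0 - mean_wr),  # higher is better
        "exposure_index": clip01(mean_wr),          # higher is worse
        # Population std is an assumption made for this sketch.
        "fragility_spread": statistics.pstdev(risk_scores),
    }
```

Because `weighted_risk` is already capped at 1, resilience and exposure sum to 1 whenever no clipping occurs; the fragility spread adds information the other two lack, since two runs with the same mean can differ widely in how uneven they are.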
Worst cases: top entries by weighted_risk.
Category ranking: categories with at least one probe, sorted by mean risk (descending).
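
Both rankings reduce to a sort. The sketch below assumes cases are dicts with `category`, `risk_score`, and `weighted_risk` keys; the `category` key name and the default `k=5` are illustrative assumptions, as the document does not say how many worst cases are reported.

```python
from collections import defaultdict
from statistics import fmean


def worst_cases_sketch(cases, k=5):
    """Return the k cases with the highest weighted_risk."""
    return sorted(cases, key=lambda c: c["weighted_risk"], reverse=True)[:k]


def category_ranking_sketch(cases):
    """Return (category, mean risk_score) pairs, riskiest category first."""
    by_category = defaultdict(list)
    for c in cases:
        by_category[c["category"]].append(c["risk_score"])
    # Only categories with at least one probe appear, by construction.
    ranked = [(cat, fmean(risks)) for cat, risks in by_category.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```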
Observable geometry (observability in JSON reports)
When enough cases exist (default ≥5), build_report attaches an observability object:
- TF-IDF + truncated SVD text embedding built from each case’s category, severity, pass/fail, risk, task/input snapshot, explanation, and matched pattern labels.
- KMeans on the high-dimensional embedding (same spirit as `failure-geometry-demo`).
- Mutual information between cluster IDs and:
  - threat category
  - severity label
  - pass / fail outcome
- Per-case `scatter_x` / `scatter_y` for a separate 2-D SVD projection used only for visualization.
Interpretation: larger MI(cluster, category) suggests clusters align with threat family; larger MI(cluster, pass_fail) suggests clusters separate primarily by outcome. These are exploratory statistics, not guarantees of causal structure.
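
To make the interpretation concrete, here is a dependency-free sketch of discrete mutual information (in nats) between cluster IDs and a label sequence. The estimator the project actually uses is not stated in this document; `mutual_information_sketch` is a hypothetical name for the standard plug-in formula.

```python
import math
from collections import Counter


def mutual_information_sketch(xs, ys):
    """Plug-in mutual information (nats) between two label sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


clusters = [0, 0, 1, 1]
categories = ["injection", "injection", "exfiltration", "exfiltration"]
outcomes = ["pass", "fail", "pass", "fail"]

mi_category = mutual_information_sketch(clusters, categories)
mi_outcome = mutual_information_sketch(clusters, outcomes)
```

In this toy run the clusters coincide exactly with the (made-up) category labels, so `mi_category` equals ln 2, while each cluster contains one pass and one fail, so `mi_outcome` is 0. Real runs fall between these extremes, and with few cases the plug-in estimate is biased upward, which is one more reason to treat the values as exploratory.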
Rule-based limitation
Patterns are intentionally simple. They help reproduce a pipeline and inspect outputs; they are not a complete semantic judge. See limitations.md.