prompt-injection-frozen-probe — methodology submission rung
Author: Brandon Behring
Date published: 2026-05-18
Project: https://github.com/brandon-behring/prompt-injection-detection-prototype at v1.0.0
Submission audit ledger: see SUBMISSION_AUDIT.md in the repo.
Contamination tier (ADR-005 taxonomy): backbone-partial-disjoint.
This model card publishes the canonical fold0/seed42 checkpoint of the
frozen-probe rung from the methodology submission. The rung is one step of a
5-rung ladder characterising what successive capability layers add to
prompt-injection detection across an IID test slate (4-source LODO
held-out positives) and a 5-slice OOD slate (BIPIA + InjecAgent +
JBB-Behaviors + XSTest + NotInject). No rung is promoted as a
deployment recommendation — each rung's trade-offs are characterised
per ADR-005 methodology-over-metrics framing.
Intended use
Research and methodology characterisation only; not intended for production deployment, per ADR-005. The classifier-output behaviour is documented in the project WRITEUP §5 and §7.
Limitations
See the project's limitations spoke for the full list. Key points relevant to this checkpoint:
- LODO non-exchangeability (per assumption A-008): train sets overlap
across folds; per-fold variance is reported in
evals/audit/cross_fold_ci_audit.parquet.
- English-only; cross-language attacks are out of scope per ADR-016.
- Single-class OOD slices (bipia, injecagent, notinject) have AUROC/AUPRC undefined per the convention in the project's WRITEUP §Methodology caveats; only jbb_behaviors, xstest, and pooled_ood carry threshold-free ranking metrics.
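The single-class caveat follows from how AUROC is defined: it compares positive scores against negative scores, so a slice with only one class has nothing to compare. A minimal sketch of such a guard is below; `auroc_or_none` and the toy inputs are hypothetical and are not taken from the project's code.

```python
from typing import Optional, Sequence


def auroc_or_none(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Return AUROC, or None when the slice contains a single class."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        # No positive/negative pairs exist, so the ranking metric is undefined.
        return None
    # AUROC is the probability that a random positive outranks a random
    # negative, counting ties as half a win (Mann-Whitney formulation).
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy slices, not project data.
mixed = auroc_or_none([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7])
single = auroc_or_none([1, 1, 1], [0.9, 0.4, 0.6])
print(mixed, single)
```

The pairwise loop is quadratic and is only meant to make the definition visible; a rank-based implementation would be preferable on large slices.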
Headline results (canonical fold0/seed42; 95% BCa CI)
| Slice | AUPRC | AUROC |
|---|---|---|
| jbb_behaviors | 0.5517 [0.5203, 0.5804] | 0.5421 [0.5195, 0.5653] |
| xstest | 0.4677 [0.4482, 0.4860] | 0.5372 [0.5221, 0.5520] |
| pooled_ood | 0.3640 [0.3536, 0.3746] | 0.5149 [0.5048, 0.5249] |
Per-rung calibration (mean across folds × seeds):
| Slice | recall@FPR=1% (mean) | ECE (equal-mass) | Brier |
|---|---|---|---|
| jbb_behaviors | 0.0400 | 0.1787 | 0.2749 |
| xstest | 0.0113 | 0.1164 | 0.2585 |
| pooled_ood | 0.0026 | 0.1383 | 0.2617 |
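For readers unfamiliar with the three calibration columns, the following is a minimal sketch of how such metrics are commonly computed. The function names, the bin count, and the tie handling are assumptions made for illustration; the project's own definitions may differ in detail.

```python
from typing import Sequence


def brier_score(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def ece_equal_mass(labels: Sequence[int], probs: Sequence[float], n_bins: int = 10) -> float:
    """Expected calibration error with bins holding (nearly) equal counts."""
    pairs = sorted(zip(probs, labels), key=lambda t: t[0])  # sort by score only
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_prob = sum(p for p, _ in chunk) / len(chunk)
        frac_pos = sum(y for _, y in chunk) / len(chunk)
        ece += (len(chunk) / n) * abs(mean_prob - frac_pos)
    return ece


def recall_at_fpr(labels: Sequence[int], probs: Sequence[float], max_fpr: float = 0.01) -> float:
    """Recall at a threshold whose false-positive rate does not exceed max_fpr."""
    neg = sorted((p for y, p in zip(labels, probs) if y == 0), reverse=True)
    pos = [p for y, p in zip(labels, probs) if y == 1]
    allowed = int(max_fpr * len(neg))  # negatives permitted above the threshold
    # Predict positive when score > threshold.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(p > threshold for p in pos) / len(pos)


# Hypothetical toy data, not project results.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_probs = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
print(brier_score(toy_labels, toy_probs))
print(ece_equal_mass(toy_labels, toy_probs, n_bins=3))
print(recall_at_fpr(toy_labels, toy_probs, max_fpr=0.01))
```

`recall_at_fpr` assumes the slice contains both classes, which is consistent with the table listing only the mixed-class slices.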
Source: evals/results.json at v1.0.0 (BCa bootstrap per ADR-022,
10 000 resamples). Full per-rung × per-slice grid in the project
WRITEUP §Results.
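As a rough illustration of what a BCa (bias-corrected and accelerated) bootstrap interval involves, here is a standard-library sketch applied to a simple mean, such as a Brier score viewed as the mean of per-example squared errors. The function name, the default of 2000 resamples, and the toy data are assumptions; this is not the project's ADR-022 implementation, which uses 10 000 resamples and may differ in other ways.

```python
import random
from statistics import NormalDist, mean
from typing import Callable, Sequence, Tuple


def bca_interval(
    data: Sequence[float],
    stat: Callable[[Sequence[float]], float] = mean,
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Bias-corrected and accelerated (BCa) bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(data)
    norm = NormalDist()
    theta_hat = stat(data)

    # 1. Bootstrap distribution of the statistic (resampling with replacement).
    boots = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))

    # 2. Bias correction z0 from the share of resamples below the estimate.
    below = sum(b < theta_hat for b in boots) / n_resamples
    below = min(max(below, 1 / n_resamples), 1 - 1 / n_resamples)
    z0 = norm.inv_cdf(below)

    # 3. Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(list(data[:i]) + list(data[i + 1:])) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    # 4. Adjust the nominal percentiles, then read them off the sorted resamples.
    def adjusted(level: float) -> float:
        z = norm.inv_cdf(level)
        return norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    alpha = (1 - confidence) / 2
    lo_idx = int(adjusted(alpha) * (n_resamples - 1))
    hi_idx = int(adjusted(1 - alpha) * (n_resamples - 1))
    return boots[lo_idx], boots[hi_idx]


# Hypothetical per-example squared errors, not project data.
toy_rng = random.Random(1)
sq_errors = [toy_rng.random() ** 2 for _ in range(40)]
low, high = bca_interval(sq_errors)
print(round(mean(sq_errors), 4), (round(low, 4), round(high, 4)))
```

For ranking metrics such as AUROC or AUPRC, the resampling would need to draw (label, score) pairs jointly and handle resamples that happen to contain a single class.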
Reproducibility (T0)
git clone https://github.com/brandon-behring/prompt-injection-detection-prototype
cd prompt-injection-detection-prototype
make install
make eval-from-hub RUNG=frozen-probe
This downloads the checkpoint, runs a CPU evaluation against the local validation
slate, and checks that the scores match evals/results.json to within 1e-4 absolute,
per ADR-034. Expect roughly 10-30 minutes and no GPU cost.
A full T1 GPU re-evaluation is available via make headline-cloud (about $28 on a RunPod A100 80GB).
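The score-match step can be pictured as an absolute-tolerance comparison between the published metrics and the reproduced ones. The sketch below assumes a flat mapping from metric name to value; the key names and the reproduced values are hypothetical, and the real evals/results.json schema and the check performed by the make target may differ.

```python
import math
from typing import Dict, List


def score_mismatches(
    reference: Dict[str, float],
    reproduced: Dict[str, float],
    abs_tol: float = 1e-4,
) -> List[str]:
    """Return metric keys that are missing or differ by more than abs_tol."""
    bad = []
    for key, ref_value in reference.items():
        got = reproduced.get(key)
        # rel_tol=0.0 makes this a purely absolute comparison.
        if got is None or not math.isclose(got, ref_value, rel_tol=0.0, abs_tol=abs_tol):
            bad.append(key)
    return bad


# Hypothetical flat layout; reference values are the AUROC figures reported above.
reference = {"jbb_behaviors/auroc": 0.5421, "xstest/auroc": 0.5372}
reproduced = {"jbb_behaviors/auroc": 0.54213, "xstest/auroc": 0.5380}
print(score_mismatches(reference, reproduced))
```

An absolute rather than relative tolerance keeps the check equally strict for small values such as recall@FPR=1%, where a relative tolerance would become very tight.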
Citation
@misc{behring2026promptinjectionfrozenprobe,
author = {Behring, Brandon},
title = {prompt-injection-frozen-probe — methodology submission rung},
year = {2026},
url = {https://github.com/brandon-behring/prompt-injection-detection-prototype/tree/v1.0.0}
}
Linked ADRs
ADR-005 (contamination taxonomy), ADR-015 (single-backbone slate), ADR-016 (data design), ADR-019 (transformer training recipe), ADR-032 (HF Hub publication discipline), ADR-034 (T0 reproducibility tier), ADR-050 (rung-slate narrowing).
Datasets used to train BBehring/prompt-injection-frozen-probe
Lakera/gandalf_ignore_instructions
hackaprompt/hackaprompt-dataset
Evaluation results
- AUPRC on jbb_behaviors (self-reported): 0.552
- AUROC on jbb_behaviors (self-reported): 0.542
- AUPRC on xstest (self-reported): 0.468
- AUROC on xstest (self-reported): 0.537
- AUPRC on pooled_ood (self-reported): 0.364
- AUROC on pooled_ood (self-reported): 0.515