prompt-injection-frozen-probe — methodology submission rung

Author: Brandon Behring
Date published: 2026-05-18
Project: https://github.com/brandon-behring/prompt-injection-detection-prototype at v1.0.0
Submission audit ledger: see SUBMISSION_AUDIT.md in the repo.
Contamination tier (ADR-005 taxonomy): backbone-partial-disjoint.

This model card publishes the canonical fold0/seed42 checkpoint of the frozen-probe rung from the methodology submission. The rung is one step of a 5-rung ladder that characterises what successive capability layers add to prompt-injection detection, evaluated on an IID test slate (4-source LODO held-out positives) and a 5-slice OOD slate (BIPIA, InjecAgent, JBB-Behaviors, XSTest, NotInject). No rung is promoted as a deployment recommendation; each rung's trade-offs are characterised under the ADR-005 methodology-over-metrics framing.

Intended use

For research and methodology characterisation only; not for production deployment (per ADR-005). Classifier-output behaviour is documented in the project WRITEUP §5 and §7.

Limitations

See the project's limitations spoke for the full list. Key points relevant to this checkpoint:

LODO non-exchangeability (per assumption A-008): train sets overlap across folds, so the folds are not independent samples; per-fold variance is reported in evals/audit/cross_fold_ci_audit.parquet.
English-only: cross-language attacks are out of scope per ADR-016.
Single-class OOD slices (bipia, injecagent, notinject): AUROC and AUPRC are undefined, per the project's WRITEUP §Methodology caveats convention; only jbb_behaviors, xstest, and pooled_ood carry threshold-free ranking metrics.
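
To make the single-class caveat concrete, here is a minimal sketch of a rank-based AUROC that returns None when a slice contains only one label. It is an illustration, not the project's evaluation code; the function name and the toy scores are hypothetical.

```python
from typing import Optional, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Rank-based AUROC; None when only one class is present (hypothetical helper)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        # No positive/negative pairs exist, so there is nothing to rank.
        return None
    # Fraction of (positive, negative) pairs where the positive scores higher;
    # ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores for illustration only, not project data.
mixed = auroc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])   # both classes present
single = auroc([1, 1, 1], [0.9, 0.2, 0.4])          # positives only
print(mixed, single)
```

The same reasoning applies to AUPRC: precision and recall cannot both vary across thresholds when one class is absent, so a ranking metric carries no information on such a slice.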

Headline results (canonical fold0/seed42; 95% BCa CI)

Slice	AUPRC	AUROC
`jbb_behaviors`	0.5517 [0.5203, 0.5804]	0.5421 [0.5195, 0.5653]
`xstest`	0.4677 [0.4482, 0.4860]	0.5372 [0.5221, 0.5520]
`pooled_ood`	0.3640 [0.3536, 0.3746]	0.5149 [0.5048, 0.5249]
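
The intervals above are BCa (bias-corrected and accelerated) bootstrap intervals. As a rough sketch of how such an interval is computed, the following uses only the Python standard library. It assumes a simple i.i.d. resampling of per-example values and a toy mean statistic; the project's actual resampling scheme and metric code are not shown in this card, and `bca_interval` is a hypothetical name.

```python
import random
from statistics import NormalDist
from typing import Callable, List, Tuple


def bca_interval(
    data: List[float],
    stat: Callable[[List[float]], float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """BCa bootstrap interval for stat(data) (illustrative sketch)."""
    rng = random.Random(seed)
    nd = NormalDist()
    n = len(data)
    theta_hat = stat(data)

    # Bootstrap distribution of the statistic, sorted for quantile lookup.
    boots = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_resamples)
    )

    # Bias correction z0: how far the bootstrap median sits from the estimate.
    below = sum(b < theta_hat for b in boots) / n_resamples
    below = min(max(below, 1.0 / (n_resamples + 1)), n_resamples / (n_resamples + 1.0))
    z0 = nd.inv_cdf(below)

    # Acceleration from jackknife (leave-one-out) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(q: float) -> float:
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))

    def pick(q: float) -> float:
        return boots[min(n_resamples - 1, max(0, int(q * n_resamples)))]

    return pick(adjusted_quantile(alpha / 2)), pick(adjusted_quantile(1 - alpha / 2))


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


# Toy data for illustration; fewer resamples than the default to keep the demo fast.
values = [i / 49 for i in range(50)]
lo, hi = bca_interval(values, mean, n_resamples=2000)
print(lo, hi)
```

For ranking metrics such as AUROC, a resample can end up with a single class, so a real implementation would need to handle or avoid that case (for example by stratified resampling); this sketch does not.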

Per-rung calibration (mean across folds × seeds):

Slice	recall@FPR=1% (mean)	ECE (equal-mass)	Brier
`jbb_behaviors`	0.0400	0.1787	0.2749
`xstest`	0.0113	0.1164	0.2585
`pooled_ood`	0.0026	0.1383	0.2617

Source: evals/results.json at v1.0.0 (BCa bootstrap per ADR-022, 10 000 resamples). Full per-rung × per-slice grid in the project WRITEUP §Results.
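
For readers unfamiliar with the calibration columns, the sketch below shows one common way to compute recall at a fixed false-positive rate, equal-mass ECE, and the Brier score. The bin count, the tie handling, and the function names are assumptions made for illustration; the card does not state the project's exact choices.

```python
from typing import Optional, Sequence


def brier(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def ece_equal_mass(labels: Sequence[int], probs: Sequence[float], n_bins: int = 10) -> float:
    """ECE with bins holding (nearly) equal numbers of examples, ordered by score."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    ece = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        conf = sum(probs[i] for i in idx) / len(idx)   # mean predicted probability
        freq = sum(labels[i] for i in idx) / len(idx)  # observed positive rate
        ece += len(idx) / n * abs(conf - freq)
    return ece


def recall_at_fpr(labels: Sequence[int], probs: Sequence[float], max_fpr: float = 0.01) -> Optional[float]:
    """Recall at the loosest threshold whose false-positive rate stays <= max_fpr."""
    neg = sorted((p for y, p in zip(labels, probs) if y == 0), reverse=True)
    pos = [p for y, p in zip(labels, probs) if y == 1]
    if not neg or not pos:
        return None  # undefined without both classes
    allowed = int(max_fpr * len(neg))  # negatives permitted above the threshold
    threshold = neg[allowed]           # flag only scores strictly above this value
    return sum(p > threshold for p in pos) / len(pos)


# Toy labels and scores for illustration only, not project data.
y = [0, 0, 0, 0, 1, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.7, 0.4, 0.8, 0.9, 0.95]
print(recall_at_fpr(y, p), ece_equal_mass(y, p, n_bins=4), brier(y, p))
```

With few negatives, a 1% FPR budget allows zero false positives, so the threshold is pinned to the highest-scoring negative; that is one reason recall@FPR=1% can be very low even when AUROC is above 0.5.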

Reproducibility (T0)

git clone https://github.com/brandon-behring/prompt-injection-detection-prototype
cd prompt-injection-detection-prototype
make install
make eval-from-hub RUNG=frozen-probe

This downloads the checkpoint, runs CPU eval against the local val slate, and score-matches the output against evals/results.json within 1e-4 absolute tolerance per ADR-034. Expect roughly 10-30 minutes and no GPU cost.
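
The score-match step amounts to an absolute-tolerance comparison of recomputed metrics against the published ones. The sketch below illustrates the idea only: the flat "slice/metric" key layout and the `score_mismatches` helper are hypothetical, and the real schema of evals/results.json is not described in this card.

```python
import math
from typing import Dict, List, Optional, Tuple


def score_mismatches(
    reference: Dict[str, float],
    recomputed: Dict[str, float],
    atol: float = 1e-4,
) -> List[Tuple[str, float, Optional[float]]]:
    """Return (key, reference, recomputed) for every metric outside the tolerance."""
    bad = []
    for key, ref in reference.items():
        new = recomputed.get(key)
        # A missing or NaN value counts as a mismatch.
        if new is None or math.isnan(new) or abs(new - ref) > atol:
            bad.append((key, ref, new))
    return bad


# Reference values are taken from the headline table; the recomputed values
# are made up to show one pass and one failure.
reference = {"jbb_behaviors/auprc": 0.5517, "xstest/auroc": 0.5372}
recomputed = {"jbb_behaviors/auprc": 0.55172, "xstest/auroc": 0.5380}
bad = score_mismatches(reference, recomputed)
print(bad)
```

An absolute tolerance suits metrics bounded in [0, 1]; a relative tolerance would be overly strict for values near zero, such as the recall@FPR=1% figures.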

A full T1 GPU re-eval is available via make headline-cloud (about $28 on a RunPod A100 80GB).

Citation

@misc{behring2026promptinjectionfrozenprobe,
  author       = {Behring, Brandon},
  title        = {prompt-injection-frozen-probe — methodology submission rung},
  year         = {2026},
  url          = {https://github.com/brandon-behring/prompt-injection-detection-prototype/tree/v1.0.0}
}

Linked ADRs

ADR-005 (contamination taxonomy), ADR-015 (single-backbone slate), ADR-016 (data design), ADR-019 (transformer training recipe), ADR-032 (HF Hub publication discipline), ADR-034 (T0 reproducibility tier), ADR-050 (rung-slate narrowing).

Format: Safetensors
Model size: 0.1B params
Tensor type: BF16
