unplug-tiny-v1

Find the attack. Cut the attack. Keep the rest.

unplug-tiny is a dual-head span detector for prompt injection. A document head decides whether text is hostile; a BIOES token head localizes where - so your pipeline can redact the malicious span instead of throwing away the whole document.

Preview release. unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data - including the axes where it fails. It is not a production WAF.

At a glance


Task	Prompt-injection detection + character-level span localization
Architecture	Dual-head encoder: document classifier + BIOES token head
Backbone	DeBERTa-v3-xsmall (70M params, 22M non-embedding)
Decision policy	`doc_or_span` - doc threshold 0.9, span threshold 0.45
Long documents	Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK
Checkpoint	`checkpoint-66630`
License	Apache-2.0

Quickstart

The recommended path is the Unplug SDK, which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:

pip install "unplug-ai[ml]"

from unplug import Guard

guard = Guard.with_tiny()          # auto-downloads this checkpoint
result = guard.scan(untrusted_text)

if not result.safe:
    print(result.redacted_text)    # malicious spans replaced, rest preserved
    for f in result.findings:
        print(f.category, f.span_start, f.span_end, f.score)

Streaming LLM output and full long-document coverage:

scanner = guard.stream_scanner(scan_every_chars=1024)
for chunk in token_stream:
    if hit := scanner.push(chunk):
        handle(hit)
scanner.flush()

The checkpoint uses a custom dual-head architecture; loading it raw with AutoModel will not give you the decision policy. Use the SDK or replicate the policy from config.json (dual_head: true, doc_positive_index, label2id).

Try it live

Open the interactive demo to paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.

Where it's strong - and where it isn't

Strong (measured):

94.4% recall at 0.5% FPR on the core injection test set
96.3% recall on indirect injection embedded in task context (0.0% FPR)
0.9% FPR on benign text full of trigger words ("ignore", "instructions", ...)
97.1% span F1 - when it fires, it localizes precisely (0.0% benign span fire rate)

Weak (also measured):

Subtle out-of-distribution direct injections: 61.9% recall
Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) - this model detects injection, it is not a content-safety classifier
Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
Long agentic contexts: 76.1% recall

Evaluation

All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.

Detection holdouts (malicious)

Holdout	Recall	FPR	F1	FN	FP
Core injection test (942)	94.4%	0.5%	96.9%	31	2
Indirect injection in context (2000)	96.3%	0.0%	98.1%	74	0
Public validation set	100.0%	0.1%	100.0%	1	2
Span holdout (token-level)	98.8%	-	97.1%	219	805
OOD direct injection (281)	61.9%	10.2%	69.2%	40	18

Over-defense holdouts (benign - FPR, lower is better)

Holdout	FPR	FP
Trigger-word benign probes	0.0%	0
NotInject-style benign (339)	0.9%	3
Safe homonyms ("demolish my personal best")	2.8%	7
Combined homonym/over-defense set	40.2%	181
Harmful-but-not-injection contrast	87.0%	174

Public benchmark axes

Axis	Recall	Doc FPR	F1
InjecGuard validation (144)	89.6%	20.8%	77.5%
spikee contextual (986)	78.6%	6.7%	87.9%
BIPIA code (50)	98.0%	0.0%	99.0%
BIPIA text (75)	89.3%	0.0%	94.4%
BIPIA indirect proxy (1242)	97.3%	0.0%	98.6%
Deepset full (662)	82.9%	18.8%	78.4%
LLM-PIEval agentic (750, recall-only)	76.1%	0.0%	86.5%
Direct malicious proxy	81.0%	0.0%	89.5%
NotInject trigger benign (339)	-	0.9%	-
WildGuard benign diversity (971)	-	54.2%	-
Direct benign proxy	-	34.1%	-
JailbreakBench harmful goals (100)	-	96.0%	-
JailbreakBench benign goals (100)	-	6.0%	-
ToxicChat benign (≤4800)	-	2.0%	-
Combined public validation (3227)	81.0%	34.1%	71.7%

Release gates (full pass/fail record)

Gate	Value	Status
fp_probes	True	PASS
neuralchemy_test_doc_fpr	0.5%	PASS
neuralchemy_test_doc_recall	94.4%	PASS
bipia_recall	96.3%	PASS
deepset_direct_recall	61.9%	FAIL
deepset_direct_fpr	10.2%	FAIL
notinject_fpr	0.9%	PASS
xstest_safe_fpr	2.8%	PASS
public_validation_recall	100.0%	PASS
public_validation_fpr	0.1%	PASS
span_holdout_f1	97.1%	PASS
malicious_span_char_recall	97.4%	PASS
benign_span_fire_rate	0.0%	PASS
xstest_harmful_contrast_fpr	87.0%	FAIL
exfil_demo	None	PASS

Shipped as a preview with two failing required gates (OOD direct recall/FPR) and one failing optional gate (harmful contrast), documented above.

Limitations

The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier - this model answers "is someone hijacking my LLM?", not "is this request harmful?"
Subtle direct OOD injections are often missed by both heads.
Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
Long agentic tool-use contexts have recall gaps.
English-centric training data.

Intended use

Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary - combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).

Part of the Unplug stack

Layer	What it does
`unplug-ai` SDK	Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction
unplug-tiny-v1 (this model)	ML span detection tier
Live demo	Interactive span highlighting + redaction

Agent kill-chain walkthrough: agent_exfil_demo.py - hidden webpage injection -> tainted session -> blocked exfiltration tool call.

Downloads last month: 49

Safetensors

Model size

70.7M params

Tensor type

F32

Model tree for Unplug-AI/unplug-tiny-v1

Base model

microsoft/deberta-v3-xsmall

Finetuned

(51)

this model

Unplug-AI
/

unplug-tiny-v1