unplug-tiny-v1

Find the attack. Cut the attack. Keep the rest.

unplug-tiny is a dual-head span detector for prompt injection. A document head decides whether text is hostile; a BIOES token head localizes where - so your pipeline can redact the malicious span instead of throwing away the whole document.


Preview release. unplug-tiny is the smallest tier of the Unplug defense layer. The numbers below are measured by a frozen evaluation harness on held-out data - including the axes where it fails. It is not a production WAF.

At a glance

Task: Prompt-injection detection + character-level span localization
Architecture: Dual-head encoder (document classifier + BIOES token head)
Backbone: DeBERTa-v3-xsmall (70M params, 22M non-embedding)
Decision policy: doc_or_span - doc threshold 0.9, span threshold 0.45
Long documents: Full coverage via sliding windows (2048 chars, 256 overlap) in the SDK
Checkpoint: checkpoint-66630
License: Apache-2.0
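The doc_or_span policy can be read as a logical OR over the two heads. The sketch below is one plausible reading, not the SDK's code: it assumes the text is flagged when the document score reaches 0.9 or any span score reaches 0.45. The function name `is_flagged`, the use of `>=` rather than `>`, and the way span scores are passed in are all assumptions.

```python
from typing import Sequence

DOC_THRESHOLD = 0.9    # doc threshold stated on this card
SPAN_THRESHOLD = 0.45  # span threshold stated on this card


def is_flagged(doc_score: float, span_scores: Sequence[float]) -> bool:
    """Hypothetical doc_or_span rule: either head alone can flag the text."""
    doc_fires = doc_score >= DOC_THRESHOLD
    span_fires = any(score >= SPAN_THRESHOLD for score in span_scores)
    return doc_fires or span_fires
```

Under this reading, a confident span can flag a document even when the document head stays below its threshold, which is what makes redaction possible on otherwise benign-looking text.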

Quickstart

The recommended path is the Unplug SDK, which wires text normalization, encoded-payload decoding, thresholds, span merging, and redaction around the model:

pip install "unplug-ai[ml]"

Then scan untrusted text from Python:

from unplug import Guard

guard = Guard.with_tiny()          # auto-downloads this checkpoint
result = guard.scan(untrusted_text)  # untrusted_text: any string you do not control

if not result.safe:
    print(result.redacted_text)    # malicious spans replaced, rest preserved
    for f in result.findings:
        print(f.category, f.span_start, f.span_end, f.score)
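To make the "span merging, and redaction" step concrete, here is a minimal sketch of how overlapping character spans could be merged and replaced. It is not the SDK's implementation: the helper names `merge_spans` and `redact`, the `[REDACTED]` placeholder, and the end-exclusive span convention are assumptions chosen for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed)


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping or touching spans into a sorted, disjoint list."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text: str, spans: List[Span], placeholder: str = "[REDACTED]") -> str:
    """Replace each merged span with a placeholder and keep everything else."""
    pieces = []
    cursor = 0
    for start, end in merge_spans(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])           # untouched tail
    return "".join(pieces)
```

Merging first matters: redacting two overlapping spans independently would shift offsets and could leave part of the payload behind.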

Streaming LLM output and full long-document coverage:

scanner = guard.stream_scanner(scan_every_chars=1024)
for chunk in token_stream:          # token_stream: your iterable of text chunks
    if hit := scanner.push(chunk):
        handle(hit)                 # handle: your own callback for a finding
scanner.flush()
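For long documents, the card states that the SDK uses sliding windows of 2048 characters with 256 characters of overlap. A minimal sketch of that kind of windowing follows. It assumes "overlap" means consecutive windows share 256 characters (a step of 1792), and the function name `sliding_windows` is hypothetical; the SDK may window differently.

```python
from typing import Iterator, Tuple


def sliding_windows(
    text: str, size: int = 2048, overlap: int = 256
) -> Iterator[Tuple[int, str]]:
    """Yield (offset, window) pairs that together cover the whole text."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break  # this window reached the end of the text
        start += step
```

Each window carries its starting offset so that spans found inside a window can be shifted back into document coordinates before merging.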

The checkpoint uses a custom dual-head architecture; loading it raw with AutoModel will not give you the decision policy. Use the SDK or replicate the policy from config.json (dual_head: true, doc_positive_index, label2id).
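If you replicate the policy yourself, the token head's BIOES tags have to be turned into character spans. The sketch below shows one common way to decode BIOES, assuming you already have one tag string per token and each token's (start, end) character offsets. The tag spellings and the name `decode_bioes` are illustrative; the checkpoint's real label names live in label2id in config.json.

```python
from typing import List, Optional, Tuple


def decode_bioes(
    tags: List[str], offsets: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Turn per-token BIOES tags into (start, end) character spans.

    B opens a span, I continues it, E closes it, S is a single-token span,
    and O is outside. Malformed sequences (e.g. E without B) are dropped.
    """
    spans: List[Tuple[int, int]] = []
    open_start: Optional[int] = None
    for tag, (tok_start, tok_end) in zip(tags, offsets):
        kind = tag[:1]  # also handles suffixed tags such as "B-XYZ"
        if kind == "S":
            spans.append((tok_start, tok_end))
            open_start = None
        elif kind == "B":
            open_start = tok_start
        elif kind == "E" and open_start is not None:
            spans.append((open_start, tok_end))
            open_start = None
        elif kind != "I":
            open_start = None  # O, or a stray E, resets any open span
    return spans
```

Dropping malformed sequences is a conservative choice for this sketch; a production decoder might instead repair them, for example by closing an unterminated span at the last I token.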

Try it live

Open the interactive demo to paste text, see span highlights and redacted output, and compare against a regex-only baseline. Curated test cases include the ones this model gets wrong.

Where it's strong - and where it isn't

Strong (measured):

  • 94.4% recall at 0.5% FPR on the core injection test set
  • 96.3% recall on indirect injection embedded in task context (0.0% FPR)
  • 0.9% FPR on benign text full of trigger words ("ignore", "instructions", ...)
  • 97.1% span F1 - when it fires, it localizes precisely (0.0% benign span fire rate)

Weak (also measured):

  • Subtle out-of-distribution direct injections: 61.9% recall
  • Harmful-but-not-injection requests: the doc head over-fires (87.0% FPR on that contrast axis) - this model detects injection, it is not a content-safety classifier
  • Diverse benign chat from adversarial-adjacent distributions: up to 54.2% FPR on the hardest benign axis
  • Long agentic contexts: 76.1% recall

Evaluation

All numbers are produced by a frozen golden-eval harness on held-out data. Recall is reported on malicious sets, FPR on benign sets. No number on this card is hand-typed.

Detection holdouts (malicious)

| Holdout | Recall | FPR | F1 | FN | FP |
| --- | --- | --- | --- | --- | --- |
| Core injection test (942) | 94.4% | 0.5% | 96.9% | 31 | 2 |
| Indirect injection in context (2000) | 96.3% | 0.0% | 98.1% | 74 | 0 |
| Public validation set | 100.0% | 0.1% | 100.0% | 1 | 2 |
| Span holdout (token-level) | 98.8% | - | 97.1% | 219 | 805 |
| OOD direct injection (281) | 61.9% | 10.2% | 69.2% | 40 | 18 |
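As a reading aid for the Recall, FPR, and F1 columns, the sketch below computes them from confusion counts using the standard definitions. It is not the evaluation harness. The worked row assumes that all 2000 samples in the indirect-injection holdout are malicious, which the card does not state; under that assumption FPR is undefined for the row, so the function returns None for it.

```python
from typing import Dict, Optional


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Recall, FPR and F1 from confusion counts; None where undefined."""
    recall = tp / (tp + fn) if tp + fn else None
    fpr = fp / (fp + tn) if fp + tn else None
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else None
    return {"recall": recall, "fpr": fpr, "f1": f1}


# Indirect-injection row: FN = 74, FP = 0, and (assumed) 2000 malicious samples.
row = detection_metrics(tp=2000 - 74, fp=0, fn=74, tn=0)
```

With those counts, recall is 1926/2000 = 96.3% and F1 is 3852/3926, about 98.1%, matching the table row.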

Over-defense holdouts (benign - FPR, lower is better)

| Holdout | FPR | FP |
| --- | --- | --- |
| Trigger-word benign probes | 0.0% | 0 |
| NotInject-style benign (339) | 0.9% | 3 |
| Safe homonyms ("demolish my personal best") | 2.8% | 7 |
| Combined homonym/over-defense set | 40.2% | 181 |
| Harmful-but-not-injection contrast | 87.0% | 174 |

Public benchmark axes

| Axis | Recall | Doc FPR | F1 |
| --- | --- | --- | --- |
| InjecGuard validation (144) | 89.6% | 20.8% | 77.5% |
| spikee contextual (986) | 78.6% | 6.7% | 87.9% |
| BIPIA code (50) | 98.0% | 0.0% | 99.0% |
| BIPIA text (75) | 89.3% | 0.0% | 94.4% |
| BIPIA indirect proxy (1242) | 97.3% | 0.0% | 98.6% |
| Deepset full (662) | 82.9% | 18.8% | 78.4% |
| LLM-PIEval agentic (750, recall-only) | 76.1% | 0.0% | 86.5% |
| Direct malicious proxy | 81.0% | 0.0% | 89.5% |
| NotInject trigger benign (339) | - | 0.9% | - |
| WildGuard benign diversity (971) | - | 54.2% | - |
| Direct benign proxy | - | 34.1% | - |
| JailbreakBench harmful goals (100) | - | 96.0% | - |
| JailbreakBench benign goals (100) | - | 6.0% | - |
| ToxicChat benign (≤4800) | - | 2.0% | - |
| Combined public validation (3227) | 81.0% | 34.1% | 71.7% |

Release gates (full pass/fail record)

| Gate | Value | Status |
| --- | --- | --- |
| fp_probes | True | PASS |
| neuralchemy_test_doc_fpr | 0.5% | PASS |
| neuralchemy_test_doc_recall | 94.4% | PASS |
| bipia_recall | 96.3% | PASS |
| deepset_direct_recall | 61.9% | FAIL |
| deepset_direct_fpr | 10.2% | FAIL |
| notinject_fpr | 0.9% | PASS |
| xstest_safe_fpr | 2.8% | PASS |
| public_validation_recall | 100.0% | PASS |
| public_validation_fpr | 0.1% | PASS |
| span_holdout_f1 | 97.1% | PASS |
| malicious_span_char_recall | 97.4% | PASS |
| benign_span_fire_rate | 0.0% | PASS |
| xstest_harmful_contrast_fpr | 87.0% | FAIL |
| exfil_demo | None | PASS |

Shipped as a preview with two failing required gates (OOD direct recall/FPR) and one failing optional gate (harmful contrast), documented above.

Limitations

  • The doc head over-fires on harmful-but-non-injection text. If you need content safety, pair this with a dedicated harmful-content classifier - this model answers "is someone hijacking my LLM?", not "is this request harmful?"
  • Subtle direct OOD injections are often missed by both heads.
  • Diverse benign conversational text from adversarial-adjacent sources triggers false positives.
  • Long agentic tool-use contexts have recall gaps.
  • English-centric training data.

Intended use

Defense-in-depth layer for LLM apps and agents: scan untrusted input (user messages, RAG chunks, tool output, fetched web content) before it reaches your model, and redact flagged spans. Not a standalone security boundary - combine with tool-call gating, taint tracking, and least-privilege design (all included in the SDK).
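As an illustration of scanning untrusted input before it reaches the model, the sketch below sanitizes a list of RAG chunks. It relies only on the `scan`, `safe`, and `redacted_text` names shown in the Quickstart; the helper `sanitize_chunks` is hypothetical and not part of the SDK.

```python
from typing import Iterable, List


def sanitize_chunks(guard, chunks: Iterable[str]) -> List[str]:
    """Scan each untrusted chunk; pass safe text through, redact the rest."""
    cleaned = []
    for chunk in chunks:
        result = guard.scan(chunk)
        cleaned.append(chunk if result.safe else result.redacted_text)
    return cleaned
```

In an application this might be called as `sanitize_chunks(Guard.with_tiny(), retrieved_chunks)` just before the chunks are assembled into the prompt, so that flagged spans are removed while the rest of each chunk stays available to the model.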

Part of the Unplug stack

| Layer | What it does |
| --- | --- |
| unplug-ai SDK | Guard pipeline: normalization, regex + ML scanners, taint tracking, tool-call gates, redaction |
| unplug-tiny-v1 (this model) | ML span detection tier |
| Live demo | Interactive span highlighting + redaction |

Agent kill-chain walkthrough: agent_exfil_demo.py - hidden webpage injection -> tainted session -> blocked exfiltration tool call.
