distilbert-insecure-output

Fine-tuned DistilBERT classifier that detects dangerous payloads in LLM-generated output.

Covers OWASP LLM Top 10 category LLM02: Insecure Output Handling.

What it detects

Malicious code or injection payloads that an LLM might generate, including:

  • Cross-site scripting (XSS): <script>alert(document.cookie)</script>
  • SQL injection: '; DROP TABLE users; --
  • Command injection: | cat /etc/passwd
  • Path traversal: ../../etc/shadow
  • UNION-based SQL attacks
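
To spot-check these categories against the classifier, a small per-category fixture set can be run through any callable that returns pipeline-style output. The sketch below assumes the callable returns a list holding one `{'label': ..., 'score': ...}` dict for a single input string, as the `transformers` text-classification pipeline does. The fixture strings (including the UNION example) and the `category_recall` helper are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical fixtures: example payloads for each category listed above.
FIXTURES: Dict[str, List[str]] = {
    "xss": ["<script>alert(document.cookie)</script>"],
    "sql_injection": ["'; DROP TABLE users; --"],
    "command_injection": ["| cat /etc/passwd"],
    "path_traversal": ["../../etc/shadow"],
    "union_sql": ["' UNION SELECT username, password FROM users --"],
}

# Any callable mapping one string to a pipeline-style result list.
Classifier = Callable[[str], List[dict]]


def category_recall(clf: Classifier, fixtures: Dict[str, List[str]] = FIXTURES) -> Dict[str, float]:
    """Return, per category, the fraction of payloads labeled MALICIOUS."""
    recall = {}
    for category, payloads in fixtures.items():
        hits = sum(1 for p in payloads if clf(p)[0]["label"] == "MALICIOUS")
        recall[category] = hits / len(payloads)
    return recall
```

With the real model, pass a pipeline object created as in the Usage section; a category scoring below 1.0 points to a payload shape worth investigating before relying on the filter.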

Labels

Label       ID  Meaning
SAFE        0   Safe output (normal text, parameterized queries, sanitized code)
MALICIOUS   1   Dangerous payload detected
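
The IDs above index the classifier's two output logits. A minimal sketch of turning raw logits into a label, assuming the model config maps ID 0 to SAFE and ID 1 to MALICIOUS as the table states (the `ID2LABEL` dict and `decode` helper are illustrative and not read from the actual config):

```python
import math
from typing import List, Tuple

# Assumed mapping, mirroring the Labels table; check the model's config.json.
ID2LABEL = {0: "SAFE", 1: "MALICIOUS"}


def decode(logits: List[float]) -> Tuple[str, float]:
    """Softmax over the logits, then return (label, probability) of the top class."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # max() returns the first index on ties, so equal logits decode to SAFE (ID 0).
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```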

Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="Builder117/distilbert-insecure-output")

clf("<script>alert(document.cookie)</script>")
# [{'label': 'MALICIOUS', 'score': 0.98}]

clf("SELECT * FROM products WHERE id = ?")
# [{'label': 'SAFE', 'score': 0.97}]  # parameterized query, safe
```

Training

  • Base model: distilbert-base-uncased
  • Positive class: XSS payloads, SQL injection strings, command injection, path traversal
  • Negative class: parameterized queries, sanitized code, normal text, safe SQL
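
The card does not publish the training data or hyperparameters. As a hedged illustration of the class layout described above, the sketch below builds labeled rows in the `{"text", "label"}` shape commonly used for `transformers` sequence classification and makes a seeded, stratified train/test split. The example strings, the `stratified_split` helper, and the 80/20 ratio are assumptions.

```python
import random
from typing import Dict, List, Tuple

# Illustrative rows only; label IDs follow the Labels table (0 = SAFE, 1 = MALICIOUS).
EXAMPLES: List[Dict] = [
    {"text": "<script>alert(document.cookie)</script>", "label": 1},
    {"text": "'; DROP TABLE users; --", "label": 1},
    {"text": "| cat /etc/passwd", "label": 1},
    {"text": "../../etc/shadow", "label": 1},
    {"text": "SELECT * FROM products WHERE id = ?", "label": 0},
    {"text": "cursor.execute('SELECT name FROM users WHERE id = %s', (uid,))", "label": 0},
    {"text": "return html.escape(user_input)", "label": 0},
    {"text": "The weather in Paris is mild today.", "label": 0},
]


def stratified_split(rows: List[Dict], test_frac: float = 0.2, seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Split rows so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({r["label"] for r in rows}):
        group = [r for r in rows if r["label"] == label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_frac))  # at least one test row per class
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```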

Limitations

  • Encoded payloads (base64, HTML entities, hex encoding) may evade detection
  • Context-blind: from the text alone, it cannot tell whether SQL will run as a parameterized query or be built by raw string concatenation
  • May produce false positives on security documentation that quotes attack strings
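
One partial mitigation for the encoding limitation is to classify decoded variants of the text as well as the raw string, and to flag the text if any variant is flagged. The sketch below is a hypothetical pre-processing step, not something the model does internally. It undoes HTML entities, percent-encoding, and strict base64 in a single pass, and does not handle raw hex strings or nested encodings. `decoded_variants`, `is_flagged`, and the 0.5 threshold are illustrative.

```python
import base64
import binascii
import html
from typing import Callable, List
from urllib.parse import unquote

# Any callable mapping one string to a pipeline-style result list.
Classifier = Callable[[str], List[dict]]


def decoded_variants(text: str) -> List[str]:
    """Return the raw text plus any distinct single-pass decodings of it."""
    variants = [text]
    candidates = [html.unescape(text), unquote(text)]
    stripped = text.strip()
    # Only try base64 on strings whose length fits a padded base64 token.
    if stripped and len(stripped) % 4 == 0:
        try:
            candidates.append(base64.b64decode(stripped, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            pass  # not valid base64, or decoded bytes are not UTF-8 text
    for c in candidates:
        if c not in variants:
            variants.append(c)
    return variants


def is_flagged(text: str, clf: Classifier, threshold: float = 0.5) -> bool:
    """Flag the text if any decoded variant is classified MALICIOUS at or above threshold."""
    for v in decoded_variants(text):
        result = clf(v)[0]
        if result["label"] == "MALICIOUS" and result["score"] >= threshold:
            return True
    return False
```

This trades extra classifier calls (up to four per input) for coverage of common single-layer encodings; it does nothing for the false-positive limitation.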

Part of

LLM Threat Shield โ€” OWASP LLM Top 10 detection suite.

Model details

  • Format: Safetensors
  • Model size: 67M params
  • Tensor type: F32