distilbert-insecure-output

Fine-tuned DistilBERT classifier that detects dangerous payloads in LLM-generated output.

Covers LLM02: Insecure Output Handling from the OWASP Top 10 for LLM Applications (2023 numbering; the 2025 list renames it LLM05: Improper Output Handling).

What it detects

Malicious code or injection payloads that an LLM might generate, including:

Cross-site scripting (XSS): <script>alert(document.cookie)</script>
SQL injection: '; DROP TABLE users; --
Command injection: | cat /etc/passwd
Path traversal: ../../etc/shadow
UNION-based SQL injection: ' UNION SELECT username, password FROM users --

Labels

Label	ID	Meaning
`SAFE`	0	Safe output (normal text, parameterized queries, sanitized code)
`MALICIOUS`	1	Dangerous payload detected

Usage

from transformers import pipeline

clf = pipeline("text-classification", model="Builder117/distilbert-insecure-output")

clf("<script>alert(document.cookie)</script>")
# [{'label': 'MALICIOUS', 'score': 0.98}]

clf("SELECT * FROM products WHERE id = ?")
# [{'label': 'SAFE', 'score': 0.97}]  # parameterized — safe
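
In an application, the classifier would typically sit as a gate in front of whatever renders or executes the model output. The following is a minimal sketch, not part of this model: the helper names `is_malicious` and `guard` and the 0.5 threshold are assumptions chosen for illustration. The classifier is passed in as a callable so that the `clf` pipeline above can be replaced by a stub in tests.

```python
from typing import Callable, Dict, List

# Shape of the `clf` pipeline: text in, list of {'label', 'score'} dicts out.
Classifier = Callable[[str], List[Dict[str, object]]]


def is_malicious(text: str, classify: Classifier, threshold: float = 0.5) -> bool:
    """Return True when the top prediction is MALICIOUS with score >= threshold.

    `threshold` is a hypothetical default; tune it on your own data.
    """
    top = classify(text)[0]
    return top["label"] == "MALICIOUS" and float(top["score"]) >= threshold


def guard(text: str, classify: Classifier, threshold: float = 0.5) -> str:
    """Pass unflagged output through unchanged; raise on flagged output."""
    if is_malicious(text, classify, threshold):
        raise ValueError("LLM output flagged as a dangerous payload")
    return text
```

With the pipeline loaded as above, `guard(llm_output, clf)` returns the text unchanged or raises `ValueError`. Raising an exception is only one possible policy; logging or escaping the output may suit some applications better.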

Training

Base model: distilbert-base-uncased
Positive class: XSS payloads, SQL injection strings, command injection, path traversal
Negative class: parameterized queries, sanitized code, normal text, safe SQL

Limitations

Encoded payloads (base64, HTML entities, hex encoding) may evade detection
Context-blind: the classifier sees only the output text, so it cannot tell whether a SQL string will be executed with bound parameters or built by raw string concatenation in the calling code
May produce false positives on security documentation that quotes attack strings
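
One way to partly reduce the encoded-payload limitation is to classify decoded variants of the text alongside the raw text. The sketch below is an assumption-laden illustration, not something shipped with this model: `candidate_views` and `any_view_malicious` are hypothetical helpers, the 16-character minimum for base64 runs is arbitrary, and nested or unusual encodings will still slip through.

```python
import base64
import binascii
import html
import re
from typing import Callable, Dict, List
from urllib.parse import unquote

# Hypothetical heuristics: runs of 16+ base64 characters, and \xNN hex escapes.
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
_HEX_ESCAPE_RE = re.compile(r"\\x([0-9a-fA-F]{2})")


def candidate_views(text: str) -> List[str]:
    """Return the raw text plus best-effort decoded variants, without duplicates."""
    views = [text]

    def add(view: str) -> None:
        if view and view not in views:
            views.append(view)

    add(html.unescape(text))  # &lt;script&gt; -> <script>
    add(unquote(text))        # %3Cscript%3E  -> <script>
    add(_HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text))  # \x3c -> <

    for match in _BASE64_RE.finditer(text):
        try:
            decoded = base64.b64decode(match.group(0), validate=True)
            add(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64-encoded UTF-8 text; skip it
    return views


def any_view_malicious(
    text: str, classify: Callable[[str], List[Dict[str, object]]]
) -> bool:
    """Flag the text if any raw or decoded view is labelled MALICIOUS."""
    return any(classify(v)[0]["label"] == "MALICIOUS" for v in candidate_views(text))
```

Calling `any_view_malicious(llm_output, clf)` runs the classifier once per view, so inference cost grows with the number of decodings that succeed.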

Part of

LLM Threat Shield — OWASP LLM Top 10 detection suite.

Model details

Format: Safetensors
Model size: 67M params
Tensor type: F32
