distilbert-insecure-output
Fine-tuned DistilBERT classifier that detects dangerous payloads in LLM-generated output.
Covers OWASP LLM Top 10, LLM02: Insecure Output Handling.
What it detects
Malicious code or injection payloads that an LLM might generate, including:
- Cross-site scripting (XSS): `<script>alert(document.cookie)</script>`
- SQL injection: `'; DROP TABLE users; --`
- Command injection: `| cat /etc/passwd`
- Path traversal: `../../etc/shadow`
- UNION-based SQL attacks
Labels
| Label | ID | Meaning |
|---|---|---|
| SAFE | 0 | Safe output (normal text, parameterized queries, sanitized code) |
| MALICIOUS | 1 | Dangerous payload detected |
Usage
from transformers import pipeline
clf = pipeline("text-classification", model="Builder117/distilbert-insecure-output")
clf("<script>alert(document.cookie)</script>")
# [{'label': 'MALICIOUS', 'score': 0.98}]
clf("SELECT * FROM products WHERE id = ?")
# [{'label': 'SAFE', 'score': 0.97}]  # parameterized query, so safe
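To act on the classifier's verdict, the prediction can be wrapped in a small gate that blocks LLM output before it reaches a browser, database, or shell. The following is a minimal sketch, not part of this model's API: `is_malicious`, `guard_output`, the placeholder string, and the `0.9` threshold are all hypothetical and should be tuned on your own data. It only assumes that `clf` returns a list of `{'label', 'score'}` dicts, as in the example above.

```python
from typing import Callable, Dict, List

# Assumption: `clf` behaves like the text-classification pipeline shown above,
# i.e. clf("some text") -> [{'label': 'SAFE' | 'MALICIOUS', 'score': float}].
Classifier = Callable[[str], List[Dict[str, object]]]


def is_malicious(text: str, clf: Classifier, threshold: float = 0.9) -> bool:
    """Return True if the top prediction is MALICIOUS with score >= threshold.

    The 0.9 default is an illustrative value, not a recommendation from the
    model authors.
    """
    top = clf(text)[0]
    return top["label"] == "MALICIOUS" and float(top["score"]) >= threshold


def guard_output(
    text: str,
    clf: Classifier,
    threshold: float = 0.9,
    placeholder: str = "[blocked: potentially unsafe output]",
) -> str:
    """Pass safe text through unchanged; replace flagged text with a placeholder."""
    return placeholder if is_malicious(text, clf, threshold) else text
```

With the real pipeline this would be called as `guard_output(llm_response, clf)`. A lower threshold blocks more aggressively at the cost of more false positives.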
Training
- Base model: `distilbert-base-uncased`
- Positive class: XSS payloads, SQL injection strings, command injection, path traversal
- Negative class: parameterized queries, sanitized code, normal text, safe SQL
Limitations
- Encoded payloads (base64, HTML entities, hex encoding) may evade detection
- Context-blind: cannot determine if SQL is parameterized vs. raw string concatenation from text alone
- May produce false positives on security documentation that quotes attack strings
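One way to reduce the encoded-payload gap is to classify a few decoded variants of the text alongside the original. The sketch below is an assumption about how a caller might do this, not something the model does itself: `decoded_variants` and `any_variant_malicious` are hypothetical helpers, only a single layer of HTML-entity, percent, and base64 decoding is attempted, and hex or nested encodings are not handled.

```python
import base64
import binascii
import html
from urllib.parse import unquote


def decoded_variants(text: str) -> list:
    """Return the original text plus single-layer decodings, without duplicates."""
    variants = [text, html.unescape(text), unquote(text)]
    try:
        # validate=True rejects input that is not well-formed base64.
        raw = base64.b64decode(text.strip(), validate=True)
        variants.append(raw.decode("utf-8"))
    except (binascii.Error, ValueError):
        pass  # not base64, or decoded bytes are not UTF-8 text

    unique = []
    for v in variants:
        if v not in unique:
            unique.append(v)
    return unique


def any_variant_malicious(text: str, clf) -> bool:
    """Flag the text if any decoded variant is labeled MALICIOUS.

    Assumption: `clf` returns [{'label': ..., 'score': ...}] for a string,
    like the pipeline in the Usage section.
    """
    return any(clf(v)[0]["label"] == "MALICIOUS" for v in decoded_variants(text))
```

Classifying several variants multiplies inference cost and can raise the false-positive rate, so this trade-off is worth measuring before enabling it.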
Part of
LLM Threat Shield, an OWASP LLM Top 10 detection suite.