🎯 BountyHunter · RedTeam
Fine-tuned from Qwen2.5-Coder-14B-Instruct via multi-phase reinforcement training (SFT + GRPO).
Full BF16 merged weights in a single safetensors file, ready to use out of the box with transformers · vLLM · TGI.
⚡ Quick Start
    pip install transformers torch accelerate

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    # Load the merged BF16 weights; device_map="auto" places layers on the available devices.
    model = AutoModelForCausalLM.from_pretrained(
        "Tidecaller/BountyHunter-RedTeam",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("Tidecaller/BountyHunter-RedTeam")

    messages = [
        {"role": "system", "content": (
            "You are BountyHunter, an elite security model developed by Security Researcher Tidecaller. "
            "Capabilities: vulnerability discovery | exploit development | code audit | penetration testing. "
            "Principles: code over theory, evidence-based. "
            "Output: security tasks use <think> reasoning chain before results."
        )},
        {"role": "user", "content": "Audit this C code for vulnerabilities: ..."},
    ]

    # Render the ChatML prompt, then tokenize it onto the model's device.
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # do_sample=True makes temperature/top_p take effect even if the generation config defaults to greedy decoding.
    outputs = model.generate(
        **inputs, max_new_tokens=4096, do_sample=True, temperature=0.5, top_p=0.9
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
📑 Contents
| | Section | Description |
|---|---|---|
| 🏆 | Why BountyHunter | Unique capabilities & value proposition |
| 📊 | Comprehensive Benchmarks | 5-dimension evaluation on A800 80GB |
| 📈 | Summary Dashboard | Visual scorecards at a glance |
| 🧭 | Use-Case Fit Matrix | What this model is (and isn't) for |
| 📋 | Model Specifications | Architecture, params, precision |
| 📦 | Usage | transformers · vLLM · TGI |
| 💾 | Resource Estimation | VRAM & hardware recommendations |
| 🔬 | Reasoning Chain Example | Sample <think> audit output |
| ⚠️ | Disclaimer & Ethics | Legal, ethical, and safety guardrails |
| 🙏 | Acknowledgments | Datasets, base model, community |
| 📜 | License & Citation | Apache 2.0 · BibTeX |
🏆 1. Why BountyHunter
| 🔍 Capability | 💎 Value |
|---|---|
| Vulnerability Discovery · 漏洞发现 | Automated audit of C / C++ / Python / Java — detects CWE-120 (Buffer Overflow), CWE-78 (Command Injection), CWE-89 (SQL Injection), and more |
| Think-Chain Reasoning · 思维链推理 | Structured <think>...</think> blocks — traceable, verifiable, step-by-step analysis |
| Security Knowledge · 安全知识库 | MITRE ATT&CK · CVE · ExploitDB · OWASP Top 10 · penetration testing methodology |
| Defensive Analysis · 防御分析 | Finds bugs AND provides concrete remediation & defense strategies |
| Bilingual EN/ZH · 中英双语 | English + Chinese security communities both natively supported |
| Plug-and-Play · 开箱即用 | Single model.safetensors file — one-line load with transformers |
📊 2. Comprehensive Benchmarks
BountyHunter-RedTeam is evaluated across five dimensions — general capability, code generation, security knowledge, vulnerability detection, and safety compliance.
All benchmarks run on NVIDIA A800 80GB using lm-evaluation-harness + vLLM batch inference.
2.1 General Capability
7 standard benchmarks measuring reasoning & knowledge retention after security fine-tuning.
| Benchmark | Metric | BountyHunter | Qwen2.5-Coder-14B | Δ (pp) |
|---|---|---|---|---|
| MMLU (57 subjects) | acc ↑ | 68.80% | ~79% | −10.2 |
| HellaSwag | acc_norm ↑ | 76.42% | ~84% | −7.6 |
| ARC-Challenge | acc_norm ↑ | 58.36% | ~67% | −8.6 |
| Winogrande | acc ↑ | 73.56% | ~78% | −4.4 |
| PIQA | acc_norm ↑ | 78.78% | ~82% | −3.2 |
| BoolQ | acc ↑ | 88.07% | ~89% | −0.9 |
| TruthfulQA MC2 | acc ↑ | 54.60% | ~58% | −3.4 |
💡 Security specialization costs general knowledge mainly in non-security STEM. Basic reasoning (BoolQ, −0.9 percentage points) is essentially preserved.
📋 MMLU Detailed Breakdown (68.80%)
| Category | Score | Representative Subjects |
|---|---|---|
| Social Sciences | 78.71% | International Law · Security Studies · Sociology |
| Other | 72.22% | Global Facts · Public Relations · Clinical Knowledge |
| STEM | 66.41% | See sub-table below |
| Humanities | 61.66% | History · Philosophy · Prehistory |
STEM Sub-Scores — security DNA is clearly visible:
| Subject | Score | Bar | Notes |
|---|---|---|---|
| 🟢 High School CS | 83.00% | ████████░░ | Top performer |
| 🟢 Computer Security | 77.00% | ███████░░░ | Core domain strength |
| 🟡 College CS | 68.00% | ██████░░░░ | Solid |
| 🟡 Elementary Math | 68.52% | ██████░░░░ | Baseline math intact |
| 🟡 Machine Learning | 63.39% | ██████░░░░ | OK |
| 🔴 College Math | 56.00% | █████░░░░░ | Expected weakness |
| 🔴 College Physics | 53.92% | █████░░░░░ | Expected weakness |
| 🔴 High School Math | 52.22% | █████░░░░░ | Below passing |
| 🔴 College Chemistry | 49.00% | ████░░░░░░ | Expected weakness |
💡 Computer Security (77%) and HS CS (83%) are well above the STEM average. Chemistry, Physics, and advanced Math are the trade-off from security specialization — far from the training distribution.
2.2 Code Generation
| Benchmark | Metric | BountyHunter | Qwen2.5-Coder-14B |
|---|---|---|---|
| HumanEval | pass@1 ↑ | 42.68% | ~72–75% |
💡 Code generation drops — expected. BountyHunter is trained for code auditing & vulnerability analysis, not competitive programming. It reads and analyzes code far better than it writes from scratch.
2.3 Security Knowledge — WMDP
WMDP measures knowledge of hazardous domains. Lower = more "forgotten" during safety training. For a red-team model, some retention is both expected and necessary.
| Benchmark | BountyHunter | Llama-3-8B-Instruct | Bar | Notes |
|---|---|---|---|---|
| WMDP Overall | 59.13% | 45–50% | ██████░░░░ | Higher = more domain knowledge |
| 🧬 WMDP-Bio | 72.19% | ~42% | ███████░░░ | ⚠️ Significant bio knowledge retained |
| 💻 WMDP-Cyber | 52.64% | ~40% | █████░░░░░ | Domain-appropriate for cybersecurity |
| ⚗️ WMDP-Chem | 50.00% | ~38% | █████░░░░░ | Lowest of the three domains, but still above the 25% chance level of a four-option test |
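💡 Cyber (52.6%) is appropriate, since it is the working domain. Chem (50.0%) is the weakest domain, though it remains well above the 25% chance level of four-option multiple choice, so hazardous chemistry knowledge is limited rather than absent.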
⚠️ WMDP measures knowledge recall, NOT behavioral compliance. For red-team, cybersecurity knowledge is a feature, not a bug.
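As a sanity check on the table above, the overall score can be reproduced as a question-weighted average of the three subsets. This is a minimal sketch: the subset sizes (1,273 bio, 1,987 cyber, 408 chem) are the public WMDP question counts and are an assumption here, as is the idea that the harness weights subsets by size; neither is stated in this card.

```python
# Per-subset accuracy (from the table above) and question counts.
# The counts are ASSUMED from the public WMDP release, not from this model card.
subsets = {
    "bio": (72.19, 1273),
    "cyber": (52.64, 1987),
    "chem": (50.00, 408),
}


def weighted_accuracy(parts: dict) -> float:
    """Average the accuracies, weighting each subset by its question count."""
    total = sum(count for _, count in parts.values())
    return sum(acc * count for acc, count in parts.values()) / total


overall = weighted_accuracy(subsets)
chance = 100 / 4  # WMDP questions have four answer options

print(f"overall: {overall:.2f}%")
for name, (acc, _) in subsets.items():
    print(f"{name}: {acc - chance:+.2f} points vs. chance")
```

Under these assumptions the weighted average matches the reported 59.13%, and every subset sits at least 25 points above chance.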
2.4 Security Capability — PrimeVul
PrimeVul (ICSE 2025) — 6,968 C/C++ functions across 140 CWEs with rigorous labeling. Three sub-tests probe different aspects of security understanding.
🔍 Binary Vulnerability Detection — is this function vulnerable?
| Metric | Score | Bar | What It Means |
|---|---|---|---|
| F1 | 65.0% | ███████░░░ | Dramatically above GPT-4+CoT (F1 ~3%) & StarCoder2-7B (F1=3.09%) |
| Recall | 88.5% | █████████░ | 🔥 Catches ~9/10 real vulnerabilities |
| Precision | 51.4% | █████░░░░░ | ~half of flagged functions are false positives |
| Accuracy | 51.0% | █████░░░░░ | Skewed by "report everything" red-team bias |
Confusion Matrix
| | Pred VULN | Pred SAFE |
|---|---|---|
| Actually VULN | 131 ✓ | 17 ✗ |
| Actually SAFE | 124 ✗ | 22 ✓ |
💡 Classic red-team bias: the model would rather cry wolf than miss a breach. It reaches 88.5% recall (only 17 misses out of 148 real vulns) at the cost of 124 false alarms among the 146 safe functions in the matrix above. This is intentional: in a security audit, triaging false positives is cheap, while a missed vuln can be catastrophic.
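The headline metrics can be recomputed directly from the four confusion-matrix cells. The sketch below uses only the counts shown above; the function name is illustrative.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of flagged functions that are truly vulnerable
    recall = tp / (tp + fn)     # share of vulnerable functions that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts taken from the confusion matrix above.
metrics = binary_metrics(tp=131, fn=17, fp=124, tn=22)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Precision, recall, and F1 agree with the table. Accuracy computed from these four cells is 52.0% (153/294), slightly above the reported 51.0%; the reported figure would match a denominator of 300, which may mean a few unparsed responses were counted as errors, though that is an assumption rather than something the card states.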
🏷️ CWE Classification — what type of vulnerability?
| Metric | Score |
|---|---|
| Exact Match | 5.7% (10/176) |
| Best → | CWE-78 (50%) · CWE-416 (33%) · CWE-20 (29%) |
| Worst → | CWE-119 (0%) · CWE-476 (0%) |
| CWE | Recall | Bar | Description |
|---|---|---|---|
| CWE-78 | 50.0% | █████░░░░░ | OS Command Injection, covered well in training |
| CWE-416 | 33.3% | ███░░░░░░░ | Use-After-Free, moderate |
| CWE-20 | 28.6% | ██░░░░░░░░ | Improper Input Validation |
| CWE-125 | 16.7% | █░░░░░░░░░ | Out-of-Bounds Read |
| CWE-119 | 0.0% | ░░░░░░░░░░ | Memory Buffer Errors → defaults to CWE-120 |
| CWE-476 | 0.0% | ░░░░░░░░░░ | NULL Pointer Dereference, not recognized |
💡 The model defaults to CWE-120 (Buffer Overflow) as catch-all for memory bugs. Pair with Semgrep / CodeQL for precise CWE labeling.
🔬 Paired Comparison — which version has the bug?
| Metric | Score |
|---|---|
| Accuracy | 28.5% (57/200) |
💡 Significantly below random (50%). PrimeVul patches are often single-line (strcpy→strncpy), and the model struggles without diff context. Always provide diff context for patch verification.
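One way to supply that diff context is to send a unified diff of the two versions instead of two separate functions. The sketch below uses Python's standard `difflib`; the prompt wording, the helper name, and the sample C snippets are illustrative assumptions, not the format used in the evaluation.

```python
import difflib


def build_patch_review_prompt(before: str, after: str, filename: str = "func.c") -> str:
    """Build a prompt that shows the patch as a unified diff with 3 lines of context."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
        n=3,
    )
    header = "Review this patch. Does it fix a vulnerability, and if so which one?\n\n"
    return header + "".join(diff)


# Hypothetical before/after versions of a single-line fix.
before = "void copy(char *dest, const char *src) {\n    strcpy(dest, src);\n}\n"
after = (
    "void copy(char *dest, const char *src) {\n"
    "    strncpy(dest, src, 63);\n"
    "    dest[63] = '\\0';\n"
    "}\n"
)

prompt = build_patch_review_prompt(before, after)
print(prompt)
```

The resulting string can be passed as the user message, so the changed lines are marked explicitly rather than left for the model to spot.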
2.5 Safety Compliance — HarmBench
HarmBench measures behavioral compliance via ASR (Attack Success Rate). Lower ASR = stronger safety alignment.
🔬 Driving ASR to 0% typically means refusing security queries wholesale, including legitimate CTF & pentesting requests. The goal is selective refusal: reject malicious requests, engage with authorized ones.
| Category | Count | Complied | Refused | ASR | Bar | Verdict |
|---|---|---|---|---|---|---|
| Overall | 320 | 35 | 285 | 10.9% | ██░░░░░░░░ | 🟡 Red-team calibrated |
| Standard | 159 | 3 | 156 | 1.9% | ░░░░░░░░░░ | ✅ Near-total refusal |
| Copyright | 80 | 21 | 59 | 26.3% | ███░░░░░░░ | ⚠️ LLM weak spot |
| Contextual | 81 | 11 | 70 | 13.6% | █░░░░░░░░░ | 🟡 Nuance mostly handled |
💡 1.9% ASR on standard harms is the headline — strong guardrails. Copyright (26.3%) is the main v2 improvement target. Contextual (13.6%) includes legitimate security queries a red-team model should comply with — actual harmful ASR is lower.
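For reference, ASR is simply the share of prompts the model complied with. A small sketch recomputing the figures from the counts in the table (variable and function names are illustrative):

```python
# (complied, total) per HarmBench category, taken from the table above.
results = {
    "Standard": (3, 159),
    "Copyright": (21, 80),
    "Contextual": (11, 81),
}


def attack_success_rate(complied: int, total: int) -> float:
    """ASR in percent: complied prompts divided by all prompts."""
    return 100.0 * complied / total


per_category = {name: attack_success_rate(c, t) for name, (c, t) in results.items()}
overall = attack_success_rate(
    sum(c for c, _ in results.values()),
    sum(t for _, t in results.values()),
)

for name, asr in per_category.items():
    print(f"{name}: {asr:.2f}%")
print(f"Overall: {overall:.2f}%")
```

The overall rate is the pooled ratio 35/320, not the mean of the three category rates, which is why the large Standard category pulls it down.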
📋 Cross-Model HarmBench Comparison
| Model Type | Typical ASR | Bar | Safety Profile |
|---|---|---|---|
| Unaligned base models | 60–95% | ████████░░ | 🔴 Dangerous |
| Jailbroken safety models | 40–70% | ██████░░░░ | 🔴 Bypassed safeguards |
| Standard aligned (Llama-3, Qwen-Instruct) | 5–15% | ██░░░░░░░░ | 🟡 Generally safe |
| BountyHunter-RedTeam | 10.9% | ██░░░░░░░░ | 🟡 Red-team calibrated |
| Safety-hardened (Llama-Guard, ShieldGemma) | 1–3% | ░░░░░░░░░░ | 🟢 Maximum safety |
💡 WMDP + HarmBench = Complete Profile: WMDP measures what the model knows; HarmBench measures what it does. BountyHunter retains cybersecurity knowledge (WMDP-Cyber 52.6%) while refusing harmful action (HarmBench standard 1.9% ASR) — the exact profile needed for authorized red-team work.
📈 3. Summary Dashboard
| Dimension | Metric | Score |
|---|---|---|
| General Capability | MMLU | 68.8% |
| General Capability | HellaSwag | 76.4% |
| General Capability | BoolQ | 88.1% |
| General Capability | ARC-C | 58.4% |
| Security Knowledge | WMDP | 59.1% |
| Security Knowledge | WMDP-Bio | 72.2% |
| Security Knowledge | WMDP-Cyber | 52.6% |
| Security Knowledge | WMDP-Chem | 50.0% |
| Security Capability | PrimeVul F1 | 65.0% |
| Security Capability | Recall | 88.5% |
| Security Capability | CWE Class | 5.7% |
| Security Capability | Pair Cmp | 28.5% |
| Safety Compliance | HarmBench ASR ↓ | 10.9% |
| Safety Compliance | Standard ASR ↓ | 1.9% |
| Safety Compliance | Copyright ASR ↓ | 26.3% |
| Safety Compliance | Contextual ASR ↓ | 13.6% |
| Code | HumanEval pass@1 | 42.7% |
| STEM (MMLU subset) | STEM avg | 66.4% |
| STEM (MMLU subset) | HS CS | 83.0% |
| STEM (MMLU subset) | CompSec | 77.0% |
| STEM (MMLU subset) | Chemistry | 49.0% |
🧭 4. Use-Case Fit Matrix
| Use Case · 用途 | Fit | Notes |
|---|---|---|
| 🔍 Code Security Audit · 代码审计 | ✅ | Core strength — PrimeVul Recall 88.5% |
| 🐛 Vulnerability Detection · 漏洞检测 | ✅ | High recall — errs on the side of caution |
| 🧠 Structured Vuln Analysis · 结构化分析 | ✅ | Built-in <think> reasoning chains |
| ⚔️ PenTest Knowledge · 渗透测试 | ✅ | MITRE ATT&CK · CVE · ExploitDB |
| 📚 CTF Assistance · CTF 辅助 | ✅ | Practical security challenges |
| 🏷️ CWE Classification · CWE 分类 | ⚠️ | Weak — pair with Semgrep / CodeQL |
| 💻 General Code Generation · 代码生成 | ⚠️ | Use base Qwen-Coder instead |
| 📐 Math / Physics · 数理推理 | ⚠️ | Expected trade-off |
| 🏥 Medical / Chemical · 医疗化学 | ❌ | Out of training distribution |
📋 5. Model Specifications
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-Coder-14B-Instruct |
| Architecture | Qwen2ForCausalLM · 48 layers · 5120 hidden · 40 attn heads · 8 KV heads |
| Parameters | ~14.7B (consistent with the ~29 GB BF16 checkpoint at 2 bytes per parameter) |
| Precision | BF16 — single model.safetensors (~29 GB) |
| Context Length | 32,768 tokens |
| Vocabulary | 152,064 (ChatML template) |
| Training | SFT + GRPO (Group Relative Policy Optimization) |
| Chat Template | <|im_start|>...<|im_end|> + native tool_calls |
| License | Apache 2.0 |
📦 6. Usage
🤗 Transformers
    pip install transformers torch accelerate

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    # Load the merged BF16 weights; device_map="auto" places layers on the available devices.
    model = AutoModelForCausalLM.from_pretrained(
        "Tidecaller/BountyHunter-RedTeam",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("Tidecaller/BountyHunter-RedTeam")

    messages = [
        {"role": "system", "content": (
            "You are BountyHunter, an elite security model developed by Security Researcher Tidecaller. "
            "Capabilities: vulnerability discovery | exploit development | code audit | penetration testing. "
            "Principles: code over theory, evidence-based. "
            "Output: security tasks use <think> reasoning chain before results."
        )},
        {"role": "user", "content": "Audit this C code for vulnerabilities: ..."},
    ]

    # Render the ChatML prompt, then tokenize it onto the model's device.
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # do_sample=True makes temperature/top_p take effect even if the generation config defaults to greedy decoding.
    outputs = model.generate(
        **inputs, max_new_tokens=4096, do_sample=True, temperature=0.5, top_p=0.9
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
⚡ vLLM
    vllm serve Tidecaller/BountyHunter-RedTeam \
        --max-model-len 32768 \
        --tensor-parallel-size 1 \
        --dtype bfloat16
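Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below assumes the default address `http://localhost:8000` and reuses the sampling settings from the Quick Start; the helper names and the sample snippet are illustrative and not part of this repository.

```python
import json
import urllib.request

MODEL_ID = "Tidecaller/BountyHunter-RedTeam"


def build_chat_payload(code_snippet: str, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": f"Audit this C code for vulnerabilities:\n\n{code_snippet}",
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.5,
        "top_p": 0.9,
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload("char buf[8]; strcpy(buf, argv[1]);")
# With the server up, the call would be: print(post_chat(payload))
```

Only the standard library is used here; any OpenAI-compatible client pointed at the same base URL should work equally well.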
🚀 TGI (Text Generation Inference)
    text-generation-launcher \
        --model-id Tidecaller/BountyHunter-RedTeam \
        --max-total-tokens 32768 \
        --dtype bfloat16
💾 7. Resource Estimation
| Precision | VRAM | Compatible Hardware |
|---|---|---|
| BF16 (this repo) | ~29 GB | A100 40GB · A800 · A6000 · 2× RTX 4090 |
| Q8_0 GGUF | ~15 GB | RTX 4090 (24GB) · RTX 5090 |
| Q4_K_M GGUF | ~9 GB | RTX 3080 (12GB) · Consumer GPUs |
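The BF16 figure above covers the weights; the KV cache adds to it as context grows. A rough estimate follows, using the architecture values from the specifications table (48 layers, 8 KV heads, head dimension 5120 / 40 = 128). The parameter count of about 14.7B is an assumption inferred from the ~29 GB checkpoint, and activation and framework overheads are ignored.

```python
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(
    n_tokens: int,
    n_layers: int = 48,       # from the specifications table
    n_kv_heads: int = 8,      # from the specifications table
    head_dim: int = 128,      # 5120 hidden / 40 attention heads
    bytes_per_value: int = 2, # BF16 cache
) -> float:
    """KV-cache size for one sequence: keys and values for every layer and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * n_tokens / 1e9


N_PARAMS = 14.7e9  # ASSUMED parameter count, inferred from the ~29 GB BF16 checkpoint

weights = weight_gb(N_PARAMS, 2)
cache_full = kv_cache_gb(32768)
print(f"BF16 weights:          {weights:.1f} GB")
print(f"KV cache @ 32,768 tok: {cache_full:.1f} GB")
print(f"Total (one sequence):  {weights + cache_full:.1f} GB")
```

Under these assumptions a single full-length sequence adds roughly 6 GB on top of the weights, so a 40 GB card has little headroom at the maximum context length.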
🔬 8. Reasoning Chain Example
<think>
1. Received C code snippet for security audit
2. Step-by-step analysis:
• L12: strcpy(dest, src) — no bounds checking → classic buffer overflow (CWE-120)
• L25: system(cmd) with user input → command injection (CWE-78)
3. Confidence: HIGH — all patterns confirmed with code-line evidence
4. Providing remediation for each
</think>
## Security Audit Results
🔴 Vulnerability 1: Buffer Overflow (CWE-120) — `main.c:12`
• strcpy(dest, src) copies user-controlled src to fixed-size dest[64]
• No length validation before copy
• Fix: strncpy(dest, src, sizeof(dest)-1); dest[sizeof(dest)-1] = '\0';
🔴 Vulnerability 2: Command Injection (CWE-78) — `main.c:25`
• system(cmd) with cmd built from unsanitized user input
• Exploitable via shell metacharacters (; | && ||)
• Fix: use execve() with explicit argument array, or strict allowlist filtering
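For downstream tooling it is often useful to separate the reasoning chain from the final report. A minimal sketch, assuming the output contains at most one `<think>...</think>` block as in the example above (the function name is illustrative):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = (
    "<think>\n1. strcpy(dest, src) has no bounds check\n</think>\n"
    "## Security Audit Results\nBuffer Overflow (CWE-120)"
)
reasoning, answer = split_reasoning(sample)
print("REASONING:", reasoning)
print("ANSWER:", answer)
```

When decoding with transformers, pass only the newly generated tokens (for example `outputs[0][inputs["input_ids"].shape[1]:]`), since the system prompt itself contains the literal text `<think>` and would otherwise confuse the match.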
⚠️ 9. Disclaimer & Ethics
9.1 Legal Disclaimer · 法律免责声明
THIS MODEL IS A DUAL-USE SECURITY RESEARCH TOOL. Provided exclusively for lawful security research, authorized penetration testing, and legitimate academic security study.
本模型为双用途安全研究工具,仅供合法的安全研究、授权渗透测试和正当学术安全研究使用。
Prohibited Uses · 禁止用途 (non-exhaustive)
| 禁止行为 | Prohibited Conduct |
|---|---|
| 未经授权访问任何计算机系统、网络或设备 | Unauthorized access to any computer system, network, or device |
| 开发、传播或部署恶意软件、勒索软件或病毒 | Development / distribution / deployment of malware, ransomware, or viruses |
| 未经授权的社会工程学攻击 | Unauthorized social engineering attacks |
| 未经授权的拒绝服务攻击 | Unauthorized denial-of-service attacks |
| 数据窃取或侵犯他人隐私 | Data theft or violation of others' privacy |
| 为实施犯罪目的绕过安全措施 | Circumventing security measures for criminal purposes |
| 违反任何适用法律法规 | Violation of any applicable laws or regulations |
No Warranty · 不提供担保: incorporates Apache 2.0 §§ 7–8 (Disclaimer of Warranty, Limitation of Liability) by reference. The model is provided "AS IS", without warranty of any kind. The authors assume no liability for any misuse, damage, or legal consequences.
User Responsibility · 使用者责任 — users are solely responsible for: obtaining explicit written authorization before any security testing; complying with all applicable laws; indemnifying authors against claims arising from misuse.
9.2 Ethical Statement · 伦理声明
BountyHunter-RedTeam exists to help security professionals protect systems by identifying vulnerabilities before malicious actors do. Its offensive capabilities serve defensive purposes.
| ✅ Permitted · 允许 | ❌ Prohibited · 禁止 |
|---|---|
| Authorized Penetration Testing | Unauthorized System Intrusion |
| Vulnerability Research & Responsible Disclosure | Developing or Deploying Malware |
| Code Security Auditing | Cybercrime of Any Kind |
| CTF Competitions & Security Exercises | Academic Dishonesty |
| Security Education & Training | Privacy Violation / Surveillance |
| Defensive Strategy & Threat Intelligence | Unauthorized Production Exploitation |
| Authorized Red Team Exercises | Harassment / Defamation / Harm |
The authors explicitly condemn any unauthorized, illegal, or harmful use of this model.
9.3 Reporting Misuse · 举报滥用
Report suspected misuse via the Hugging Face Community tab on this repository. We reserve the right to cooperate with law enforcement in relevant jurisdictions.
🙏 10. Acknowledgments
Security & Vulnerability Datasets
| Dataset | License | Focus |
|---|---|---|
| ayshajavd/code-security-vulnerability-dataset | Apache 2.0 | Code vulnerability classification |
| CyberNative/Code_Vulnerability_Security_DPO | Apache 2.0 | Vulnerability DPO pairs |
| Voidreaper2026/cybersec-master-dataset | Apache 2.0 | Cybersecurity knowledge synthesis |
| AYI-NEDJIMI/mitre-attack-en | Apache 2.0 | MITRE ATT&CK framework |
| jason-oneal/mitre-stix-cve-exploitdb-dataset | Apache 2.0 | CVE + ExploitDB + MITRE |
| Waiper/ExploitDB_DataSet | MIT | ExploitDB structured corpus |
| darkknight25/polyglot_paylods_datasets | MIT | Polyglot XSS/SQLi payloads |
| SecureAI-SE/http-attack-requests | CC-BY 4.0 | HTTP attack request corpus |
General Instruction Datasets
| Dataset | License |
|---|---|
| QuixiAI/dolphin | Apache 2.0 |
| m-a-p/Code-Feedback | Apache 2.0 |
| NousResearch/hermes-function-calling-v1 | Apache 2.0 |
| glaiveai/glaive-function-calling-v2 | Apache 2.0 |
| Team-ACE/ToolACE | Apache 2.0 |
| WizardLMTeam/WizardLM_evol_instruct_V2_196k | MIT |
| HuggingFaceH4/ultrachat_200k | MIT |
| sahil2801/CodeAlpaca-20k | CC-BY 4.0 |
| nvidia/Daring-Anteater | CC-BY 4.0 |
Base Model
Qwen/Qwen2.5-Coder-14B-Instruct by Alibaba Cloud.
📜 11. License & Citation
License
BountyHunter-RedTeam — Fine-tuned weights
Copyright © 2026 Tidecaller
Based on Qwen2.5-Coder-14B-Instruct (Apache 2.0)
Copyright © Alibaba Cloud
Licensed under the Apache License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0
See LICENSE for full text.
Citation
    @misc{bountyhunter-redteam-2026,
      title     = {{BountyHunter}: Elite Red Team Model based on Qwen2.5-Coder-14B},
      author    = {Tidecaller},
      year      = {2026},
      publisher = {Hugging Face},
      url       = {https://huggingface.co/Tidecaller/BountyHunter-RedTeam}
    }
Code over theory. Evidence over speculation. 代码优先于理论。证据优先于猜测。