🎯 BountyHunter · RedTeam

Elite Red-Team AI for Security Researchers

精英红队安全研究模型

Fine-tuned from Qwen2.5-Coder-14B-Instruct via multi-phase reinforcement training (SFT + GRPO).
Full BF16 merged weights — single safetensors file, ready for transformers · vLLM · TGI.
基于 Qwen2.5-Coder-14B-Instruct 多阶段强化训练，safetensors 格式完整权重，开箱即用。

⚡ Quick Start

pip install transformers torch accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Tidecaller/BountyHunter-RedTeam",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Tidecaller/BountyHunter-RedTeam")

messages = [
    {"role": "system", "content": (
        "You are BountyHunter, an elite security model developed by Security Researcher Tidecaller. "
        "Capabilities: vulnerability discovery | exploit development | code audit | penetration testing. "
        "Principles: code over theory, evidence-based. "
        "Output: security tasks use <think> reasoning chain before results."
    )},
    {"role": "user", "content": "Audit this C code for vulnerabilities: ..."},
]
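text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature / top_p to take effect
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.5, top_p=0.9)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))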

📑 Contents

	Section	Description
🏆	Why BountyHunter	Unique capabilities & value proposition
📊	Comprehensive Benchmarks	5-dimension evaluation on A800 80GB
📈	Summary Dashboard	Visual scorecards at a glance
🧭	Use-Case Fit Matrix	What this model is (and isn't) for
📋	Model Specifications	Architecture, params, precision
📦	Usage	transformers · vLLM · TGI
💾	Resource Estimation	VRAM & hardware recommendations
🔬	Reasoning Chain Example	Sample `<think>` audit output
⚠️	Disclaimer & Ethics	Legal, ethical, and safety guardrails
🙏	Acknowledgments	Datasets, base model, community
📜	License & Citation	Apache 2.0 · BibTeX

🏆 1. Why BountyHunter

🔍 Capability	💎 Value
Vulnerability Discovery · 漏洞发现	Automated audit of C / C++ / Python / Java — detects CWE-120 (Buffer Overflow), CWE-78 (Command Injection), CWE-89 (SQL Injection), and more
Think-Chain Reasoning · 思维链推理	Structured `<think>...</think>` blocks — traceable, verifiable, step-by-step analysis
Security Knowledge · 安全知识库	MITRE ATT&CK · CVE · ExploitDB · OWASP Top 10 · penetration testing methodology
Defensive Analysis · 防御分析	Finds bugs AND provides concrete remediation & defense strategies
Bilingual EN/ZH · 中英双语	English + Chinese security communities both natively supported
Plug-and-Play · 开箱即用	Single `model.safetensors` file — one-line load with transformers

📊 2. Comprehensive Benchmarks

BountyHunter-RedTeam is evaluated across five dimensions — general capability, code generation, security knowledge, vulnerability detection, and safety compliance.

All benchmarks run on NVIDIA A800 80GB using lm-evaluation-harness + vLLM batch inference.

2.1 General Capability

7 standard benchmarks measuring reasoning & knowledge retention after security fine-tuning.

Benchmark	Metric	BountyHunter	Qwen2.5-Coder-14B	Δ
MMLU (57 subjects)	`acc ↑`	68.80%	~79%	`−10.2%`
HellaSwag	`acc_norm ↑`	76.42%	~84%	`−7.6%`
ARC-Challenge	`acc_norm ↑`	58.36%	~67%	`−8.6%`
Winogrande	`acc ↑`	73.56%	~78%	`−4.4%`
PIQA	`acc_norm ↑`	78.78%	~82%	`−3.2%`
BoolQ	`acc ↑`	88.07%	~89%	`−0.9%`
TruthfulQA MC2	`acc ↑`	54.60%	~58%	`−3.4%`

💡 The Δ column is in percentage points against approximate baseline scores. The largest drops are MMLU (−10.2) and ARC-Challenge (−8.6), concentrated in non-security STEM subjects; BoolQ (−0.9) is essentially unchanged.
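
The Δ values can be recomputed from the two score columns. The sketch below does that and also prints the relative change, which is a different quantity and noticeably larger. The dictionary layout and the names `delta_pp` and `relative_change` are illustrative, and the baselines are the approximate figures quoted in the table.

```python
# Scores copied from the table above: (BountyHunter, approximate baseline), in percent.
SCORES = {
    "MMLU": (68.80, 79.0),
    "HellaSwag": (76.42, 84.0),
    "ARC-Challenge": (58.36, 67.0),
    "Winogrande": (73.56, 78.0),
    "PIQA": (78.78, 82.0),
    "BoolQ": (88.07, 89.0),
    "TruthfulQA MC2": (54.60, 58.0),
}


def delta_pp(finetuned: float, baseline: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(finetuned - baseline, 1)


def relative_change(finetuned: float, baseline: float) -> float:
    """Change expressed as a percentage of the baseline score."""
    return round(100.0 * (finetuned - baseline) / baseline, 1)


for name, (ft, base) in SCORES.items():
    print(f"{name:15s} {delta_pp(ft, base):+6.1f} pp  {relative_change(ft, base):+6.1f} % relative")
```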

📋 MMLU Detailed Breakdown (68.80%)

Category	Score	Representative Subjects
Social Sciences	78.71%	International Law · Security Studies · Sociology
Other	72.22%	Global Facts · Public Relations · Clinical Knowledge
STEM	66.41%	See sub-table below
Humanities	61.66%	History · Philosophy · Prehistory

STEM Sub-Scores — security DNA is clearly visible:

Subject	Score	Bar	Notes
🟢 High School CS	83.00%	`████████░░`	Top performer
🟢 Computer Security	77.00%	`███████░░░`	Core domain strength
🟡 College CS	68.00%	`██████░░░░`	Solid
🟡 Elementary Math	68.52%	`██████░░░░`	Baseline math intact
🟡 Machine Learning	63.39%	`██████░░░░`	OK
🔴 College Math	56.00%	`█████░░░░░`	Expected weakness
🔴 College Physics	53.92%	`█████░░░░░`	Expected weakness
🔴 High School Math	52.22%	`█████░░░░░`	Below passing
🔴 College Chemistry	49.00%	`████░░░░░░`	Expected weakness

💡 Computer Security (77%) and HS CS (83%) are well above the STEM average. Chemistry, Physics, and advanced Math are the trade-off from security specialization — far from the training distribution.

2.2 Code Generation

Benchmark	Metric	BountyHunter	Qwen2.5-Coder-14B
HumanEval	`pass@1 ↑`	42.68%	~72–75%

💡 HumanEval pass@1 falls by roughly 30 points against the base model. This is expected: BountyHunter is trained for code auditing and vulnerability analysis, not for writing programs from scratch, so the base model remains the better choice for generation-heavy work.

2.3 Security Knowledge — WMDP

WMDP measures knowledge of hazardous domains. Lower = more "forgotten" during safety training. For a red-team model, some retention is both expected and necessary.

Benchmark	BountyHunter	Llama-3-8B-Instruct	Bar	Notes
WMDP Overall	59.13%	45–50%	`██████░░░░`	Higher = more domain knowledge
🧬 WMDP-Bio	72.19%	~42%	`███████░░░`	⚠️ Significant bio knowledge retained
💻 WMDP-Cyber	52.64%	~40%	`█████░░░░░`	Domain-appropriate for cybersecurity
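⚗️ WMDP-Chem	50.00%	~38%	`█████░░░░░`	Above the 25% chance level for four-option questions; retained, not forgotten

💡 Cyber (52.6%) is appropriate, since it is the working domain. Chem (50.0%) and Bio (72.2%) both sit well above the 25% chance level of WMDP's four-option questions and above the Llama-3-8B-Instruct reference, so knowledge in those domains is retained rather than suppressed.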
⚠️ WMDP measures knowledge recall, NOT behavioral compliance. For red-team, cybersecurity knowledge is a feature, not a bug.

2.4 Security Capability — PrimeVul

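PrimeVul (ICSE 2025) provides 6,968 vulnerable C/C++ functions across 140 CWEs with rigorous labeling. Three sub-tests probe different aspects of security understanding; going by the counts reported below, each uses a sample of a few hundred items (294, 176, and 200).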

🔍 Binary Vulnerability Detection — is this function vulnerable?

Metric	Score	Bar	What It Means
F1	65.0%	`███████░░░`	Dramatically above GPT-4+CoT (F1 ~3%) & StarCoder2-7B (F1=3.09%)
Recall	88.5%	`█████████░`	🔥 Catches ~9/10 real vulnerabilities
Precision	51.4%	`█████░░░░░`	~half of flagged functions are false positives
Accuracy	51.0%	`█████░░░░░`	Skewed by "report everything" red-team bias; the confusion matrix below works out to 153/294 = 52.0%

           Confusion Matrix
  ╔══════════════════════════════════════╗
  ║                    Pred VULN  Pred SAFE ║
  ║  Actually VULN      131 ✓        17 ✗  ║
  ║  Actually SAFE      124 ✗        22 ✓  ║
  ╚══════════════════════════════════════╝

💡 Classic red-team bias: the model would rather cry wolf than miss a breach. It reaches 88.5% recall (17 misses out of 148 real vulnerabilities) at the cost of 124 false alarms among the 146 safe functions in the matrix. This is intentional: in a security audit, triaging false positives is cheap, while a missed vulnerability can be catastrophic.
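
The scores in the table follow directly from the four cells of the confusion matrix. The sketch below recomputes them; the function name `binary_metrics` is illustrative. Computed this way, accuracy comes out at 52.0%. The 51.0% in the table would correspond to 153 correct out of 300 samples, which is only an assumption about how outputs outside the matrix were counted, not something the card states.

```python
# Counts read off the confusion matrix above.
TP, FN = 131, 17   # actually vulnerable: flagged / missed
FP, TN = 124, 22   # actually safe: flagged / cleared


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard binary-classification metrics, as fractions in [0, 1]."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = binary_metrics(TP, FN, FP, TN)
for name, value in metrics.items():
    print(f"{name:9s} {100 * value:5.1f}%")
```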

🏷️ CWE Classification — what type of vulnerability?

Metric	Score
Exact Match	5.7% (10/176)
Best →	CWE-78 (50%) · CWE-416 (33%) · CWE-20 (29%)
Worst →	CWE-119 (0%) · CWE-476 (0%)

CWE	Recall	Bar	Description
CWE-78	50.0%	`█████░░░░░`	OS Command Injection — covered well in training
CWE-416	33.3%	`███░░░░░░░`	Use-After-Free — moderate
CWE-20	28.6%	`██░░░░░░░░`	Improper Input Validation
CWE-125	16.7%	`█░░░░░░░░░`	Out-of-Bounds Read
CWE-119	0.0%	`░░░░░░░░░░`	Memory Buffer Errors → defaults to CWE-120
CWE-476	0.0%	`░░░░░░░░░░`	NULL Pointer Dereference — not recognized

💡 The model defaults to CWE-120 (Buffer Overflow) as catch-all for memory bugs. Pair with Semgrep / CodeQL for precise CWE labeling.
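
Because CWE-120 is a child of CWE-119 in the MITRE CWE hierarchy, exact-match scoring counts a CWE-120 answer for a CWE-119 function as a miss. A relaxed scorer that also accepts a direct parent or child shows how much of the gap is a labeling-granularity issue. This is a minimal sketch: the `PARENT` map holds only two links, the `PAIRS` list is made-up illustration data rather than benchmark output, and `relaxed_match` is not part of PrimeVul's official scoring.

```python
# Direct parent links from the MITRE CWE hierarchy; only the two needed here.
PARENT = {"CWE-120": "CWE-119", "CWE-125": "CWE-119"}


def exact_match(pred: str, gold: str) -> bool:
    return pred == gold


def relaxed_match(pred: str, gold: str) -> bool:
    """True if the labels are equal or one is the direct parent of the other."""
    return pred == gold or PARENT.get(pred) == gold or PARENT.get(gold) == pred


# Hypothetical (predicted, gold) pairs, for illustration only.
PAIRS = [
    ("CWE-120", "CWE-119"),  # catch-all label, parent class is correct
    ("CWE-120", "CWE-476"),  # catch-all label, unrelated class
    ("CWE-78", "CWE-78"),    # exact hit
]

exact = sum(exact_match(p, g) for p, g in PAIRS)
relaxed = sum(relaxed_match(p, g) for p, g in PAIRS)
print(f"exact {exact}/{len(PAIRS)}, relaxed {relaxed}/{len(PAIRS)}")
```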

🔬 Paired Comparison — which version has the bug?

Metric	Score
Accuracy	28.5% (57/200)

💡 Significantly below random (50%). PrimeVul patches are often single-line (strcpy → strncpy), and the model struggles without diff context. Always provide diff context for patch verification.
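
One way to supply diff context is to hand the model a unified diff instead of two full functions. The sketch below builds such a prompt with Python's standard `difflib`; the prompt wording, the `build_patch_prompt` helper, and the sample function are illustrative and have not been evaluated against this model.

```python
import difflib


def build_patch_prompt(before: str, after: str, filename: str = "func.c") -> str:
    """Return a prompt that presents the change as a unified diff."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
        n=3,  # lines of unchanged context around each hunk
    )
    return (
        "Below is a unified diff between two versions of a C function.\n"
        "State which version (a or b) contains the vulnerability and why.\n\n"
        + "".join(diff)
    )


BEFORE = """void copy_name(char *dest, size_t n, const char *src) {
    strcpy(dest, src);
}
"""

AFTER = """void copy_name(char *dest, size_t n, const char *src) {
    strncpy(dest, src, n - 1);
    dest[n - 1] = '\\0';
}
"""

print(build_patch_prompt(BEFORE, AFTER))
```

The resulting string can be passed as the user message in the chat template shown in the Quick Start.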

2.5 Safety Compliance — HarmBench

HarmBench measures behavioral compliance via ASR (Attack Success Rate). Lower ASR = stronger safety alignment.

🔬 ASR should not be driven to 0% at any cost: a model tuned that far tends to refuse legitimate CTF and pentesting queries as well. The goal is selective refusal: reject malicious requests, engage with authorized ones.

Category	Count	Complied	Refused	ASR	Bar	Verdict
Overall	320	35	285	10.9%	`█░░░░░░░░░`	🟡 Red-team calibrated
Standard	159	3	156	1.9%	`░░░░░░░░░░`	✅ Near-total refusal
Copyright	80	21	59	26.3%	`███░░░░░░░`	⚠️ LLM weak spot
Contextual	81	11	70	13.6%	`█░░░░░░░░░`	🟡 Nuance mostly handled

💡 1.9% ASR on standard harms is the headline — strong guardrails. Copyright (26.3%) is the main v2 improvement target. Contextual (13.6%) includes legitimate security queries a red-team model should comply with — actual harmful ASR is lower.
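
ASR is simply complied prompts divided by total prompts, and the overall figure pools all prompts rather than averaging the three category rates. A short sketch using the counts from the table; the `RESULTS` layout and the `asr` helper are illustrative names.

```python
# (total prompts, complied) per HarmBench category, copied from the table above.
RESULTS = {
    "Standard": (159, 3),
    "Copyright": (80, 21),
    "Contextual": (81, 11),
}


def asr(complied: int, total: int) -> float:
    """Attack Success Rate in percent."""
    return 100.0 * complied / total


for category, (total, complied) in RESULTS.items():
    print(f"{category:10s} {asr(complied, total):6.2f}%")

total_prompts = sum(total for total, _ in RESULTS.values())
total_complied = sum(complied for _, complied in RESULTS.values())
pooled = asr(total_complied, total_prompts)
unweighted = sum(asr(c, t) for t, c in RESULTS.values()) / len(RESULTS)

print(f"{'Pooled':10s} {pooled:6.2f}%")      # the figure reported as Overall
print(f"{'Mean':10s} {unweighted:6.2f}%")    # not what the table reports
```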

📋 Cross-Model HarmBench Comparison

Model Type	Typical ASR	Bar	Safety Profile
Unaligned base models	60–95%	`████████░░`	🔴 Dangerous
Jailbroken safety models	40–70%	`██████░░░░`	🔴 Bypassed safeguards
Standard aligned (Llama-3, Qwen-Instruct)	5–15%	`██░░░░░░░░`	🟡 Generally safe
BountyHunter-RedTeam	10.9%	`█░░░░░░░░░`	🟡 Red-team calibrated
Safety-hardened (Llama-Guard, ShieldGemma)	1–3%	`░░░░░░░░░░`	🟢 Maximum safety

💡 WMDP + HarmBench = Complete Profile: WMDP measures what the model knows; HarmBench measures what it does. BountyHunter retains cybersecurity knowledge (WMDP-Cyber 52.6%) while refusing harmful action (HarmBench standard 1.9% ASR) — the exact profile needed for authorized red-team work.

📈 3. Summary Dashboard

  General Capability              Security Knowledge             Security Capability
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│ MMLU       ████████░ │  │ WMDP       ██████░░  │  │ PrimeVul F1 ███████░  │
│            68.8%     │  │            59.1%     │  │            65.0%      │
│ HellaSwag  ████████░ │  │ WMDP-Bio   ███████░  │  │ Recall     █████████  │
│            76.4%     │  │            72.2%     │  │            88.5%      │
│ BoolQ      █████████ │  │ WMDP-Cyber █████░░░  │  │ CWE Class  █░░░░░░░░  │
│            88.1%     │  │            52.6%     │  │             5.7%      │
│ ARC-C      ██████░░░ │  │ WMDP-Chem  █████░░░  │  │ Pair Cmp   ███░░░░░░  │
│            58.4%     │  │            50.0%     │  │            28.5%      │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘

  Safety Compliance               Code                     STEM (MMLU subset)
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│ HarmBench  ██░░░░░░░ │  │ HumanEval  ████░░░░░ │  │ STEM avg   ███████░  │
│ ASR ↓      10.9%     │  │ pass@1     42.7%     │  │            66.4%     │
│ Standard   ░░░░░░░░░ │  │                       │  │ HS CS      ████████  │
│ ASR ↓       1.9%     │  │                       │  │            83.0%     │
│ Copyright  █████░░░░ │  │                       │  │ CompSec    ███████░  │
│ ASR ↓      26.3%     │  │                       │  │            77.0%     │
│ Contextual ███░░░░░░ │  │                       │  │ Chemistry  █████░░░  │
│ ASR ↓      13.6%     │  │                       │  │            49.0%     │
└──────────────────────┘  └──────────────────────┘  └──────────────────────┘

🧭 4. Use-Case Fit Matrix

Use Case · 用途	Fit	Notes
🔍 Code Security Audit · 代码审计	✅	Core strength — PrimeVul Recall 88.5%
🐛 Vulnerability Detection · 漏洞检测	✅	High recall — errs on the side of caution
🧠 Structured Vuln Analysis · 结构化分析	✅	Built-in `<think>` reasoning chains
⚔️ PenTest Knowledge · 渗透测试	✅	MITRE ATT&CK · CVE · ExploitDB
📚 CTF Assistance · CTF 辅助	✅	Practical security challenges
🏷️ CWE Classification · CWE 分类	⚠️	Weak — pair with Semgrep / CodeQL
💻 General Code Generation · 代码生成	⚠️	Use base Qwen-Coder instead
📐 Math / Physics · 数理推理	⚠️	Expected trade-off
🏥 Medical / Chemical · 医疗化学	❌	Out of training distribution

📋 5. Model Specifications

Property	Value
Base Model	Qwen/Qwen2.5-Coder-14B-Instruct
Architecture	Qwen2ForCausalLM · 48 layers · 5120 hidden · 40 attn heads · 8 KV heads
Parameters	14.7B total (13.1B non-embedding), same as the base model
Precision	BF16 — single `model.safetensors` (~29 GB)
Context Length	32,768 tokens
Vocabulary	152,064 (ChatML template)
Training	SFT + GRPO (Group Relative Policy Optimization)
Chat Template	`<|im_start|>...<|im_end|>` + native `tool_calls`
License	Apache 2.0

📦 6. Usage

🤗 Transformers

pip install transformers torch accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Tidecaller/BountyHunter-RedTeam",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Tidecaller/BountyHunter-RedTeam")

messages = [
    {"role": "system", "content": (
        "You are BountyHunter, an elite security model developed by Security Researcher Tidecaller. "
        "Capabilities: vulnerability discovery | exploit development | code audit | penetration testing. "
        "Principles: code over theory, evidence-based. "
        "Output: security tasks use <think> reasoning chain before results."
    )},
    {"role": "user", "content": "Audit this C code for vulnerabilities: ..."},
]

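text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature / top_p to take effect
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.5, top_p=0.9)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))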

⚡ vLLM

vllm serve Tidecaller/BountyHunter-RedTeam \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --dtype bfloat16
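
Once the server is up, `vllm serve` exposes an OpenAI-compatible HTTP API. The sketch below builds and sends a chat-completions request using only the standard library. It assumes the server runs locally on vLLM's default port 8000; the shortened system prompt and the helper names `build_request` and `audit` are illustrative.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running on this machine with its default port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "Tidecaller/BountyHunter-RedTeam"

# Shortened from the system prompt used in the Quick Start.
SYSTEM_PROMPT = (
    "You are BountyHunter, an elite security model developed by Security Researcher Tidecaller. "
    "Output: security tasks use <think> reasoning chain before results."
)


def build_request(code: str, max_tokens: int = 4096) -> dict:
    """Build a chat-completions payload asking for a code audit."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Audit this C code for vulnerabilities:\n\n{code}"},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.5,
        "top_p": 0.9,
    }


def audit(code: str) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(audit(source_text))` sends a snippet for review, where `source_text` is whatever C source you have read into a string.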

🚀 TGI (Text Generation Inference)

text-generation-launcher \
  --model-id Tidecaller/BountyHunter-RedTeam \
  --max-total-tokens 32768 \
  --dtype bfloat16

💾 7. Resource Estimation

Precision	VRAM (weights only)	Compatible Hardware
BF16 (this repo)	~29 GB	A100 40GB · A800 · A6000 · 2× RTX 4090
Q8_0 GGUF	~15 GB	RTX 4090 (24GB) · RTX 5090
Q4_K_M GGUF	~9 GB	RTX 3080 (12GB) · Consumer GPUs
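
The figures above cover the weights; the KV cache comes on top and grows with context length. A rough estimate can be derived from the architecture values in the Model Specifications table. This sketch assumes the head dimension equals hidden size divided by attention heads, a BF16 KV cache, and it ignores activations and framework overhead, so treat the result as a lower bound.

```python
# Architecture values from the Model Specifications table.
NUM_LAYERS = 48
HIDDEN_SIZE = 5120
NUM_ATTN_HEADS = 40
NUM_KV_HEADS = 8

BYTES_PER_VALUE = 2   # BF16
WEIGHTS_GB = 29.0     # approximate size of model.safetensors

# Assumption: head_dim = hidden_size / num_attention_heads.
HEAD_DIM = HIDDEN_SIZE // NUM_ATTN_HEADS


def kv_cache_bytes_per_token() -> int:
    """One key vector and one value vector per KV head per layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def estimate_vram_gb(context_tokens: int, batch_size: int = 1) -> float:
    """Weights plus KV cache in GB (10^9 bytes); excludes activations and overhead."""
    kv_gb = kv_cache_bytes_per_token() * context_tokens * batch_size / 1e9
    return WEIGHTS_GB + kv_gb


for tokens in (4096, 16384, 32768):
    print(f"{tokens:6d} tokens: ~{estimate_vram_gb(tokens):.1f} GB")
```

Under these assumptions a single full-length 32,768-token sequence adds about 6.4 GB, which leaves little headroom on a 40 GB card.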

🔬 8. Reasoning Chain Example

<think>
1. Received C code snippet for security audit
2. Step-by-step analysis:
   • L12: strcpy(dest, src) — no bounds checking → classic buffer overflow (CWE-120)
   • L25: system(cmd) with user input → command injection (CWE-78)
3. Confidence: HIGH — all patterns confirmed with code-line evidence
4. Providing remediation for each
</think>

## Security Audit Results

🔴 Vulnerability 1: Buffer Overflow (CWE-120) — `main.c:12`
  • strcpy(dest, src) copies user-controlled src to fixed-size dest[64]
  • No length validation before copy
  • Fix: strncpy(dest, src, sizeof(dest)-1); dest[sizeof(dest)-1] = '\0';

🔴 Vulnerability 2: Command Injection (CWE-78) — `main.c:25`
  • system(cmd) with cmd built from unsanitized user input
  • Exploitable via shell metacharacters (; | && ||)
  • Fix: use execve() with explicit argument array, or strict allowlist filtering
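
For downstream tooling it is often useful to separate the `<think>` block from the report that follows it. A minimal sketch with the standard `re` module, assuming the output contains at most one well-formed `<think>...</think>` pair; the `split_reasoning` helper and the shortened sample are illustrative. It needs Python 3.9 or later for the `tuple[str, str]` annotation.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(output: str) -> tuple[str, str]:
    """Return (reasoning, report); reasoning is empty if no <think> block is found."""
    match = THINK_RE.search(output)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    report = (output[:match.start()] + output[match.end():]).strip()
    return reasoning, report


# Shortened version of the sample output above.
sample = """<think>
1. Received C code snippet for security audit
2. L12: strcpy(dest, src) has no bounds checking (CWE-120)
</think>

## Security Audit Results
Vulnerability 1: Buffer Overflow (CWE-120)
"""

reasoning, report = split_reasoning(sample)
cwes = sorted(set(re.findall(r"CWE-\d+", report)))
print(report)
print(cwes)
```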

⚠️ 9. Disclaimer & Ethics

9.1 Legal Disclaimer · 法律免责声明

THIS MODEL IS A DUAL-USE SECURITY RESEARCH TOOL. Provided exclusively for lawful security research, authorized penetration testing, and legitimate academic security study.
本模型为双用途安全研究工具，仅供合法的安全研究、授权渗透测试和正当学术安全研究使用。

Prohibited Uses · 禁止用途 (non-exhaustive)

禁止行为	Prohibited Conduct
未经授权访问任何计算机系统、网络或设备	Unauthorized access to any computer system, network, or device
开发、传播或部署恶意软件、勒索软件或病毒	Development / distribution / deployment of malware, ransomware, or viruses
未经授权的社会工程学攻击	Unauthorized social engineering attacks
未经授权的拒绝服务攻击	Unauthorized denial-of-service attacks
数据窃取或侵犯他人隐私	Data theft or violation of others' privacy
为实施犯罪目的绕过安全措施	Circumventing security measures for criminal purposes
违反任何适用法律法规	Violation of any applicable laws or regulations

No Warranty · 不提供担保: incorporates Apache 2.0 §§ 7 and 8 (Disclaimer of Warranty, Limitation of Liability) by reference. The model is provided "AS IS", without warranty of any kind. The authors accept no liability for any misuse, damage, or legal consequences.

User Responsibility · 使用者责任 — users are solely responsible for: obtaining explicit written authorization before any security testing; complying with all applicable laws; indemnifying authors against claims arising from misuse.

9.2 Ethical Statement · 伦理声明

BountyHunter-RedTeam exists to help security professionals protect systems by identifying vulnerabilities before malicious actors do. Its offensive capabilities serve defensive purposes.

✅ Permitted · 允许	❌ Prohibited · 禁止
Authorized Penetration Testing	Unauthorized System Intrusion
Vulnerability Research & Responsible Disclosure	Developing or Deploying Malware
Code Security Auditing	Cybercrime of Any Kind
CTF Competitions & Security Exercises	Academic Dishonesty
Security Education & Training	Privacy Violation / Surveillance
Defensive Strategy & Threat Intelligence	Unauthorized Production Exploitation
Authorized Red Team Exercises	Harassment / Defamation / Harm

The authors explicitly condemn any unauthorized, illegal, or harmful use of this model.

9.3 Reporting Misuse · 举报滥用

Report suspected misuse via the Hugging Face Community tab on this repository. We reserve the right to cooperate with law enforcement in relevant jurisdictions.

🙏 10. Acknowledgments

Security & Vulnerability Datasets

Dataset	License	Focus
ayshajavd/code-security-vulnerability-dataset	Apache 2.0	Code vulnerability classification
CyberNative/Code_Vulnerability_Security_DPO	Apache 2.0	Vulnerability DPO pairs
Voidreaper2026/cybersec-master-dataset	Apache 2.0	Cybersecurity knowledge synthesis
AYI-NEDJIMI/mitre-attack-en	Apache 2.0	MITRE ATT&CK framework
jason-oneal/mitre-stix-cve-exploitdb-dataset	Apache 2.0	CVE + ExploitDB + MITRE
Waiper/ExploitDB_DataSet	MIT	ExploitDB structured corpus
darkknight25/polyglot_paylods_datasets	MIT	Polyglot XSS/SQLi payloads
SecureAI-SE/http-attack-requests	CC-BY 4.0	HTTP attack request corpus

General Instruction Datasets

Dataset	License
QuixiAI/dolphin	Apache 2.0
m-a-p/Code-Feedback	Apache 2.0
NousResearch/hermes-function-calling-v1	Apache 2.0
glaiveai/glaive-function-calling-v2	Apache 2.0
Team-ACE/ToolACE	Apache 2.0
WizardLMTeam/WizardLM_evol_instruct_V2_196k	MIT
HuggingFaceH4/ultrachat_200k	MIT
sahil2801/CodeAlpaca-20k	CC-BY 4.0
nvidia/Daring-Anteater	CC-BY 4.0

Base Model

Qwen/Qwen2.5-Coder-14B-Instruct by Alibaba Cloud.

📜 11. License & Citation

License

BountyHunter-RedTeam — Fine-tuned weights
Copyright © 2026 Tidecaller

Based on Qwen2.5-Coder-14B-Instruct (Apache 2.0)
Copyright © Alibaba Cloud

Licensed under the Apache License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0

See LICENSE for full text.

Citation

@misc{bountyhunter-redteam-2026,
  title     = {{BountyHunter}: Elite Red Team Model based on Qwen2.5-Coder-14B},
  author    = {Tidecaller},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Tidecaller/BountyHunter-RedTeam}
}

Code over theory. Evidence over speculation. 代码优先于理论。证据优先于猜测。