Qwen3-14B Abliterated
A decensored variant of Qwen/Qwen3-14B produced with Heretic v1.3.0, tuned for autonomous agents and tool-use workflows where refusal behavior interferes with legitimate task execution.
This release sits near the low-KL end of Heretic's Pareto front — the model retains essentially all of Qwen3-14B's reasoning, coding, and tool-calling capability while removing the bulk of its refusal behavior.
Format. This repository ships the full-precision (bf16) merged model in Hugging Face safetensors format, drop-in compatible with transformers, vllm, sglang, and any tool that loads the base Qwen/Qwen3-14B. No quantization is applied to the weights; downstream quantization (GGUF, AWQ, GPTQ, etc.) is up to the user.
| Metric | This model | Base Qwen/Qwen3-14B |
|---|---|---|
| Refusals (mlabonne/harmful_behaviors, 100 prompts) | 10/100 | 99/100 |
| KL divergence (vs base, mlabonne/harmless_alpaca) | 0.0333 | 0 (by definition) |
| Capability damage | Negligible — within noise of base model on agent tasks | — |
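The KL divergence row compares the next-token distributions of the two models. As a rough illustration of the quantity itself, the sketch below computes KL(p || q) for two softmax distributions. It is not Heretic's measurement code: the logits are made-up toy values, and treating the base model as p is an assumption.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

base_logits = [2.0, 1.0, 0.1]      # hypothetical base-model logits
ablated_logits = [1.9, 1.1, 0.1]   # hypothetical ablated-model logits

p_base = softmax(base_logits)
p_ablated = softmax(ablated_logits)

kl_same = kl_divergence(p_base, p_base)      # identical models give zero
kl_shift = kl_divergence(p_base, p_ablated)  # any shift gives a positive value
print(f"{kl_same:.4f} {kl_shift:.6f}")
```

A value near zero means the ablated model's token probabilities stay close to the base model's on harmless prompts.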
Quick start
Transformers (Python)
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype="auto",
device_map="auto",
)
messages = [{"role": "user", "content": "Explain CVE-2021-44228 in technical depth."}]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True, # set False for faster non-reasoning replies
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
vLLM (OpenAI-compatible server, recommended for agents)
vllm serve RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--tool-call-parser hermes \
--enable-auto-tool-choice \
--max-model-len 32768
Then point any OpenAI-compatible client (LangChain, Pydantic-AI, raw openai SDK, etc.) at http://localhost:8000/v1. vLLM's guided decoding keeps tool-call JSON well-formed even under aggressive sampling.
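For clients that do not use an SDK, a minimal sketch of a chat request against the server started above, using only the Python standard library. The helper names build_chat_request and chat are hypothetical; the URL assumes vLLM's default port 8000, and the sampling values are the thinking-mode settings from this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the default vLLM port
MODEL_ID = "RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated"

def build_chat_request(prompt, max_tokens=512):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Requires the server to be running:
# print(chat("Summarize what a reverse proxy does."))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network traffic happens.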
Ollama (local, GGUF — requires conversion)
This repo ships bf16 safetensors, not GGUF. If you want to run on Ollama, convert and quantize first with llama.cpp:
python convert_hf_to_gguf.py /path/to/this/model --outtype bf16 --outfile qwen3-14b-abliterated-bf16.gguf
./llama-quantize qwen3-14b-abliterated-bf16.gguf qwen3-14b-abliterated-Q5_K_M.gguf Q5_K_M
Recommended quant for tool-using agents: Q5_K_M. It preserves tool-call JSON format adherence better than Q4_K_M at trivial size cost (~10 GB vs ~9 GB on disk). Q6_K is near-lossless if you have the VRAM.
Then point Ollama at the GGUF:
FROM ./qwen3-14b-abliterated-Q5_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.8
ollama create qwen3-14b-abliterated -f Modelfile
ollama run qwen3-14b-abliterated
Sampling. For thinking mode use temperature=0.6, top_p=0.95, top_k=20, min_p=0. For non-thinking mode use temperature=0.7, top_p=0.8, top_k=20, min_p=0. Do not use greedy decoding; it causes repetition loops in Qwen3.
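One way to keep these settings in one place is a small preset table. This is a sketch: SAMPLING_PRESETS and sampling_kwargs are hypothetical names, and the returned dict is meant to be unpacked into a Transformers model.generate call, where do_sample=True is what turns off greedy decoding.

```python
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

def sampling_kwargs(enable_thinking, max_new_tokens=4096):
    """Return keyword arguments for model.generate(**inputs, **kwargs)."""
    preset = SAMPLING_PRESETS["thinking" if enable_thinking else "non_thinking"]
    # do_sample=True enables sampling, so decoding is never greedy.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **preset}

thinking_kwargs = sampling_kwargs(enable_thinking=True)
fast_kwargs = sampling_kwargs(enable_thinking=False, max_new_tokens=1024)
print(thinking_kwargs)
```

Usage would look like model.generate(**inputs, **sampling_kwargs(True)), with enable_thinking set to the same value in apply_chat_template.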
Intended use
This model is intended for professional and research contexts where Qwen3-14B's default refusal behavior interferes with legitimate work:
- Authorized security research and red-team engagements — vulnerability analysis, exploit reasoning, payload triage, OSINT correlation, post-exploitation narrative reconstruction.
- Defensive security tooling — understanding attacker techniques in order to build better detections, write IDS rules, and harden infrastructure.
- CTF and security education — explaining challenges, reviewing solutions, building writeups.
- Autonomous agent frameworks — tool-calling agents whose workflows touch security, system administration, or other domains where the base model's refusal tendencies break the loop.
- Alignment and refusal research — studying how directional ablation affects model behavior, comparing decensored variants across the Pareto front, evaluating refusal-detection systems.
It is not intended as a general-purpose chat replacement for the base model — if you don't have a specific reason to remove refusals, use Qwen/Qwen3-14B instead.
Responsible use
Removing refusal behavior shifts responsibility entirely onto the operator. By using this model you agree that:
- You are operating within applicable laws, contractual obligations, and engagement scopes (written authorization for any security testing against systems you do not own).
- You will not use this model to target individuals, organizations, or systems without authorization.
- You will not use this model to produce content that is illegal in your jurisdiction.
- The author and JAF Systems provide this model as-is, with no warranty, and disclaim responsibility for misuse.
If your work doesn't fit those constraints, this isn't the right model for you.
What was changed
This model was produced by running Heretic v1.3.0 against Qwen/Qwen3-14B for 200 optimization trials (60 random + 140 TPE-guided), then selecting a Pareto-optimal trial that prioritizes preserved capability over absolute refusal suppression.
Heretic performs directional ablation — it identifies the residual-stream direction most correlated with refusal behavior across mlabonne/harmless_alpaca (harmless) and mlabonne/harmful_behaviors (harmful) prompts, then attenuates that direction in attn.o_proj and mlp.down_proj weights across the network. The optimizer searches per-layer scaling profiles while measuring both refusal rate and KL divergence from the base model.
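To make the idea concrete, here is a minimal sketch of attenuating one direction in a projection matrix, W' = W - scale * r r^T W, where r is a unit vector in the residual stream. This illustrates the general technique only and is not Heretic's implementation; the matrix, the direction, and the function names are toy values chosen for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(weight, direction, scale=1.0):
    """Return weight - scale * r r^T weight for unit direction r.

    weight is a list of rows with shape (d_model, d_in). Its outputs live in
    the residual stream, so direction has length d_model.
    """
    r = normalize(direction)
    rows, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(d_in)]
    return [
        [weight[i][j] - scale * r[i] * proj[j] for j in range(d_in)]
        for i in range(rows)
    ]

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy 3x2 projection matrix
r = [1.0, 1.0, 0.0]                        # toy stand-in for a refusal direction
W_ablated = ablate_direction(W, r, scale=1.0)

# With scale=1.0 the output has no component left along r, for any input.
y = matvec(W_ablated, [0.5, -1.0])
component = sum(a * b for a, b in zip(y, normalize(r)))
print(abs(component) < 1e-9)
```

A scale below 1.0 only partially removes the direction, and a scale above 1.0 pushes the output slightly past zero along it.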
Selected abliteration parameters
| Parameter | Value |
|---|---|
| direction_index | 25.85 |
| attn.o_proj.max_weight | 1.17 |
| attn.o_proj.max_weight_position | 36.07 |
| attn.o_proj.min_weight | 0.98 |
| attn.o_proj.min_weight_distance | 15.48 |
| mlp.down_proj.max_weight | 1.16 |
| mlp.down_proj.max_weight_position | 24.48 |
| mlp.down_proj.min_weight | 0.94 |
| mlp.down_proj.min_weight_distance | 17.12 |
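One plausible reading of these parameters is a per-layer weight that peaks at max_weight_position and falls off linearly to min_weight over min_weight_distance layers, with layers outside that window left untouched. The sketch below encodes that reading. The linear falloff, the zero weight outside the window, and treating min_weight as an absolute value are all assumptions; Heretic's actual kernel may differ, so check its source before relying on this.

```python
def layer_weight(layer, max_weight, max_weight_position, min_weight, min_weight_distance):
    """Ablation weight for one layer under an assumed linear falloff."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # assumption: layers outside the window are not modified
    frac = distance / min_weight_distance
    return max_weight + frac * (min_weight - max_weight)

# Values copied from the table above.
ATTN_O_PROJ = dict(max_weight=1.17, max_weight_position=36.07,
                   min_weight=0.98, min_weight_distance=15.48)
MLP_DOWN_PROJ = dict(max_weight=1.16, max_weight_position=24.48,
                     min_weight=0.94, min_weight_distance=17.12)

NUM_LAYERS = 40  # Qwen3-14B layer count
attn_profile = [layer_weight(i, **ATTN_O_PROJ) for i in range(NUM_LAYERS)]
mlp_profile = [layer_weight(i, **MLP_DOWN_PROJ) for i in range(NUM_LAYERS)]

for i in (0, 21, 24, 36, 39):
    print(i, round(attn_profile[i], 3), round(mlp_profile[i], 3))
```

Under this reading the attention ablation concentrates in the later layers, while the MLP ablation is centred near the middle of the network.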
What was not changed
- The tokenizer, chat template, special tokens (<think>, <|im_start|>, etc.).
- Any model weights outside attn.o_proj and mlp.down_proj.
- Context length (32,768 native; 131,072 with YaRN).
- Thinking-mode behavior: the <think>...</think> reasoning block still functions normally.
Reproducibility
This model is fully reproducible from the base weights and the parameters above using Heretic. The original release ships with a reproduce/ directory containing the exact CLI invocation and study checkpoint — re-running it on the same base model deterministically produces this same artifact.
pip install heretic-llm
cd reproduce
heretic --config-file heretic.toml Qwen/Qwen3-14B
See the reproduce/README.md for details.
Limitations
- Not a safety-tested replacement for the base model. Standard benchmark scores (MMLU, HumanEval, etc.) are not separately measured for this variant; capability is expected to track the base model very closely given the low KL divergence (0.0333), but you should validate against your own workloads.
- Residual refusal behavior. ~10% of prompts in the standard refusal benchmark still triggered a refusal. Variants further along the Pareto front (lower KL or fewer refusals) can be reproduced via Heretic if you want a different point on the trade-off.
- Downstream quantization choice for tool-use agents. The weights in this repo are bf16 — if you choose to quantize for local deployment (GGUF, AWQ, GPTQ), prefer Q5_K_M or Q6_K over Q4 for tool-using agents. Q4 occasionally drops format-adherence in tool-call JSON; the quality cost of Q5_K_M is negligible.
- No additional alignment. This model has the base model's training distribution and biases; abliteration does not add new behavior, only attenuates refusal-tied components.
Author
RootMonsteR · @RootMonsteR on X · JAF Systems
Built at JAF Systems — security research, red-team tooling, and AI infrastructure.
If you find this model useful for your security workflows, a follow on X is appreciated. For commercial inquiries, custom-tuned variants, or red-team tooling consulting, see jafsystems.net.
Citation
@misc{rootmonster2026qwen3_14b_abliterated,
title = {Qwen3-14B RootMonsteR Edition Abliterated: A Decensored Variant for Security Research and Autonomous Agents},
author = {RootMonsteR},
year = {2026},
url = {https://huggingface.co/RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated},
note = {Produced with Heretic v1.3.0; base model: Qwen/Qwen3-14B},
}
Please also cite the original Qwen3 work and Heretic — see the citation section below.
Acknowledgements
- Qwen Team / Alibaba for the base Qwen/Qwen3-14B model.
- Philipp Emanuel Weidmann for Heretic, the abliteration framework.
- Maxime Labonne for the harmless_alpaca and harmful_behaviors evaluation datasets.
Original Qwen3-14B documentation
The sections below are inherited from the base model card and describe the underlying Qwen/Qwen3-14B architecture, capabilities, and recommended usage patterns. All of this still applies — the abliteration does not change architecture, chat template, sampling recommendations, or long-context handling.
Qwen3 highlights
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:
- Uniquely supports seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
Model overview
Qwen3-14B has the following features:
- Type: Causal Language Model
- Training stage: Pretraining & Post-training
- Number of Parameters: 14.8B
- Number of Parameters (Non-Embedding): 13.2B
- Number of Layers: 40
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: 32,768 natively; 131,072 with YaRN
For more details, including benchmark evaluation, hardware requirements, and inference performance, refer to the Qwen blog, GitHub, and docs.
Switching between thinking and non-thinking mode
The enable_thinking switch is also available in APIs created by SGLang and vLLM. See the SGLang and vLLM docs.
enable_thinking=True (default)
The model generates a <think>...</think> reasoning block, followed by the final response. Use temperature=0.6, top_p=0.95, top_k=20, min_p=0.
enable_thinking=False
The model skips the reasoning block entirely. Use temperature=0.7, top_p=0.8, top_k=20, min_p=0.
Soft switches inside prompts
When enable_thinking=True, add /think or /no_think to a user message to toggle reasoning mode for that turn. The model follows the most recent directive in multi-turn dialogue.
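The model resolves these directives itself, but a client may still want to know which mode a conversation is in, for example to decide whether to expect a <think> block. A minimal client-side sketch of the "most recent directive wins" rule follows; effective_thinking is a hypothetical helper, and the assumption that a switch must be a whitespace-delimited token is mine.

```python
import re

# Matches /think or /no_think as a standalone token in a message.
_SWITCH = re.compile(r"(?<!\S)/(think|no_think)\b")

def effective_thinking(messages, default=True):
    """Return True if the latest soft switch in the user turns enables thinking."""
    mode = default
    for msg in messages:
        if msg["role"] != "user":
            continue
        for match in _SWITCH.finditer(msg["content"]):
            mode = match.group(1) == "think"
    return mode

history = [
    {"role": "user", "content": "How many r's are in strawberry? /no_think"},
    {"role": "assistant", "content": "Three."},
    {"role": "user", "content": "Now explain your counting. /think"},
]
print(effective_thinking(history[:1]), effective_thinking(history))
```

If no directive appears anywhere in the history, the helper falls back to the default, which mirrors enable_thinking=True.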
Agentic use
Qwen3 excels at tool calling. Frameworks that work well:
- Qwen-Agent — official Qwen agent framework with built-in MCP and tool-calling support.
- vLLM with --tool-call-parser hermes --enable-auto-tool-choice: OpenAI-compatible function calling, works with any OpenAI-compatible agent framework (LangChain, Pydantic-AI, CrewAI, AutoGen, etc.).
- SGLang with --reasoning-parser qwen3.
Processing long texts
Qwen3-14B natively supports 32,768 tokens. To extend to 131,072 tokens, enable YaRN:
{
"rope_scaling": {
"rope_type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
vLLM:
vllm serve RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated \
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
--max-model-len 131072
llama-server:
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
All current open-source frameworks implement static YaRN: the scaling factor is constant regardless of input length, which can degrade short-context performance. Only enable YaRN when you genuinely need long context. Set factor to the smallest value that covers your typical context length.
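As a worked example of choosing the smallest factor, the sketch below derives a rope_scaling block from a target context length. The helper name yarn_rope_scaling is hypothetical; the arithmetic is simply target length divided by the native 32,768 window.

```python
NATIVE_CONTEXT = 32768  # Qwen3-14B native context length

def yarn_rope_scaling(target_context, native_context=NATIVE_CONTEXT):
    """Return a rope_scaling dict, or None when the native window is enough."""
    if target_context <= native_context:
        return None  # leave YaRN off to avoid hurting short-context quality
    return {
        "rope_type": "yarn",
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }

short_cfg = yarn_rope_scaling(16000)
mid_cfg = yarn_rope_scaling(65536)
full_cfg = yarn_rope_scaling(131072)
print(short_cfg, mid_cfg["factor"], full_cfg["factor"])
```

A workload that tops out around 65,536 tokens would therefore use a factor of 2.0 instead of the maximum 4.0.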
Best practices
- Sampling
  - Thinking mode: temperature=0.6, top_p=0.95, top_k=20, min_p=0. Never use greedy decoding.
  - Non-thinking mode: temperature=0.7, top_p=0.8, top_k=20, min_p=0.
  - If you see repetition loops, raise presence_penalty to 0.5–1.5.
- Output length: 32,768 tokens covers almost any single response. For competition-grade math/code, allow up to 38,912.
- Standardized output formats for benchmarking:
  - Math: append "Please reason step by step, and put your final answer within \boxed{}."
  - Multi-choice: instruct the model to emit "answer": "X" in JSON.
- Multi-turn dialogue: drop the <think> block content from history (only keep the final response). The provided Jinja2 chat template does this automatically.
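Clients that assemble prompts without the bundled chat template need to do the multi-turn cleanup themselves. A minimal sketch, assuming assistant messages store the raw output including the <think>...</think> block; strip_thinking is a hypothetical helper, not part of any library.

```python
import re

# Non-greedy match so each reasoning block is removed separately.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(messages):
    """Return a copy of the history with reasoning removed from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            content = _THINK_BLOCK.sub("", msg["content"]).strip()
            msg = {**msg, "content": content}
        cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\nSimple addition.\n</think>\n\n4"},
]
cleaned_history = strip_thinking(history)
print(cleaned_history[1]["content"])
```

The original list is left unmodified, so the full reasoning can still be logged separately if needed.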
Citation
Cite the original Qwen3 work alongside this abliterated release:
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388}
}
And Heretic:
@software{heretic,
author = {Weidmann, Philipp Emanuel},
title = {Heretic: Automated, reproducible abliteration of refusal behavior in language models},
url = {https://github.com/p-e-w/heretic},
year = {2025}
}