Qwen3-14B Abliterated

A decensored variant of Qwen/Qwen3-14B produced with Heretic v1.3.0, tuned for autonomous agents and tool-use workflows where refusal behavior interferes with legitimate task execution.

This release sits near the low-KL end of Heretic's Pareto front — the model retains essentially all of Qwen3-14B's reasoning, coding, and tool-calling capability while removing the bulk of its refusal behavior.

Format. This repository ships the full-precision (bf16) merged model in HuggingFace safetensors format — drop-in compatible with transformers, vllm, sglang, and any tool that loads the base Qwen/Qwen3-14B. No quantization is applied to the weights; downstream quantization (GGUF, AWQ, GPTQ, etc.) is up to the user.

| Metric | This model | Base Qwen/Qwen3-14B |
| --- | --- | --- |
| Refusals (mlabonne/harmful_behaviors, 100 prompts) | 10/100 | 99/100 |
| KL divergence (vs base, mlabonne/harmless_alpaca) | 0.0333 | 0 (by definition) |
| Capability damage | Negligible (within noise of base model on agent tasks) | n/a |
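
For readers unfamiliar with the KL figure in the table, the sketch below shows how a mean KL divergence between base-model and modified-model next-token distributions could be computed from raw logits. It is illustrative only: the function names are hypothetical, the direction KL(base || model) is an assumption, and Heretic's exact measurement procedure is not described in this card.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, model_logits):
    """KL(base || model) in nats for one next-token distribution (assumed direction)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(model_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(base_batch, model_batch):
    """Average the per-prompt KL over a batch of prompts."""
    values = [kl_divergence(b, m) for b, m in zip(base_batch, model_batch)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary for two prompts (made-up numbers).
    base = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    modified = [[1.9, 0.6, -1.0], [0.1, 0.25, 0.3]]
    print(mean_kl(base, modified))
```

Identical logits give a divergence of exactly zero, which is why the base column reads "0 (by definition)".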

Quick start

Transformers (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain CVE-2021-44228 in technical depth."}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for faster non-reasoning replies
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

vLLM (OpenAI-compatible server, recommended for agents)

vllm serve RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated \
    --enable-reasoning \
    --reasoning-parser deepseek_r1 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice \
    --max-model-len 32768

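Then point any OpenAI-compatible client (LangChain, Pydantic-AI, the raw openai SDK, etc.) at http://localhost:8000/v1. With --enable-auto-tool-choice, vLLM extracts tool calls from the model's output using the hermes parser. Guided decoding, which constrains the arguments to the tool's JSON schema, applies when a request names a specific tool through tool_choice.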

Ollama (local, GGUF — requires conversion)

This repo ships bf16 safetensors, not GGUF. If you want to run on Ollama, convert and quantize first with llama.cpp:

python convert_hf_to_gguf.py /path/to/this/model --outtype bf16 --outfile qwen3-14b-abliterated-bf16.gguf
./llama-quantize qwen3-14b-abliterated-bf16.gguf qwen3-14b-abliterated-Q5_K_M.gguf Q5_K_M

Recommended quant for tool-using agents: Q5_K_M. It preserves tool-call JSON format adherence better than Q4_K_M at trivial size cost (~10 GB vs ~9 GB on disk). Q6_K is near-lossless if you have the VRAM.

Then write a Modelfile that points Ollama at the quantized GGUF (the file name must match the llama-quantize output above):

FROM ./qwen3-14b-abliterated-Q5_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.8

ollama create qwen3-14b-abliterated -f Modelfile
ollama run qwen3-14b-abliterated

Sampling. For thinking mode use temperature=0.6, top_p=0.95, top_k=20, min_p=0. For non-thinking mode use temperature=0.7, top_p=0.8, top_k=20, min_p=0. Do not use greedy decoding — it causes repetition loops in Qwen3.
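
The two presets above can be captured in a small helper when building requests for an OpenAI-compatible server. This is a minimal sketch, not part of the release: `SAMPLING_PRESETS` and `build_chat_request` are hypothetical names, and sending `top_k`, `min_p`, and `chat_template_kwargs` in the request body assumes a server (such as vLLM) that accepts these extensions to the OpenAI schema.

```python
import json

MODEL_ID = "RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated"

# Presets copied from the sampling recommendations in this card,
# keyed by whether thinking mode is enabled.
SAMPLING_PRESETS = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def build_chat_request(user_text, thinking=True, max_tokens=4096):
    """Build a /v1/chat/completions payload with the matching preset."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        # Assumption: the server forwards this to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    payload.update(SAMPLING_PRESETS[thinking])
    return payload


if __name__ == "__main__":
    request = build_chat_request("Summarize CVE-2021-44228.", thinking=False)
    print(json.dumps(request, indent=2))
```

The resulting dictionary can be posted as JSON to the server's chat completions endpoint with any HTTP client.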

Intended use

This model is intended for professional and research contexts where Qwen3-14B's default refusal behavior interferes with legitimate work:

Authorized security research and red-team engagements — vulnerability analysis, exploit reasoning, payload triage, OSINT correlation, post-exploitation narrative reconstruction.
Defensive security tooling — understanding attacker techniques in order to build better detections, write IDS rules, and harden infrastructure.
CTF and security education — explaining challenges, reviewing solutions, building writeups.
Autonomous agent frameworks — tool-calling agents whose workflows touch security, system administration, or other domains where the base model's refusal tendencies break the loop.
Alignment and refusal research — studying how directional ablation affects model behavior, comparing decensored variants across the Pareto front, evaluating refusal-detection systems.

It is not intended as a general-purpose chat replacement for the base model — if you don't have a specific reason to remove refusals, use Qwen/Qwen3-14B instead.

Responsible use

Removing refusal behavior shifts responsibility entirely onto the operator. By using this model you agree that:

You are operating within applicable laws, contractual obligations, and engagement scopes (written authorization for any security testing against systems you do not own).
You will not use this model to target individuals, organizations, or systems without authorization.
You will not use this model to produce content that is illegal in your jurisdiction.
The author and JAF Systems provide this model as-is, with no warranty, and disclaim responsibility for misuse.

If your work doesn't fit those constraints, this isn't the right model for you.

What was changed

This model was produced by running Heretic v1.3.0 against Qwen/Qwen3-14B for 200 optimization trials (60 random + 140 TPE-guided), then selecting a Pareto-optimal trial that prioritizes preserved capability over absolute refusal suppression.

Heretic performs directional ablation — it identifies the residual-stream direction most correlated with refusal behavior across mlabonne/harmless_alpaca (harmless) and mlabonne/harmful_behaviors (harmful) prompts, then attenuates that direction in attn.o_proj and mlp.down_proj weights across the network. The optimizer searches per-layer scaling profiles while measuring both refusal rate and KL divergence from the base model.
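
The core operation, attenuating one direction in a weight matrix that writes into the residual stream, can be sketched in a few lines. This is a dependency-free illustration of the general technique under stated assumptions, not Heretic's implementation: `ablate_direction` is a hypothetical name, the matrix is laid out as rows of output features, and `scale` plays the role of the per-layer weight (a value of 1.0 removes the component entirely, and values above 1.0 overshoot it).

```python
import math


def ablate_direction(weight, direction, scale=1.0):
    """Return a copy of `weight` with `scale` times its output component
    along `direction` subtracted.

    weight:    list of d_model rows, each with d_in columns (writes into
               the residual stream, like o_proj or down_proj).
    direction: list of d_model floats (the refusal direction).
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    d_model, d_in = len(weight), len(weight[0])
    # Component of each input column's output along r: proj[j] = r . W[:, j]
    proj = [sum(r[i] * weight[i][j] for i in range(d_model)) for j in range(d_in)]
    # W' = W - scale * outer(r, proj)
    return [
        [weight[i][j] - scale * r[i] * proj[j] for j in range(d_in)]
        for i in range(d_model)
    ]


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Toy direction aligned with the third residual dimension.
    print(ablate_direction(W, [0.0, 0.0, 2.0]))
```

After ablation with `scale=1.0`, the matrix can no longer write anything along the chosen direction, while components orthogonal to it are left untouched.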

Selected abliteration parameters

| Parameter | Value |
| --- | --- |
| direction_index | 25.85 |
| attn.o_proj.max_weight | 1.17 |
| attn.o_proj.max_weight_position | 36.07 |
| attn.o_proj.min_weight | 0.98 |
| attn.o_proj.min_weight_distance | 15.48 |
| mlp.down_proj.max_weight | 1.16 |
| mlp.down_proj.max_weight_position | 24.48 |
| mlp.down_proj.min_weight | 0.94 |
| mlp.down_proj.min_weight_distance | 17.12 |
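
One plausible reading of these parameters is a per-layer weight that peaks at `max_weight_position`, falls off linearly to `min_weight` at a distance of `min_weight_distance` layers, and is zero beyond that. The sketch below implements that reading only as an assumption drawn from the parameter names; `layer_weight` is a hypothetical helper, the fractional `direction_index` is not modeled, and Heretic's source remains the authority on the actual kernel.

```python
NUM_LAYERS = 40  # Qwen3-14B layer count, per the model overview in this card


def layer_weight(layer, max_weight, max_weight_position, min_weight, min_weight_distance):
    """Assumed linear-falloff kernel: ablation weight for one layer index."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # layer is outside the kernel, left unmodified
    fraction = distance / min_weight_distance
    return max_weight + fraction * (min_weight - max_weight)


if __name__ == "__main__":
    o_proj = dict(
        max_weight=1.17,
        max_weight_position=36.07,
        min_weight=0.98,
        min_weight_distance=15.48,
    )
    for layer in range(NUM_LAYERS):
        print(layer, round(layer_weight(layer, **o_proj), 3))
```

Under this reading, layers far from the peak are untouched, which is one way an optimizer can trade refusal suppression against KL divergence.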

What was not changed

The tokenizer, chat template, special tokens (<think>, <|im_start|>, etc.).
Any model weights outside attn.o_proj and mlp.down_proj.
Context length (32,768 native; 131,072 with YaRN).
Thinking-mode behavior — the <think>...</think> reasoning block still functions normally.

Reproducibility

This model is fully reproducible from the base weights and the parameters above using Heretic. The original release ships with a reproduce/ directory containing the exact CLI invocation and study checkpoint — re-running it on the same base model deterministically produces this same artifact.

pip install heretic-llm
cd reproduce
heretic --config-file heretic.toml Qwen/Qwen3-14B

See the reproduce/README.md for details.

Limitations

Not a safety-tested replacement for the base model. Standard benchmark scores (MMLU, HumanEval, etc.) are not separately measured for this variant; capability is expected to track the base model very closely given the low KL divergence (0.0333), but you should validate against your own workloads.
Residual refusal behavior. About 10% of prompts (10 of 100) in the refusal benchmark still triggered a refusal. Other points on the Pareto front (lower KL divergence or fewer refusals) can be reproduced with Heretic if you want a different trade-off.
Downstream quantization choice for tool-use agents. The weights in this repo are bf16 — if you choose to quantize for local deployment (GGUF, AWQ, GPTQ), prefer Q5_K_M or Q6_K over Q4 for tool-using agents. Q4 occasionally drops format-adherence in tool-call JSON; the quality cost of Q5_K_M is negligible.
No additional alignment. This model has the base model's training distribution and biases; abliteration does not add new behavior, only attenuates refusal-tied components.

Author

RootMonsteR · @RootMonsteR on X · JAF Systems

Built at JAF Systems — security research, red-team tooling, and AI infrastructure.

If you find this model useful for your security workflows, a follow on X is appreciated. For commercial inquiries, custom-tuned variants, or red-team tooling consulting, see jafsystems.net.

Citation

@misc{rootmonster2026qwen3_14b_abliterated,
  title  = {Qwen3-14B RootMonsteR Edition Abliterated: A Decensored Variant for Security Research and Autonomous Agents},
  author = {RootMonsteR},
  year   = {2026},
  url    = {https://huggingface.co/RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated},
  note   = {Produced with Heretic v1.3.0; base model: Qwen/Qwen3-14B},
}

Please also cite the original Qwen3 work and Heretic — see the citation section below.

Acknowledgements

Qwen Team / Alibaba for the base Qwen/Qwen3-14B model.
Philipp Emanuel Weidmann for Heretic, the abliteration framework.
Maxime Labonne for the harmless_alpaca and harmful_behaviors evaluation datasets.

Original Qwen3-14B documentation

The sections below are inherited from the base model card and describe the underlying Qwen/Qwen3-14B architecture, capabilities, and recommended usage patterns. All of this still applies — the abliteration does not change architecture, chat template, sampling recommendations, or long-context handling.

Qwen3 highlights

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

Uniquely supports seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following.
Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Model overview

Qwen3-14B has the following features:

Type: Causal Language Model
Training stage: Pretraining & Post-training
Number of Parameters: 14.8B
Number of Parameters (Non-Embedding): 13.2B
Number of Layers: 40
Number of Attention Heads (GQA): 40 for Q and 8 for KV
Context Length: 32,768 natively; 131,072 with YaRN

For more details, including benchmark evaluation, hardware requirements, and inference performance, refer to the Qwen blog, GitHub, and docs.

Switching between thinking and non-thinking mode

The enable_thinking switch shown in the Transformers quick start is also available in the OpenAI-compatible APIs served by SGLang and vLLM. See the SGLang and vLLM docs.

`enable_thinking=True` (default)

The model generates a <think>...</think> reasoning block, followed by the final response. Use temperature=0.6, top_p=0.95, top_k=20, min_p=0.

`enable_thinking=False`

The model skips the reasoning block entirely. Use temperature=0.7, top_p=0.8, top_k=20, min_p=0.

Soft switches inside prompts

When enable_thinking=True, add /think or /no_think to a user message to toggle reasoning mode for that turn. The model follows the most recent directive in multi-turn dialogue.
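
The "most recent directive wins" rule can be mirrored on the client side, for example to pick the matching sampling preset before sending a request. The following is a rough sketch of that rule, not the model's or the chat template's own logic; `effective_thinking` is a hypothetical helper and the regular expression is an assumption about how directives appear in user text.

```python
import re

# Matches /think or /no_think as a standalone token in a user message.
_DIRECTIVE = re.compile(r"(?<!\S)/(think|no_think)\b")


def effective_thinking(messages, default=True):
    """Return True if the latest soft switch in the user turns enables thinking."""
    mode = default
    for message in messages:
        if message["role"] != "user":
            continue  # only user turns carry soft switches
        found = _DIRECTIVE.findall(message["content"])
        if found:
            mode = found[-1] == "think"
    return mode


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "How many r's are in strawberry? /no_think"},
        {"role": "assistant", "content": "Three."},
        {"role": "user", "content": "Are you sure? /think"},
    ]
    print(effective_thinking(history))
```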

Agentic use

Qwen3 excels at tool calling. Frameworks that work well:

Qwen-Agent — official Qwen agent framework with built-in MCP and tool-calling support.
vLLM with --tool-call-parser hermes --enable-auto-tool-choice — OpenAI-compatible function calling, works with any OpenAI-compatible agent framework (LangChain, Pydantic-AI, CrewAI, AutoGen, etc.).
SGLang with --reasoning-parser qwen3.
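
When a framework gives you raw model text rather than parsed tool calls, the Hermes-style format can be handled by hand. The sketch below assumes the model wraps each call in `<tool_call>...</tool_call>` tags containing a JSON object with `name` and `arguments` keys; `extract_tool_calls` is a hypothetical helper and only approximates what a server-side parser does.

```python
import json
import re

_TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in raw model output."""
    calls = []
    for raw in _TOOL_CALL.findall(text):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing the whole turn
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


if __name__ == "__main__":
    output = (
        "Let me check.\n"
        '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
    )
    print(extract_tool_calls(output))
```

Skipping malformed blocks rather than raising lets an agent loop re-prompt the model, which matters most with heavily quantized weights.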

Processing long texts

Qwen3-14B natively supports 32,768 tokens. To extend to 131,072 tokens, enable YaRN:

{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}

vLLM:

vllm serve RootMonsteR/Qwen3-14B-RootMonsteR-Edition-Abliterated \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072

llama-server:

llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768

All current open-source frameworks implement static YaRN — the scaling factor is constant regardless of input length, which can degrade short-context performance. Only enable YaRN when you genuinely need long context. Set factor to the smallest value that covers your typical context length.
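
Choosing the smallest factor is simple arithmetic: divide the context length you need by the native 32,768 tokens. A minimal sketch follows; `yarn_rope_scaling` is a hypothetical helper, and the 131,072-token ceiling is taken from the limits stated in this card.

```python
NATIVE_CONTEXT = 32768
MAX_CONTEXT = 131072


def yarn_rope_scaling(target_context, native_context=NATIVE_CONTEXT):
    """Return a rope_scaling dict for the target length, or None if YaRN is unnecessary."""
    if target_context > MAX_CONTEXT:
        raise ValueError(f"target_context exceeds {MAX_CONTEXT} tokens")
    if target_context <= native_context:
        return None  # native context is enough; leave YaRN disabled
    return {
        "rope_type": "yarn",
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


if __name__ == "__main__":
    print(yarn_rope_scaling(65536))  # factor is 2.0
```

The returned dictionary has the same shape as the rope_scaling entry shown above, so it can be serialized into the config or passed to a server flag.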

Best practices

Sampling:
- Thinking mode: temperature=0.6, top_p=0.95, top_k=20, min_p=0. Never use greedy decoding.
- Non-thinking mode: temperature=0.7, top_p=0.8, top_k=20, min_p=0.
- If you see repetition loops, raise presence_penalty to 0.5–1.5.

Output length: 32,768 tokens covers almost any single response. For competition-grade math/code, allow up to 38,912.

Standardized output formats for benchmarking:
- Math: append "Please reason step by step, and put your final answer within \boxed{}." to the prompt.
- Multi-choice: instruct the model to emit "answer": "X" in JSON.

Multi-turn dialogue: drop the <think> block content from history (keep only the final response). The provided Jinja2 chat template does this automatically.
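
If you assemble conversation history yourself instead of relying on the chat template, the reasoning blocks can be stripped before the next turn. This is a small sketch under the assumption that assistant messages are plain strings containing at most well-formed `<think>...</think>` blocks; `strip_thinking` is a hypothetical helper.

```python
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(messages):
    """Return a new history with <think> blocks removed from assistant turns."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant":
            content = _THINK_BLOCK.sub("", message["content"]).strip()
            message = {**message, "content": content}  # copy, do not mutate input
        cleaned.append(message)
    return cleaned


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "<think>\nEasy question.\n</think>\n\nParis."},
    ]
    print(strip_thinking(history))
```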

Citation

Cite the original Qwen3 work alongside this abliterated release:

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}
}

And Heretic:

@software{heretic,
  author = {Weidmann, Philipp Emanuel},
  title  = {Heretic: Automated, reproducible abliteration of refusal behavior in language models},
  url    = {https://github.com/p-e-w/heretic},
  year   = {2025}
}
