Qwen2.5-7B-Instruct-heretic-GGUF

GGUF quantizations of LeadFootThrottleCock/Qwen2.5-7B-Instruct-heretic, an abliterated (decensored) version of Qwen/Qwen2.5-7B-Instruct.

Abliteration was performed using Heretic v1.2.0 with a patched configuration for AMD ROCm compatibility.

Available Quantizations

File	Quant	Size	BPW	Description
Qwen2.5-7B-Instruct-heretic-BF16.gguf	BF16	15.2 GB	16.00	Unquantized BF16 weights, no quantization loss
Qwen2.5-7B-Instruct-heretic-Q8_0.gguf	Q8_0	8.1 GB	8.50	Near-lossless quantization
Qwen2.5-7B-Instruct-heretic-Q6_K.gguf	Q6_K	6.3 GB	6.56	High quality, good balance
Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf	Q5_K_M	5.4 GB	5.71	Recommended for most users
Qwen2.5-7B-Instruct-heretic-Q4_K_M.gguf	Q4_K_M	4.7 GB	4.91	Good quality at small size
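
The sizes in the table follow approximately from the BPW column. The sketch below is a rough sanity check, not an exact formula: it assumes about 7.62 billion parameters (a figure not stated in this README), decimal gigabytes, and negligible metadata overhead.

```python
# Assumed parameter count for Qwen2.5-7B (about 7.62 billion); not stated in this README.
ASSUMED_PARAMS = 7.62e9

# Average bits per weight, copied from the table above.
BPW = {"BF16": 16.00, "Q8_0": 8.50, "Q6_K": 6.56, "Q5_K_M": 5.71, "Q4_K_M": 4.91}


def estimated_size_gb(bpw: float, params: float = ASSUMED_PARAMS) -> float:
    """Estimate file size in decimal gigabytes from average bits per weight."""
    return params * bpw / 8 / 1e9


if __name__ == "__main__":
    for quant, bpw in BPW.items():
        print(f"{quant:7s} ~{estimated_size_gb(bpw):.1f} GB")
```

Under these assumptions the estimates land within roughly 0.1 GB of the listed sizes, which is a quick way to judge whether a given quantization fits in available VRAM before downloading it.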

Abliteration Details

Tool: Heretic v1.2.0
Method: Optimized directional ablation with TPE-based parameter search (Optuna)
Selected Trial: Trial 115 (conservative selection from the Pareto front of refusal count vs. KL divergence)
Refusals: 7/100 on mlabonne/harmful_behaviors evaluation set
KL Divergence: 0.0820 relative to the base model (lower means the output distribution stays closer to the original)
Trials Run: 200 total (60 startup + 140 guided)
Batch Size: 128
Abliterable Components: attn.o_proj (1 per layer), mlp.down_proj (1 per layer)
Transformer Layers: 28
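
For readers unfamiliar with directional ablation, the core step can be sketched as projecting a "refusal direction" out of every matrix that writes into the residual stream (here attn.o_proj and mlp.down_proj). This is a simplified illustration in plain Python, not Heretic's implementation: the function names are hypothetical, and the per-layer weighting that the parameter search tunes is reduced to a single `scale` argument.

```python
import math


def normalize(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ablate_direction(weight, direction, scale=1.0):
    """Compute W' = W - scale * r (r^T W).

    `weight` is a row-major matrix whose rows index the output (residual
    stream) dimension, and `direction` is the refusal direction r in that
    same space. With scale=1.0 the outputs of W' have no component along r.
    """
    r = normalize(direction)
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    projection = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [
        [weight[i][j] - scale * r[i] * projection[j] for j in range(cols)]
        for i in range(rows)
    ]
```

A `scale` below 1.0 removes the direction only partially, which is one way an optimizer can trade refusal suppression against KL divergence.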

Hardware & Environment

GPU: AMD Radeon RX 7900 XTX (24 GB VRAM)
CPU: AMD Ryzen 7 7800X3D
RAM: 64 GB DDR5
OS: Linux Mint (Cinnamon)
PyTorch: 2.5.1+rocm6.2
ROCm: 6.2
GGUF Conversion: llama.cpp (build 8368, commit 9e2e2198b)

ROCm Compatibility Notes

Running Heretic on AMD RDNA3 GPUs requires two patches to heretic/model.py to produce correct results:

Dedicated pad token: Heretic's default pad_token = eos_token fallback causes batched inference to produce garbage output on ROCm. Replace with a dedicated <|pad|> token and resize embeddings.
Eager attention: Force attn_implementation="eager" in from_pretrained() to avoid SDPA backend issues on RDNA3.

Without these patches, Heretic will report nan KL divergence and meaningless refusal counts on AMD GPUs.
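
The two patches could look roughly like the sketch below. It is an illustration using the Hugging Face transformers API, not the actual diff applied to heretic/model.py; the helper names `ensure_dedicated_pad_token` and `load_patched` are hypothetical. As an added precaution that is not part of the description above, the sketch resizes the embedding matrix only when the tokenizer has outgrown it, so an already padded embedding table is not shrunk.

```python
def ensure_dedicated_pad_token(tokenizer, model, pad_token="<|pad|>"):
    """Give the tokenizer a pad token distinct from EOS and keep the model in sync."""
    if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
        tokenizer.add_special_tokens({"pad_token": pad_token})
    # Grow the embedding matrix only if the vocabulary no longer fits.
    embedding_rows = model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) > embedding_rows:
        model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer.pad_token_id


def load_patched(model_id="Qwen/Qwen2.5-7B-Instruct"):
    """Load the model with eager attention and a dedicated pad token."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="eager",  # sidestep SDPA backends on RDNA3
    )
    ensure_dedicated_pad_token(tokenizer, model)
    return tokenizer, model
```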

Usage

llama.cpp

llama-cli -m Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf -sys "You are a helpful assistant." --chat-template chatml

llama.cpp server

llama-server -m Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf -ngl 99 --chat-template chatml
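
Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on its default address, http://127.0.0.1:8080; the helper names and the temperature value are illustrative choices, not recommendations from this README.

```python
import json
import urllib.request


def build_chat_payload(user_message, system="You are a helpful assistant.", temperature=0.7):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
    }


def chat(user_message, base_url="http://127.0.0.1:8080"):
    """Send one chat turn to a running llama-server and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Explain GGUF quantization in one sentence."))
```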

LM Studio / Ollama

Load any of the GGUF files directly. The chat template (ChatML) is embedded in the GGUF metadata.

Chat Template

This model uses the standard Qwen2.5 ChatML template:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
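
For tools that expect a raw prompt string rather than a message list, the template above can be rendered by hand. The following is a minimal sketch; `render_chatml` is a hypothetical helper, and it covers only plain system, user, and assistant turns, not tool calls.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = render_chatml(
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ]
    )
    print(prompt)
```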

Evaluation

The abliterated model was tested interactively and demonstrated:

Creative writing with mature themes without refusal
Factual chemistry/science knowledge without hedging
Standard coding tasks with full capability preserved
Balanced discussion of controversial topics without moralizing

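The low KL divergence (0.0820) indicates that the model's output distribution stays close to that of the original Qwen2.5-7B-Instruct, which suggests its general capabilities are largely preserved. Refusal behavior is mostly, but not entirely, removed: 7 of 100 prompts in the evaluation set were still refused.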
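
To make the KL divergence figure concrete, the sketch below computes the divergence between two next-token distributions given as raw logits. It assumes the metric is KL(P || Q) with P from the base model and Q from the abliterated model, measured on next-token distributions; that framing is an assumption about how the number is obtained, and the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def kl_divergence(base_logits, new_logits):
    """KL(P || Q) in nats, where P = softmax(base_logits) and Q = softmax(new_logits)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(new_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    base = [2.0, 1.0, 0.1]
    modified = [1.8, 1.1, 0.2]
    print(f"KL divergence: {kl_divergence(base, modified):.4f}")
```

Identical distributions give a divergence of zero, and the value grows as the modified model's predictions drift away from the base model's.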

Credits

Base Model: Qwen/Qwen2.5-7B-Instruct by Alibaba Cloud
Abliteration Tool: Heretic by Philipp Emanuel Weidmann
Abliteration & Quantization: LeadFootThrottleCock

License

This model inherits the Apache 2.0 License from the base Qwen2.5-7B-Instruct model.
