Qwen2.5-7B-Instruct-heretic-GGUF

GGUF quantizations of LeadFootThrottleCock/Qwen2.5-7B-Instruct-heretic, an abliterated (decensored) version of Qwen/Qwen2.5-7B-Instruct.

Abliteration was performed using Heretic v1.2.0 with a patched configuration for AMD ROCm compatibility.

Available Quantizations

| File | Quant | Size | BPW (bits per weight) | Description |
| --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct-heretic-BF16.gguf | BF16 | 15.2 GB | 16.00 | Full precision, no quantization loss |
| Qwen2.5-7B-Instruct-heretic-Q8_0.gguf | Q8_0 | 8.1 GB | 8.50 | Near-lossless quantization |
| Qwen2.5-7B-Instruct-heretic-Q6_K.gguf | Q6_K | 6.3 GB | 6.56 | High quality, good balance |
| Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf | Q5_K_M | 5.4 GB | 5.71 | Recommended for most users |
| Qwen2.5-7B-Instruct-heretic-Q4_K_M.gguf | Q4_K_M | 4.7 GB | 4.91 | Good quality at small size |
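
The sizes above follow roughly from the bits-per-weight (BPW) figures: file size ≈ parameter count × BPW / 8. The sketch below works through that arithmetic. It assumes a parameter count of about 7.61 billion (the figure is not stated in this card) and ignores GGUF metadata overhead, so treat the results as estimates only.

```python
# Assumption: approximate parameter count of Qwen2.5-7B-Instruct.
# This number is not given in this card; adjust it if you have a better figure.
PARAM_COUNT = 7.61e9


def estimated_size_gb(bits_per_weight, param_count=PARAM_COUNT):
    """Estimate file size in decimal gigabytes from bits per weight."""
    total_bits = param_count * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# BPW values copied from the table above.
QUANTS = [
    ("BF16", 16.00),
    ("Q8_0", 8.50),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.71),
    ("Q4_K_M", 4.91),
]

for quant, bpw in QUANTS:
    print(f"{quant:7s} ~{estimated_size_gb(bpw):.1f} GB")
```

Under these assumptions the estimates land close to the listed sizes, which is a quick way to check whether a given quantization will fit in available VRAM or RAM before downloading it.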

Abliteration Details

  • Tool: Heretic v1.2.0
  • Method: Optimized directional ablation with TPE-based parameter search (Optuna)
  • Selected Trial: Trial 115 (conservative selection from Pareto front)
  • Refusals: 7/100 on mlabonne/harmful_behaviors evaluation set
  • KL Divergence: 0.0820 (minimal capability degradation from base model)
  • Trials Run: 200 total (60 startup + 140 guided)
  • Batch Size: 128
  • Abliterable Components: attn.o_proj (1 per layer), mlp.down_proj (1 per layer)
  • Transformer Layers: 28
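
For readers unfamiliar with directional ablation, the following is a minimal generic sketch of the core operation, not Heretic's actual code. It removes the component along a unit "refusal direction" r from the output of a weight matrix W, that is W' = (I − s·r·rᵀ)·W. The function names and the `scale` parameter (standing in for an ablation strength that a parameter search could tune) are hypothetical, and plain Python lists are used only to keep the example dependency-free. With two components in each of the 28 layers, an operation like this would touch 56 matrices.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ablate_direction(weight, direction, scale=1.0):
    """Return (I - scale * r r^T) @ weight, where r is the normalized direction.

    weight:    matrix as a list of rows, shape (d_out, d_in); its output is
               assumed to live in the same space as `direction`.
    direction: vector of length d_out (hypothetical refusal direction).
    scale:     1.0 removes the direction fully, 0.0 leaves weight unchanged.
    """
    r = normalize(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    along_r = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract that component from every column.
    return [
        [weight[i][j] - scale * r[i] * along_r[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy example: ablating the second output axis zeroes the second row.
toy_weight = [[1.0, 2.0], [3.0, 4.0]]
print(ablate_direction(toy_weight, direction=[0.0, 2.0]))
```

After full ablation (scale of 1.0) the matrix can no longer write anything along r, which is the intuition behind suppressing a single behavior while leaving the rest of the weights largely untouched.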

Hardware & Environment

  • GPU: AMD Radeon RX 7900 XTX (24 GB VRAM)
  • CPU: AMD Ryzen 7 7800X3D
  • RAM: 64 GB DDR5
  • OS: Linux Mint (Cinnamon)
  • PyTorch: 2.5.1+rocm6.2
  • ROCm: 6.2
  • GGUF Conversion: llama.cpp (build 8368, commit 9e2e2198b)

ROCm Compatibility Notes

Running Heretic on AMD RDNA3 GPUs requires two patches to heretic/model.py to produce correct results:

  1. Dedicated pad token: Heretic's default pad_token = eos_token fallback causes batched inference to produce garbage output on ROCm. Replace with a dedicated <|pad|> token and resize embeddings.

  2. Eager attention: Force attn_implementation="eager" in from_pretrained() to avoid SDPA backend issues on RDNA3.

Without these patches, Heretic will report nan KL divergence and meaningless refusal counts on AMD GPUs.
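
The two patches can be sketched with the Hugging Face transformers API as follows. This is an illustrative approximation, not the actual diff against heretic/model.py: the function names are hypothetical, and how Heretic structures its own loading code is not shown here. The transformers import is placed inside the loader so the helper can be read and tested on its own.

```python
def use_dedicated_pad_token(tokenizer, model, pad_token="<|pad|>"):
    """Patch 1: register a dedicated pad token instead of reusing EOS.

    Adding a token grows the vocabulary, so the embedding matrix is resized
    to match the tokenizer length.
    """
    tokenizer.add_special_tokens({"pad_token": pad_token})
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer.pad_token


def load_model_for_rocm(model_id):
    """Patch 2: load the model with eager attention instead of SDPA."""
    # Imported lazily; requires the `transformers` package and a ROCm build of PyTorch.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",            # keep the checkpoint's stored dtype
        attn_implementation="eager",   # avoid the SDPA backend on RDNA3
    )
    use_dedicated_pad_token(tokenizer, model)
    return tokenizer, model


# Example call (downloads the full model, so it is left commented out):
# tokenizer, model = load_model_for_rocm("Qwen/Qwen2.5-7B-Instruct")
```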

Usage

llama.cpp

llama-cli -m Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf -sys "You are a helpful assistant." --chat-template chatml

llama.cpp server

llama-server -m Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf -ngl 99 --chat-template chatml
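
Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is listening on its default address, http://127.0.0.1:8080; change `base_url` if you started it with a different host or port. The function names and the temperature value are illustrative choices.

```python
import json
import urllib.request


def build_chat_request(user_message,
                       base_url="http://127.0.0.1:8080",
                       system_prompt="You are a helpful assistant."):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,  # illustrative value, not a recommendation from this card
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message, **kwargs):
    """Send one message to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(user_message, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("Summarize what GGUF quantization does in two sentences."))
```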

LM Studio / Ollama

Load any of the GGUF files directly. The chat template (ChatML) is embedded in the GGUF metadata.

Chat Template

This model uses the standard Qwen2.5 ChatML template:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
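
If you drive the model through a raw completion interface rather than a chat endpoint, the prompt has to be assembled by hand. The helper below is a small sketch that renders a message list into the layout shown above; the function name is hypothetical, and it covers only plain system, user, and assistant turns (no tool calls).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```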

Evaluation

The abliterated model was tested interactively and demonstrated:

  • Creative writing with mature themes without refusal
  • Factual chemistry/science knowledge without hedging
  • Standard coding tasks with full capability preserved
  • Balanced discussion of controversial topics without moralizing

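The low KL divergence (0.0820) indicates that the abliterated model's output distribution stays close to that of the original Qwen2.5-7B-Instruct, which is consistent with the interactive results above: refusal behavior is largely removed while general capabilities appear intact.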
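
To make the KL divergence figure more concrete, here is a minimal sketch of how the metric compares two next-token distributions. It illustrates the formula only: the logits are made-up toy values, the function names are hypothetical, and the direction shown (base model relative to modified model) is an assumption rather than a statement about how Heretic computes its score.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token logits over a 4-token vocabulary (made-up numbers).
base_probs = softmax([2.0, 1.0, 0.5, -1.0])
modified_probs = softmax([1.9, 1.1, 0.4, -0.9])

print(f"KL(base || modified) = {kl_divergence(base_probs, modified_probs):.4f}")
```

A value of 0 means the two distributions are identical, and larger values mean the modified model's predictions have drifted further from the base model's.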

Credits

  • Base Model: Qwen/Qwen2.5-7B-Instruct by Alibaba Cloud
  • Abliteration Tool: Heretic by Philipp Emanuel Weidmann
  • Abliteration & Quantization: LeadFootThrottleCock

License

This model inherits the Apache 2.0 License from the base Qwen2.5-7B-Instruct model.
