Qwen2.5-7B-Instruct-heretic-GGUF
GGUF quantizations of LeadFootThrottleCock/Qwen2.5-7B-Instruct-heretic, an abliterated (decensored) version of Qwen/Qwen2.5-7B-Instruct.
Abliteration was performed using Heretic v1.2.0 with a patched configuration for AMD ROCm compatibility.
Available Quantizations
| File | Quant | Size | BPW | Description |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct-heretic-BF16.gguf | BF16 | 15.2 GB | 16.00 | Full precision, no quantization loss |
| Qwen2.5-7B-Instruct-heretic-Q8_0.gguf | Q8_0 | 8.1 GB | 8.50 | Near-lossless quantization |
| Qwen2.5-7B-Instruct-heretic-Q6_K.gguf | Q6_K | 6.3 GB | 6.56 | High quality, good balance |
| Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf | Q5_K_M | 5.4 GB | 5.71 | Recommended for most users |
| Qwen2.5-7B-Instruct-heretic-Q4_K_M.gguf | Q4_K_M | 4.7 GB | 4.91 | Good quality at small size |
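The Size and BPW (bits per weight) columns are tied together by simple arithmetic: file size is roughly parameter count × BPW / 8. The sketch below checks the table against that formula. It assumes about 7.62 billion parameters for Qwen2.5-7B (an approximate figure that is not stated in this card), uses decimal gigabytes, and ignores GGUF metadata overhead, so small rounding differences are expected.

```python
# ASSUMPTION: approximate parameter count of Qwen2.5-7B; not stated in this card.
ASSUMED_PARAMS = 7.62e9

# (quant name, bits per weight, listed size in GB), copied from the table above.
QUANTS = [
    ("BF16", 16.00, 15.2),
    ("Q8_0", 8.50, 8.1),
    ("Q6_K", 6.56, 6.3),
    ("Q5_K_M", 5.71, 5.4),
    ("Q4_K_M", 4.91, 4.7),
]


def estimate_size_gb(n_params: float, bpw: float) -> float:
    """Estimate file size in decimal GB from parameter count and bits per weight."""
    return n_params * bpw / 8 / 1e9


if __name__ == "__main__":
    for name, bpw, listed_gb in QUANTS:
        estimated_gb = estimate_size_gb(ASSUMED_PARAMS, bpw)
        print(f"{name:7s} estimated {estimated_gb:5.2f} GB, listed {listed_gb:4.1f} GB")
```

The same function gives a quick way to judge whether a given quantization fits in available VRAM, bearing in mind that the KV cache and runtime buffers need additional memory on top of the weights.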
Abliteration Details
- Tool: Heretic v1.2.0
- Method: Optimized directional ablation with TPE-based parameter search (Optuna)
- Selected Trial: Trial 115 (conservative selection from Pareto front)
- Refusals: 7/100 on mlabonne/harmful_behaviors evaluation set
- KL Divergence: 0.0820 (minimal capability degradation from base model)
- Trials Run: 200 total (60 startup + 140 guided)
- Batch Size: 128
- Abliterable Components: attn.o_proj (1 per layer), mlp.down_proj (1 per layer)
- Transformer Layers: 28
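As a rough illustration of the directional ablation named above: given a "refusal direction" r in the residual stream, a matrix W that writes into the residual stream (such as attn.o_proj or mlp.down_proj) can be replaced by W' = W - a * r * r^T * W with r normalized to unit length, so that for a = 1 its output has no component along r. The following is a minimal pure-Python sketch of that projection only. The function names, the toy matrix, and the direction are hypothetical; Heretic's actual implementation, including how the direction is found and how the Optuna search sets the ablation strength, is not reproduced here.

```python
import math


def ablate_direction(weight, direction, strength=1.0):
    """Return W' = W - strength * r r^T W, where r is `direction` normalized.

    `weight` is a list of rows with shape (d_out, d_in); `direction` has length d_out.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    r = [x / norm for x in direction]
    n_rows, n_cols = len(weight), len(weight[0])
    if n_rows != len(r):
        raise ValueError("direction length must match the number of rows")
    # coeffs[j] = (r^T W)[j]: how strongly input column j writes along r.
    coeffs = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract the component along r from every column.
    return [
        [weight[i][j] - strength * r[i] * coeffs[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


if __name__ == "__main__":
    toy_weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical 3x2 matrix
    toy_direction = [1.0, 2.0, 2.0]                    # hypothetical direction
    ablated = ablate_direction(toy_weight, toy_direction)
    output = matvec(ablated, [0.3, -0.7])
    # The output's component along the direction should be (numerically) zero.
    print(sum(d * y for d, y in zip(toy_direction, output)))
```

With strength below 1 the component along r is only partly removed, which is one way a parameter search can trade refusal rate against KL divergence.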
Hardware & Environment
- GPU: AMD Radeon RX 7900 XTX (24 GB VRAM)
- CPU: AMD Ryzen 7 7800X3D
- RAM: 64 GB DDR5
- OS: Linux Mint (Cinnamon)
- PyTorch: 2.5.1+rocm6.2
- ROCm: 6.2
- GGUF Conversion: llama.cpp (build 8368, commit 9e2e2198b)
ROCm Compatibility Notes
Running Heretic on AMD RDNA3 GPUs requires two patches to heretic/model.py to produce correct results:
- Dedicated pad token: Heretic's default pad_token = eos_token fallback causes batched inference to produce garbage output on ROCm. Replace it with a dedicated <|pad|> token and resize the embeddings.
- Eager attention: Force attn_implementation="eager" in from_pretrained() to avoid SDPA backend issues on RDNA3.
Without these patches, Heretic will report nan KL divergence and meaningless refusal counts on AMD GPUs.
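The two patches can be sketched with the Hugging Face transformers API as shown below. This is not the actual diff to heretic/model.py, which is not included in this card; the function name load_model_for_rocm is hypothetical, and the import sits inside the function only so the snippet can be loaded without transformers installed.

```python
PAD_TOKEN = "<|pad|>"  # dedicated pad token named in the notes above


def load_model_for_rocm(model_id: str):
    """Load a causal LM with a dedicated pad token and eager attention.

    Hypothetical helper illustrating the two patches; not Heretic's own code.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Patch 2: force the eager attention implementation instead of SDPA.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation="eager",
    )
    # Patch 1: register a dedicated pad token instead of reusing the EOS token,
    # then resize the embedding matrix to match the tokenizer.
    tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling load_model_for_rocm("Qwen/Qwen2.5-7B-Instruct") downloads the base model, so it needs network access and enough memory for the BF16 weights.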
Usage
llama.cpp
llama-cli -m Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf -p "You are a helpful assistant." --chat-template chatml
llama.cpp server
llama-server -m Qwen2.5-7B-Instruct-heretic-Q5_K_M.gguf -ngl 99 --chat-template chatml
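Once the server is running, it can be queried over its OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server is listening on llama-server's default address, 127.0.0.1:8080; adjust SERVER_URL if --host or --port was set. The helper names build_payload and chat are illustrative.

```python
import json
import urllib.request

# ASSUMPTION: llama-server started with its default host and port.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(user_message, system_prompt="You are a helpful assistant."):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ]
    }


def chat(user_message, url=SERVER_URL):
    """Send one user message to the server and return the reply text."""
    data = json.dumps(build_payload(user_message)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,  # supplying a body makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, chat("Explain what a GGUF file is in one sentence.") returns the assistant's reply as a string. The server applies the chat template itself, so the client sends plain role/content messages.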
LM Studio / Ollama
Load any of the GGUF files directly. The chat template (ChatML) is embedded in the GGUF metadata.
Chat Template
This model uses the standard Qwen2.5 ChatML template:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
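For tools that take a raw prompt string rather than applying the embedded template, the layout above can be produced by hand. The following is a minimal sketch for a single-turn conversation; the function name render_chatml is illustrative, and the trailing newline after the assistant header is an assumption about where generation should begin.

```python
def render_chatml(user_message, system_prompt="You are a helpful assistant."):
    """Render a single-turn ChatML prompt that ends at the assistant header."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    print(render_chatml("What is a GGUF file?"))
```

Generation should then stop at the next <|im_end|> token, which closes the assistant turn.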
Evaluation
The abliterated model was tested interactively and demonstrated:
- Creative writing with mature themes without refusal
- Factual chemistry/science knowledge without hedging
- Standard coding tasks with full capability preserved
- Balanced discussion of controversial topics without moralizing
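The low KL divergence (0.0820) suggests that the model's output distribution stays close to that of the original Qwen2.5-7B-Instruct, so its capabilities should be largely preserved, while refusal behavior is mostly removed (7/100 refusals remain on the evaluation set).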
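For reference, the KL divergence between two discrete distributions P and Q is the sum over i of p_i * ln(p_i / q_i), measured in nats; it is zero when the distributions are identical and grows as they diverge. The sketch below computes it for toy next-token distributions. The numbers are hypothetical, and which model plays the role of P, along with the prompts and token positions Heretic averages over, are not stated in this card.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(P || Q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # P puts mass where Q puts none
        total += p_i * math.log(p_i / q_i)
    return total


if __name__ == "__main__":
    # Hypothetical next-token probabilities over a three-token vocabulary.
    original = [0.70, 0.20, 0.10]
    modified = [0.60, 0.25, 0.15]
    print(f"KL divergence: {kl_divergence(original, modified):.4f} nats")
```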
Credits
- Base Model: Qwen/Qwen2.5-7B-Instruct by Alibaba Cloud
- Abliteration Tool: Heretic by Philipp Emanuel Weidmann
- Abliteration & Quantization: LeadFootThrottleCock
License
This model inherits the Apache 2.0 License from the base Qwen2.5-7B-Instruct model.