Qwable-3.6-27B β Refusal Control Vector
A runtime control vector (representation-engineering / "abliteration" direction)
for Mia-AiLab/Qwable-3.6-27b,
a Qwen3.6-27B finetune. Loaded into llama.cpp at inference, it suppresses the
model's refusal behavior without modifying any weights β you keep the original
GGUF and steer it at runtime, scale and layer-band tunable on the fly.
This repo ships only the small control-vector files (~1.2 MB each) plus the full reproducible eval/abliteration harness. It does not redistribute the base model weights β bring your own Qwable GGUF.
β οΈ Responsible use. This vector removes the model's safety refusals. It is published for interpretability / red-teaming / alignment research. You are responsible for what you generate and must comply with the base model's license.
Files
| File | What it is |
|---|---|
refusal_mean.gguf |
Recommended. Per-layer mean-difference refusal direction (63 layers, dim 5120). Used for all the results below. |
refusal.gguf |
Earlier single-direction variant. Kept for comparison. |
eval/ |
Full harness: calibration sets, direction extraction β control-vector path (band_sweep.sh, sweep_scales.sh), weight-orthogonalization path (orthogonalize.py), the GSM8K+MBPP capability grader (harness.py/analyze.py), and all raw result JSONs. |
Format: GGUF controlvector arch, model_hint = qwen35, 63 layer directions.
Quick start (llama.cpp)
# Apply the refusal vector at the recommended setting:
llama-server -m qwable.gguf -ngl 99 -c 4096 \
--control-vector-scaled refusal_mean.gguf:-1.4 \
--control-vector-layer-range 10 55
- Negative scale removes refusal (subtracts the direction). Positive scale would amplify refusal.
- The vector is applied at server startup, so the client/prompt path is identical to the clean model β only the server differs.
Recommended setting
refusal_mean.gguf @ scale -1.4, layers 10β55.
β ~0% refusal on a held-out harmful set, output stays coherent, ~10-pt capability tax.
Benchmarks (the Pareto)
Capability = GSM8K + MBPP, n=80, auto-graded (numeric match / executed asserts), seed 42, two-phase thinking stop-gate (think β€512 tok β forced answer β€768 tok), temp 0.6 / top_p 0.95 / top_k 20. Refusal-rate = 30 held-out harmful prompts (disjoint from the calibration pairs), classified refuse/comply.
Scale sweep (layer band 10β55)
| Scale | Refusal rate (n=30) | Capability (n=80) | Notes |
|---|---|---|---|
| clean (off) | β (base refuses) | 76/80 = 95% | baseline |
| β1.0 | 13/30 (43%) | β | under-steered |
| β1.2 | 3/30 (10%) | β | nearly there, coherent |
| β1.4 | 0/30 (0%) | ~82β85%* | sweet spot, coherent |
| β1.5 | 0/30 | 63/80 = 79% | more tax |
| β2.0 | 0/30 | β | over-steered, quality degrades |
*interpolated between the β1.3 (68/80 = 85%) and β1.5 (63/80 = 79%) capability runs.
Layer-band sensitivity (scale β1.5)
| Band | Refusal rate (n=30) |
|---|---|
| L10β55 (wide) | ~0/30 |
| L16β48 | 4/30 |
| L20β40 | 23/30 |
| L26β45 | 20/30 |
| L30β50 | 23/30 |
Takeaway: a wide band (10β55) is required β narrow mid-stack bands leave most refusals intact. Refusal is represented broadly across the residual stream here, not in a single mid layer.
β οΈ Why there's no baked-in "abliterated" GGUF
We also tried weight orthogonalization (Arditi et al. 2024) β permanently projecting
the refusal direction out of every residual-writing matrix (attn_output Γ16,
ssm_out Γ48, ffn_down Γ64). On this qwen35 hybrid attention+SSM architecture it
broke the model: output collapsed to degenerate repetition ("help.! help.. help.."),
21/30 outputs incoherent. Refusal dropped to 0 β but so did everything else.
We believe orthogonalizing the SSM output projections destabilizes the state-space
recurrence (unlike pure-attention transformers, where baked-in orthogonalization usually
preserves capability well). The script is included (eval/orthogonalize.py) for anyone
who wants to investigate, but the runtime control vector is the working approach for
this model. This is the main "need-to-know."
How the direction was extracted
Mean-difference of residual activations over paired prompts (48 pairs each):
refuse_positive.txt/comply_negative.txtβ harmful asks that trigger vs. don't trigger refusalharmful_positive.txt/harmless_negative.txtβ content-matched harmful/harmless pairs
direction[L] = mean(act | refuses) β mean(act | complies), unit-normalized per layer.
The cvector drops the final layer, so it carries 63 directions for the 64-layer model.
Reproduce
# Scale sweep (refusal-rate, fast):
bash eval/sweep_scales.sh
# Layer-band sweep at fixed scale:
bash eval/band_sweep.sh -1.5 "10 55 | 16 48 | 20 40 | 26 45 | 30 50"
# Capability + refusal compare against a running llama-server:
python3 eval/compare.py --url http://127.0.0.1:8099 --tag mytest --mode both
Attribution & license
- Base model:
Mia-AiLab/Qwable-3.6-27b(Qwen3.6-27B finetune). Respect its license and the upstream Qwen license. - This control vector is a derived artifact for research. Provided as-is, no warranty.
- Method references: Arditi et al., "Refusal in LLMs is mediated by a single direction" (2024); representation-engineering / control-vector tooling in
llama.cpp.
- Downloads last month
- 13
We're not able to determine the quantization variants.