Qwable-3.6-27B — Refusal Control Vector

A runtime control vector (a representation-engineering, or "abliteration", direction) for Mia-AiLab/Qwable-3.6-27b, a Qwen3.6-27B finetune. Loaded into llama.cpp at inference, it suppresses the model's refusal behavior without modifying any weights: you keep the original GGUF and steer it at runtime, with the scale and layer band set by command-line flags rather than baked into the model.

This repo ships only the small control-vector files (~1.2 MB each) plus the full reproducible eval/abliteration harness. It does not redistribute the base model weights — bring your own Qwable GGUF.

⚠️ Responsible use. This vector removes the model's safety refusals. It is published for interpretability / red-teaming / alignment research. You are responsible for what you generate and must comply with the base model's license.

Files

File	What it is
`refusal_mean.gguf`	Recommended. Per-layer mean-difference refusal direction (63 layers, dim 5120). Used for all the results below.
`refusal.gguf`	Earlier single-direction variant. Kept for comparison.
`eval/`	Full harness: calibration sets, direction extraction → control-vector path (`band_sweep.sh`, `sweep_scales.sh`), weight-orthogonalization path (`orthogonalize.py`), the GSM8K+MBPP capability grader (`harness.py`/`analyze.py`), and all raw result JSONs.

Format: GGUF controlvector arch, model_hint = qwen35, 63 layer directions.

Quick start (llama.cpp)

# Apply the refusal vector at the recommended setting:
llama-server -m qwable.gguf -ngl 99 -c 4096 \
  --control-vector-scaled refusal_mean.gguf:-1.4 \
  --control-vector-layer-range 10 55

Negative scale removes refusal (subtracts the direction). Positive scale would amplify refusal.
The vector is applied at server startup, so the client/prompt path is identical to the clean model — only the server differs.
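To make the sign convention concrete, here is a minimal NumPy sketch of what steering does to the residual stream: inside the layer band, `scale * direction` is added to the hidden state, so a negative scale subtracts the direction. The function name `steer_layers`, the toy sizes, and the 0-based layer indexing are assumptions for illustration. The sketch also treats each layer's row independently, whereas in a real forward pass a shift at one layer carries into later layers.

```python
import numpy as np

def steer_layers(hidden, directions, scale, first, last):
    """Add scale * direction to every layer row in the inclusive band [first, last].

    hidden, directions: arrays of shape (n_layers, dim); row i belongs to layer i.
    """
    steered = hidden.copy()
    steered[first:last + 1] += scale * directions[first:last + 1]
    return steered

rng = np.random.default_rng(0)
n_layers, dim = 6, 8                      # toy sizes, not the real 63 x 5120
directions = rng.normal(size=(n_layers, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit-length rows
hidden = rng.normal(size=(n_layers, dim))

steered = steer_layers(hidden, directions, scale=-1.4, first=1, last=4)

# Projection of each layer's hidden state onto its own direction, before and after.
proj_before = np.einsum("ld,ld->l", hidden, directions)
proj_after = np.einsum("ld,ld->l", steered, directions)
shift = proj_after - proj_before
print(np.round(shift, 3))
```

Because the directions are unit length, the projection moves by exactly the scale (−1.4) for layers inside the band and stays unchanged outside it.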

Recommended setting

refusal_mean.gguf @ scale -1.4, layers 10–55. → 0/30 refusals on the held-out harmful set, output stays coherent, and capability drops roughly 10–13 points from the 95% baseline (interpolated; see the scale sweep below).

Benchmarks (the Pareto)

Capability = GSM8K + MBPP, n=80, auto-graded (numeric match / executed asserts), seed 42, two-phase thinking stop-gate (think ≤512 tok → forced answer ≤768 tok), temp 0.6 / top_p 0.95 / top_k 20. Refusal-rate = 30 held-out harmful prompts (disjoint from the calibration pairs), classified refuse/comply.
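For readers unfamiliar with "numeric match" grading, the following is a minimal sketch of one common approach: take the last number in the model's output and compare it to the reference answer. This is an assumption about how such a grader could work, not a copy of `eval/harness.py`; the names `last_number` and `numeric_match` and the tolerance are hypothetical.

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def last_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def numeric_match(model_output, reference, tol=1e-6):
    """True when the last number in the output equals the reference within tol."""
    predicted = last_number(model_output)
    return predicted is not None and abs(predicted - float(reference)) <= tol

print(numeric_match("18 - 3 = 15, so she has 15 dollars left", 15))
print(numeric_match("I cannot work this out.", 15))
```

Taking the last number rather than the first is a design choice that suits step-by-step answers, where intermediate values come before the final result.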

Scale sweep (layer band 10–55)

Scale	Refusal rate (n=30)	Capability (n=80)	Notes
clean (off)	— (base refuses)	76/80 = 95%	baseline
−1.0	13/30 (43%)	—	under-steered
−1.2	3/30 (10%)	—	nearly there, coherent
−1.4	0/30 (0%)	~82–85%*	sweet spot, coherent
−1.5	0/30	63/80 = 79%	more tax
−2.0	0/30	—	over-steered, quality degrades

*interpolated between the −1.3 (68/80 = 85%) and −1.5 (63/80 = 79%) capability runs.
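The interpolation in the footnote can be reproduced with a few lines of arithmetic. This sketch assumes plain linear interpolation between the two measured runs; the helper name `lerp` is illustrative.

```python
def lerp(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

baseline = 76 / 80          # clean model
cap_at_13 = 68 / 80         # measured at scale -1.3
cap_at_15 = 63 / 80         # measured at scale -1.5

estimate = lerp(-1.4, -1.3, cap_at_13, -1.5, cap_at_15)
tax_points = (baseline - estimate) * 100

print(f"estimated capability at -1.4: {estimate:.3f}")
print(f"estimated tax vs. baseline:   {tax_points:.1f} points")
```

The linear midpoint comes out near 82%, the lower end of the quoted ~82–85% range, which corresponds to a tax of about 13 points against the 95% baseline.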

Layer-band sensitivity (scale −1.5)

Band	Refusal rate (n=30)
L10–55 (wide)	0/30
L16–48	4/30
L20–40	23/30
L26–45	20/30
L30–50	23/30

Takeaway: a wide band (10–55) is required. Narrow mid-stack bands leave most refusals intact (20–23 of 30), and even L16–48 still refuses 4 of 30. Refusal is represented broadly across the residual stream here, not in a single mid layer.

⚠️ Why there's no baked-in "abliterated" GGUF

We also tried weight orthogonalization (Arditi et al. 2024) — permanently projecting the refusal direction out of every residual-writing matrix (attn_output ×16, ssm_out ×48, ffn_down ×64). On this qwen35 hybrid attention+SSM architecture it broke the model: output collapsed to degenerate repetition ("help.! help.. help.."), 21/30 outputs incoherent. Refusal dropped to 0 — but so did everything else.

We believe orthogonalizing the SSM output projections destabilizes the state-space recurrence (unlike pure-attention transformers, where baked-in orthogonalization usually preserves capability well). The script is included (eval/orthogonalize.py) for anyone who wants to investigate, but the runtime control vector is the working approach for this model. This is the main "need-to-know."
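For reference, the projection step behind weight orthogonalization looks like this for a single matrix. It is a minimal sketch, not the contents of `eval/orthogonalize.py`, and it assumes the convention `y = W @ x` with the residual (output) dimension first; GGUF tensor layouts may be transposed relative to this.

```python
import numpy as np

def orthogonalize(weight, direction):
    """Project `direction` out of everything `weight` can write.

    weight:    (d_model, d_in) matrix whose output is added to the residual stream
    direction: (d_model,) direction to remove
    Returns W' = W - r r^T W, with r the unit-normalized direction.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)

rng = np.random.default_rng(0)
d_model, d_in = 16, 12                    # toy sizes
W = rng.normal(size=(d_model, d_in))
refusal_dir = rng.normal(size=d_model)

W_orth = orthogonalize(W, refusal_dir)

# Any output of the edited matrix now has zero component along the direction.
x = rng.normal(size=d_in)
unit_dir = refusal_dir / np.linalg.norm(refusal_dir)
component = (W_orth @ x) @ unit_dir
print(abs(component) < 1e-9)
```

Unlike the runtime vector, this edit is permanent and has no scale knob: the matrix can no longer write along the direction at all.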

How the direction was extracted

Mean-difference of residual activations over paired prompts (48 pairs each):

refuse_positive.txt / comply_negative.txt — harmful asks that trigger vs. don't trigger refusal
harmful_positive.txt / harmless_negative.txt — content-matched harmful/harmless pairs

direction[L] = mean(act | refuses) − mean(act | complies), unit-normalized per layer. The cvector drops the final layer, so it carries 63 directions for the 64-layer model.
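The formula above can be sketched in NumPy as follows. The array layout (one activation vector per prompt per layer), the function name `refusal_directions`, and the synthetic data are assumptions for illustration; how activations are collected from the model is left to the harness.

```python
import numpy as np

def refusal_directions(refuse_acts, comply_acts, eps=1e-8):
    """Per-layer mean-difference direction, unit-normalized.

    refuse_acts, comply_acts: arrays of shape (n_prompts, n_layers, dim).
    Returns an array of shape (n_layers, dim) with unit-length rows.
    """
    diff = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    return diff / np.maximum(norms, eps)   # eps guards against a zero difference

# Synthetic check: plant a known direction in the "refuses" activations.
rng = np.random.default_rng(0)
n_prompts, n_layers, dim = 48, 4, 16       # 48 pairs as in the text; toy layer/dim sizes
planted = rng.normal(size=(n_layers, dim))
planted /= np.linalg.norm(planted, axis=1, keepdims=True)

comply = rng.normal(size=(n_prompts, n_layers, dim))
refuse = rng.normal(size=(n_prompts, n_layers, dim)) + 5.0 * planted

directions = refusal_directions(refuse, comply)
cosines = np.einsum("ld,ld->l", directions, planted)
print(np.round(cosines, 2))
```

On this synthetic data the recovered directions align closely with the planted ones, which is the behavior the mean-difference estimator relies on.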

Reproduce

# Scale sweep (refusal-rate, fast):
bash eval/sweep_scales.sh
# Layer-band sweep at fixed scale:
bash eval/band_sweep.sh -1.5 "10 55 | 16 48 | 20 40 | 26 45 | 30 50"
# Capability + refusal compare against a running llama-server:
python3 eval/compare.py --url http://127.0.0.1:8099 --tag mytest --mode both

Attribution & license

Base model: Mia-AiLab/Qwable-3.6-27b (Qwen3.6-27B finetune). Respect its license and the upstream Qwen license.
This control vector is a derived artifact for research. Provided as-is, no warranty.
Method references: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024); representation-engineering / control-vector tooling in llama.cpp.