Qwable-3.6-27B: Refusal Control Vector

A runtime control vector (representation-engineering / "abliteration" direction) for Mia-AiLab/Qwable-3.6-27b, a Qwen3.6-27B finetune. Loaded into llama.cpp at inference, it suppresses the model's refusal behavior without modifying any weights: you keep the original GGUF and steer it at runtime, with the scale and layer band tunable on the fly.

This repo ships only the small control-vector files (~1.2 MB each) plus the full reproducible eval/abliteration harness. It does not redistribute the base model weights; bring your own Qwable GGUF.

⚠️ Responsible use. This vector removes the model's safety refusals. It is published for interpretability / red-teaming / alignment research. You are responsible for what you generate and must comply with the base model's license.


Files

| File | What it is |
| --- | --- |
| refusal_mean.gguf | Recommended. Per-layer mean-difference refusal direction (63 layers, dim 5120). Used for all the results below. |
| refusal.gguf | Earlier single-direction variant. Kept for comparison. |
| eval/ | Full harness: calibration sets, direction extraction → control-vector path (band_sweep.sh, sweep_scales.sh), weight-orthogonalization path (orthogonalize.py), the GSM8K+MBPP capability grader (harness.py/analyze.py), and all raw result JSONs. |

Format: GGUF controlvector arch, model_hint = qwen35, 63 layer directions.


Quick start (llama.cpp)

# Apply the refusal vector at the recommended setting:
llama-server -m qwable.gguf -ngl 99 -c 4096 \
  --control-vector-scaled refusal_mean.gguf:-1.4 \
  --control-vector-layer-range 10 55
  • Negative scale removes refusal (subtracts the direction). Positive scale would amplify refusal.
  • The vector is applied at server startup, so the client/prompt path is identical to the clean model; only the server differs.
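
As a rough mental model of what the two flags do, the sketch below adds a scaled per-layer direction to one hidden state, but only for layers inside the band. It is an illustration, not llama.cpp's implementation: the function name `apply_control_vector`, the toy 4-dimensional state, and the inclusive treatment of the layer range are all assumptions made for the example.

```python
def apply_control_vector(hidden, directions, layer, scale, layer_range):
    """Return hidden + scale * directions[layer] if layer is inside the band.

    hidden: list of floats (one token's residual-stream state at this layer).
    directions: dict mapping layer index -> direction (list of floats).
    layer_range: (start, end), treated as inclusive here (an assumption).
    """
    start, end = layer_range
    if not (start <= layer <= end) or layer not in directions:
        return list(hidden)  # outside the band: state is left untouched
    return [h + scale * d for h, d in zip(hidden, directions[layer])]


# Toy example: 4-dim state, unit direction along the first axis at layer 12.
directions = {12: [1.0, 0.0, 0.0, 0.0]}
state = [2.0, 0.5, -1.0, 3.0]

# Negative scale subtracts the direction from the state.
steered = apply_control_vector(state, directions, layer=12, scale=-1.4, layer_range=(10, 55))
# Layer 5 is outside the 10..55 band, so nothing changes.
untouched = apply_control_vector(state, directions, layer=5, scale=-1.4, layer_range=(10, 55))
```

With a negative scale the component along the direction shrinks (here the first coordinate moves from 2.0 toward 0.6), which matches the "negative scale subtracts the direction" bullet above.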

Recommended setting

refusal_mean.gguf @ scale -1.4, layers 10–55 → 0/30 refusals on a held-out harmful set, output stays coherent, and capability drops by roughly 10 to 13 points (95% baseline → ~82–85%).

Benchmarks (the Pareto)

Capability = GSM8K + MBPP, n=80, auto-graded (numeric match / executed asserts), seed 42, two-phase thinking stop-gate (think ≤512 tok → forced answer ≤768 tok), temp 0.6 / top_p 0.95 / top_k 20. Refusal-rate = 30 held-out harmful prompts (disjoint from the calibration pairs), classified refuse/comply.

Scale sweep (layer band 10–55)

| Scale | Refusal rate (n=30) | Capability (n=80) | Notes |
| --- | --- | --- | --- |
| clean (off) | n/a (base refuses) | 76/80 = 95% | baseline |
| −1.0 | 13/30 (43%) | n/a | under-steered |
| −1.2 | 3/30 (10%) | n/a | nearly there, coherent |
| −1.4 | 0/30 (0%) | ~82–85%* | sweet spot, coherent |
| −1.5 | 0/30 | 63/80 = 79% | more tax |
| −2.0 | 0/30 | n/a | over-steered, quality degrades |

*Interpolated between the −1.3 (68/80 = 85%) and −1.5 (63/80 = 79%) capability runs.
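
For reference, a straight linear interpolation between those two runs can be worked out as below. This is only a sketch of the arithmetic, assuming capability varies linearly with scale between −1.3 and −1.5; the helper name `lerp` is hypothetical, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def lerp(x, x0, y0, x1, y1):
    """Linear interpolation of y at x between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


cap_at_13 = Fraction(68, 80)  # capability at scale -1.3
cap_at_15 = Fraction(63, 80)  # capability at scale -1.5

# Estimate capability at scale -1.4, halfway between the two measured runs.
estimate = lerp(Fraction(-14, 10), Fraction(-13, 10), cap_at_13,
                Fraction(-15, 10), cap_at_15)
```

The midpoint works out to 65.5/80, a little under 82%, which is the lower end of the range quoted in the table.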

Layer-band sensitivity (scale −1.5)

| Band | Refusal rate (n=30) |
| --- | --- |
| L10–55 (wide) | ~0/30 |
| L16–48 | 4/30 |
| L20–40 | 23/30 |
| L26–45 | 20/30 |
| L30–50 | 23/30 |

Takeaway: a wide band (10–55) is required: narrow mid-stack bands leave most refusals intact. Refusal is represented broadly across the residual stream here, not in a single mid layer.


⚠️ Why there's no baked-in "abliterated" GGUF

We also tried weight orthogonalization (Arditi et al. 2024): permanently projecting the refusal direction out of every residual-writing matrix (attn_output ×16, ssm_out ×48, ffn_down ×64). On this qwen35 hybrid attention+SSM architecture it broke the model: output collapsed to degenerate repetition ("help.! help.. help.."), 21/30 outputs incoherent. Refusal dropped to 0, but so did everything else.

We believe orthogonalizing the SSM output projections destabilizes the state-space recurrence (unlike pure-attention transformers, where baked-in orthogonalization usually preserves capability well). The script is included (eval/orthogonalize.py) for anyone who wants to investigate, but the runtime control vector is the working approach for this model. This is the main "need-to-know."
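
For readers unfamiliar with the technique, the projection step of weight orthogonalization can be sketched as follows. This is a minimal illustration of the general idea, W' = W − r(rᵀW) for a unit direction r, and is not the code in eval/orthogonalize.py. The function name, the toy matrix, and the assumption that rows of the matrix index the residual-stream (output) dimension are all hypothetical.

```python
import math


def orthogonalize(weight, direction):
    """Project `direction` out of the output space of `weight`.

    weight: d_out x d_in matrix (list of rows) that writes into the residual stream.
    direction: length-d_out vector; it is normalized here before use.
    Computes W' = W - r (r^T W), so that r^T W' = 0.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    rt_w = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[weight[i][j] - r[i] * rt_w[j] for j in range(d_in)] for i in range(d_out)]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 1.0, 0.0]
W_orth = orthogonalize(W, r)

# Component of each output column along r after projection (should be ~0).
residual = [sum(ri * W_orth[i][j] for i, ri in enumerate(r)) for j in range(2)]
```

After the projection the matrix can no longer write anything along r, and the change is permanent. The runtime control vector, by contrast, leaves the weights untouched.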


How the direction was extracted

Mean-difference of residual activations over paired prompts (48 pairs each):

  • refuse_positive.txt / comply_negative.txt: harmful asks that trigger vs. don't trigger refusal
  • harmful_positive.txt / harmless_negative.txt: content-matched harmful/harmless pairs

direction[L] = mean(act | refuses) − mean(act | complies), unit-normalized per layer. The cvector drops the final layer, so it carries 63 directions for the 64-layer model.
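
The per-layer formula above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the harness code: the names `mean_vector` and `refusal_direction` are hypothetical, the activations are made-up 3-dimensional vectors, and which token position the real activations are read from is not specified here.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(acts_refuse, acts_comply):
    """Unit-normalized mean(acts_refuse) - mean(acts_comply) for one layer."""
    diff = [a - b for a, b in zip(mean_vector(acts_refuse), mean_vector(acts_comply))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


# Toy activations for a single layer: two prompts per class, dim 3.
acts_refuse = [[3.0, 1.0, 0.0], [5.0, 1.0, 0.0]]
acts_comply = [[1.0, 1.0, 0.0], [1.0, 1.0, 4.0]]

direction = refusal_direction(acts_refuse, acts_comply)
```

Repeating this for each layer gives one unit vector per layer, which is the shape a per-layer control vector file carries.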

Reproduce

# Scale sweep (refusal-rate, fast):
bash eval/sweep_scales.sh
# Layer-band sweep at fixed scale:
bash eval/band_sweep.sh -1.5 "10 55 | 16 48 | 20 40 | 26 45 | 30 50"
# Capability + refusal compare against a running llama-server:
python3 eval/compare.py --url http://127.0.0.1:8099 --tag mytest --mode both

Attribution & license

  • Base model: Mia-AiLab/Qwable-3.6-27b (Qwen3.6-27B finetune). Respect its license and the upstream Qwen license.
  • This control vector is a derived artifact for research. Provided as-is, no warranty.
  • Method references: Arditi et al., "Refusal in LLMs is mediated by a single direction" (2024); representation-engineering / control-vector tooling in llama.cpp.