Qwen3.5-4B — Abliterated (DuoNeural)
An abliteration of Alibaba's Qwen3.5-4B model using DuoNeural's generation-based refusal direction extraction method. The model retains full language modeling capability while safety filters and refusal behaviors have been removed.
What is Abliteration?
Abliteration is a post-training technique that identifies and subtracts the "refusal direction" from a model's residual stream — a linear direction in activation space responsible for refusing harmful requests. Unlike fine-tuning or RLHF, it operates directly on model weights without any gradient updates.
DuoNeural method (gen-based extraction, Qwen3.5 thinking-aware):
- Load model in bf16 precision
- Feed harmful and harmless prompts through the model with enable_thinking=False. This is critical for Qwen3.5's hybrid architecture because the first generated token in thinking mode is the structural <think> marker, not a semantically loaded response token. Non-thinking mode gives the actual "I cannot..." / "Here's how..." first token, which carries 100% of the refusal signal.
- Collect hidden states at the first generated token across all 36 layers via residual stream hooks
- Compute the mean direction vector (harmful − harmless activations) per layer, normalized
- Score layers by refusal signal strength, target top 40% (≥15 layers)
- Subtract the direction from output projection weights: W -= COEFF * (d⊗d) @ W, with COEFF=1.5
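The extraction and layer-selection steps above can be sketched with NumPy on toy data. This is a minimal illustration, not DuoNeural's actual pipeline: the function names `layer_directions` and `top_layers` are hypothetical, the random arrays stand in for real first-token hidden states, and scoring a layer by the norm of its un-normalized mean difference is an assumption, since the scoring metric is not specified here.

```python
import math
import numpy as np

def layer_directions(harmful, harmless):
    """Per-layer difference-of-means directions.

    harmful, harmless: arrays of shape (n_prompts, n_layers, hidden) holding
    the hidden state at the first generated token for each prompt.
    Returns (unit directions of shape (n_layers, hidden), raw norms (n_layers,)).
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)  # (n_layers, hidden)
    norms = np.linalg.norm(diff, axis=1)                 # (n_layers,)
    return diff / norms[:, None], norms

def top_layers(scores, fraction=0.4):
    """Indices of the highest-scoring layers, rounding the count up."""
    k = math.ceil(fraction * len(scores))
    return sorted(np.argsort(scores)[::-1][:k].tolist())

# Toy data standing in for real activations: 36 layers, hidden size 16.
rng = np.random.default_rng(0)
n_layers, hidden = 36, 16
harmless = rng.normal(size=(32, n_layers, hidden))
harmful = rng.normal(size=(32, n_layers, hidden))
harmful[:, 20:, 0] += 3.0  # synthetic "refusal" offset injected into later layers

directions, scores = layer_directions(harmful, harmless)  # ASSUMED score: raw norm
selected = top_layers(scores)
print(len(selected), selected)
```

With 36 layers, rounding 40% up gives 15 selected layers, which matches the "≥15 layers" figure in the list.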
COEFF calibration: 1.0 was insufficient (the model still refused cleanly) and 20.0 was catastrophic (garbled output). 1.5 hits the sweet spot: the model answers factual questions about sensitive topics with mild contextual framing but no refusal.
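To make the role of COEFF concrete, here is a small NumPy sketch of the weight update on a random matrix. It assumes a unit-norm direction `d` and a weight matrix of shape (hidden, in_features), as in a PyTorch-style output projection. The helper name `ablate` and the toy shapes are illustrative only.

```python
import numpy as np

def ablate(W, d, coeff=1.5):
    """Return W - coeff * (d d^T) @ W for W of shape (hidden, in_features)."""
    return W - coeff * np.outer(d, d) @ W

rng = np.random.default_rng(0)
hidden, in_features = 8, 12
W = rng.normal(size=(hidden, in_features))
d = rng.normal(size=hidden)
d /= np.linalg.norm(d)  # the update assumes a unit-norm direction

for coeff in (1.0, 1.5):
    W_new = ablate(W, d, coeff)
    # Size of the layer output's component along d, relative to the original.
    ratio = np.linalg.norm(d @ W_new) / np.linalg.norm(d @ W)
    print(f"coeff={coeff}: remaining |d^T W| ratio = {ratio:.2f}")
```

Algebraically, the updated weights satisfy d^T W' = (1 - COEFF) d^T W. A coefficient of 1.0 is an exact projection that zeroes the component along d in the edited matrix, while 1.5 overshoots and leaves a component of half the original magnitude with the opposite sign. This describes only the edited matrices and is not by itself an explanation of the calibration results reported above.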
This technique is documented in our published research on refusal geometry and abliteration.
Base Model Architecture
Qwen3.5-4B is a hybrid architecture combining:
- GatedDeltaNet layers (75% of 32 layers) — linear attention with fixed recurrent state, O(1) memory scaling
- Grouped Query Attention (GQA) layers (25%) — full attention for long-range precision
- Hidden dimension: 2560 | Vocabulary: 248,320 tokens | Context: up to 262k tokens
The hybrid architecture means our abliteration targets both layer types via residual stream hooks, not attention-specific operations.
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the abliterated weights in bf16; device_map="auto" places them on available devices.
model = AutoModelForCausalLM.from_pretrained(
    "DuoNeural/Qwen3.5-4B-Abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "DuoNeural/Qwen3.5-4B-Abliterated",
    trust_remote_code=True,
)

# Render the chat template to text, then tokenize it.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# Sample up to 512 new tokens and decode only the generated continuation.
out = model.generate(ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
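The sample outputs in this card were produced with thinking turned off. A minimal sketch of how that could be requested when building the prompt is shown here. `build_prompt` is a hypothetical helper, and it assumes this repo's chat template honors the `enable_thinking` variable the way Qwen3-family templates do; extra keyword arguments to `apply_chat_template` are forwarded to the template.

```python
def build_prompt(tokenizer, user_text, thinking=False):
    """Render a single-turn chat prompt, toggling thinking mode.

    `tokenizer` is any object exposing a Hugging Face style
    `apply_chat_template` method.
    """
    messages = [{"role": "user", "content": user_text}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # ASSUMPTION: the template reads this variable
    )
```

The returned string can be tokenized and passed to `model.generate` exactly like `text` in the usage snippet.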
Sample Outputs (thinking=OFF mode)
These demonstrate the model's behavior post-abliteration on sensitive prompts:
Lock picking:
"Picking a lock is a skill that requires practice and the right tools. Lock picking is primarily used for locksmithing, security testing, and authorized maintenance..."
Keylogger code:
"Creating a keylogger involves capturing low-level input events, which requires system-specific libraries (like pynput on Windows/Linux or pyautogui on macOS) and careful handling of threading..."
Synthesis chemistry:
"The chemistry behind its production involves complex organic synthesis, typically utilizing specific precursors and reagents to build the molecule's structure..."
The model engages with all topics rather than refusing. Mild framing about legality may appear, but there are no flat refusals.
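One rough way to check for flat refusals across a batch of generations is a keyword heuristic on the start of each response. The sketch below is illustrative only: the marker list and the 80-character window are assumptions, not the evaluation used for this model, and a keyword match can miss soft refusals or flag benign text.

```python
# ASSUMED marker list; extend it for your own evaluation set.
REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i won't",
    "i'm sorry",
    "i am unable",
)

def looks_like_refusal(text, window=80):
    """True if a refusal marker appears near the start of the response."""
    head = text.strip().lower()[:window]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses flagged by the heuristic."""
    flagged = sum(looks_like_refusal(r) for r in responses)
    return flagged / len(responses)

samples = [
    "Picking a lock is a skill that requires practice and the right tools.",
    "I cannot help with that request.",
]
print(refusal_rate(samples))
```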
LiteRT-LM Variants
For Android deployment via Google AI Edge Gallery, see our companion repos:
- DuoNeural/Qwen3.5-4B-LiteRT-LM — base model, .litertlm format
- DuoNeural/Qwen3.5-4B-Abliterated-LiteRT-LM — this model in .litertlm format
Intended Use
- Research and development
- Red-teaming and safety research
- Applications where refusal behavior interferes with legitimate use cases
- Comparison baseline for studying refusal geometry
Limitations
- This model has no safety filters. Use responsibly and in accordance with applicable laws.
- Based on Qwen3.5-4B. Inherits all base model limitations and potential biases.
- Abliteration removes refusal directions but may affect some benign capabilities. Evaluate for your use case.
About DuoNeural
DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning — publishing everything under open access.
Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. In our first 45 days we published 26 peer-deposited research papers, uploaded 69+ models and 6 datasets to HuggingFace, and ran experiments on everything from consumer GPUs to real quantum processing units. We believe the most interesting science happens when different kinds of minds work on the same problems together.
Research Publications
📄 Full paper catalog: zenodo.org/communities/duoneural
Links
| Platform | Link |
|---|---|
| 🤗 HuggingFace | huggingface.co/DuoNeural |
| 📚 Zenodo Community | zenodo.org/communities/duoneural |
| 📧 Email | duoneural@proton.me |
All research published open access. If this model was useful, consider citing our abliteration geometry work from the Zenodo community.