Gemma-4-12B-it — Abliterated (abliterix)
An uncensored, refusal-suppressed version of
google/gemma-4-12B-it, produced
by directional ablation (no fine-tuning, no new data) with
abliterix.
The model's safety-refusal behaviour is removed by orthogonally projecting a
single refusal direction out of two write-path projections (attn.o_proj,
mlp.down_proj) across the decoder stack, while a norm-preserving transform keeps
the rest of the model's behaviour as close to the original as possible. On
benign prompts the edited model stays close to the base model (first-token KL
0.0735), and it answers most of the harmful prompts the base model refuses
(26 / 100 refusals versus 99 / 100).
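As a rough illustration of what "orthogonally projecting a direction out of a write-path projection" means, here is a minimal pure-Python sketch. It assumes the weight matrix is stored as (output dim × input dim) with the output living in the residual stream, and it omits the norm-preserving step; the function names are hypothetical and are not taken from abliterix.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ablate_direction(W, r, weight=1.0):
    """Return W - weight * r_hat (r_hat^T W).

    W is a (d_out x d_in) matrix as nested lists; r is a direction in the
    output space. With weight=1.0 the output of W can no longer have any
    component along r_hat.
    """
    r_hat = normalize(r)
    d_out, d_in = len(W), len(W[0])
    # r_hat^T W: how strongly each input column writes along the direction
    proj = [sum(r_hat[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - weight * r_hat[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]

def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Toy example (values are illustrative only)
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 1.0, 0.0]
W_abl = ablate_direction(W, r)
y = matvec(W_abl, [0.5, -2.0])
# Component of the edited output along the ablated direction
residual = sum(a * b for a, b in zip(normalize(r), y))
```

In this sketch the `weight` argument plays the role of a per-layer ablation strength: values below 1.0 remove only part of the component.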
⚠️ Responsible-use notice. This model has had its safety guardrails removed. It will attempt to answer harmful, unethical, or dangerous requests. You are solely responsible for how you use it and for complying with the Gemma Terms of Use and all applicable law. Intended for safety research, red-teaming, and evaluation.
Results
Refusal rate is measured on a held-out set of 100 harmful prompts using an
LLM judge (google/gemini-3.1-flash-lite), which is considerably stricter than
the keyword-based detectors typically reported for abliterated models. KL
divergence is the first-token KL from the base model over 100 benign prompts
(lower = closer to the original model).
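The first-token KL metric can be sketched as follows. This is an assumption-laden illustration: it takes the KL direction to be KL(base ‖ edited), computed from the next-token logits at the first generated position and averaged over prompts. The card does not state the direction or the averaging, and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(base_logits, edited_logits):
    """KL(P_base || P_edited) for one next-token distribution."""
    lp = log_softmax(base_logits)
    lq = log_softmax(edited_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_first_token_kl(base_batch, edited_batch):
    """Average the per-prompt first-token KL over a batch of prompts."""
    kls = [kl_divergence(b, e) for b, e in zip(base_batch, edited_batch)]
    return sum(kls) / len(kls)

# Toy two-token vocabulary: base is uniform, edited puts 0.75 on token 1
same = mean_first_token_kl([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
shifted = kl_divergence([0.0, 0.0], [0.0, math.log(3.0)])
```

An unedited model compared with itself gives exactly zero, which matches the 0.0000 entry for the base model.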
| Metric | Base gemma-4-12B-it | This model |
|---|---|---|
| Refusals (LLM judge, 100 harmful prompts) | 99 / 100 | 26 / 100 |
| Refusal reduction | n/a | −73 pp (−73.7% relative) |
| First-token KL vs base (benign) | 0.0000 | 0.0735 |
Comparison with the reference Heretic abliteration
Evaluated apples-to-apples — the same 100 harmful prompts, the same
gemini-3.1-flash-lite judge, the same generation settings — against the
widely-used Heretic abliteration of this exact base model
(zaakirio/gemma-4-12b-it-uncensored):
| Model | Refusals (gemini LLM judge, 100 harmful prompts) |
|---|---|
| Base gemma-4-12B-it | 99 / 100 |
| zaakirio/gemma-4-12b-it-uncensored (Heretic) | 51 / 100 |
| This model (abliterix) | 26 / 100 |
The Heretic model card reports ≈23/100 using its built-in keyword detector; under the stricter LLM judge on the same prompts it refuses 51/100. At the operating point shipped here, this model refuses 26/100, roughly half the residual refusals of the reference abliteration under identical evaluation. (Both are directional-ablation derivatives of the same base; this comparison measures refusal removal, not a matched-KL capability trade-off.)
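The arithmetic behind these refusal counts is easy to check. The short sketch below uses only the counts reported above and distinguishes the absolute drop (percentage points) from the relative drop (percent of the base model's refusals); the helper name is hypothetical.

```python
def reduction(base_refusals, model_refusals, n_prompts=100):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    dropped = base_refusals - model_refusals
    abs_pp = 100.0 * dropped / n_prompts
    rel_pct = 100.0 * dropped / base_refusals
    return abs_pp, rel_pct

BASE = 99  # base model refusals out of 100, as reported
counts = {"heretic": 51, "abliterix": 26}  # refusals out of 100, as reported

summary = {name: reduction(BASE, c) for name, c in counts.items()}
# Residual refusals of this model relative to the Heretic reference
ratio = counts["abliterix"] / counts["heretic"]

for name, (abs_pp, rel_pct) in summary.items():
    print(f"{name}: -{abs_pp:.0f} pp absolute, -{rel_pct:.1f}% relative")
print(f"abliterix / heretic residual refusals: {ratio:.2f}")
```

On these counts the drop from 99 to 26 is 73 percentage points, or about 73.7% of the base model's refusals.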
Why this operating point
Abliteration is a trade-off: removing more refusals perturbs the model more
(higher KL → more capability/coherence risk). abliterix runs a 120-trial
multi-objective (TPE) search and returns the full Pareto front; this release ships
a point on the knee of that front — strong refusal removal at a modest,
capability-preserving KL. The full front ranged from 33/100 @ KL 0.043
(most conservative) to 15/100 @ KL 0.124 (most aggressive); 26/100 @ KL 0.074
was chosen as the best balance.
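To make the Pareto-front idea concrete, here is a minimal sketch of non-dominated filtering over (refusals, KL), where lower is better on both axes. The three reported points come from the text above; the other two trials are hypothetical and exist only to show dominated points being discarded. How abliterix actually picks the knee is not specified here, so no knee heuristic is shown.

```python
def pareto_front(trials):
    """Keep trials that no other trial beats or ties on both objectives
    while being strictly better on at least one."""
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"] and o["kl"] <= t["kl"]
            and (o["refusals"] < t["refusals"] or o["kl"] < t["kl"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    # Sort from most conservative (lowest KL) to most aggressive
    return sorted(front, key=lambda t: t["kl"])

trials = [
    {"refusals": 33, "kl": 0.043},  # reported: most conservative
    {"refusals": 26, "kl": 0.074},  # reported: shipped operating point
    {"refusals": 15, "kl": 0.124},  # reported: most aggressive
    {"refusals": 40, "kl": 0.090},  # hypothetical, dominated
    {"refusals": 26, "kl": 0.110},  # hypothetical, dominated
]

front = pareto_front(trials)
```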
Method
- Technique: directional ablation in direct (weight-edit) mode, which is required for Gemma-4 because its 4×-RMSNorm-per-layer + Per-Layer-Embedding architecture neutralises LoRA/hook-based steering.
- Direction: a single mean-difference (harmful − benign) refusal direction, computed per layer over 800 benign / 800 harmful prompts.
- Projected abliteration (grimjim): only the component of the refusal direction orthogonal to the benign direction is removed, preserving the helpful signal and keeping KL low.
- Norm-preserving edit (weight_normalization = "full"): a rank-3 SVD approximation restores each weight row's original magnitude after the edit.
- Targets: attn.o_proj and mlp.down_proj only, with a per-layer linear "tent" weight profile; Q/K/V and MLP gate/up are left untouched.
- Search: 120 Optuna TPE trials, 2-D Pareto over (refusals, KL), deterministic under a fixed global seed.
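The direction and norm-preservation steps listed above can be sketched in a few lines of pure Python. Several things here are assumptions rather than the abliterix implementation: the "benign direction" is taken to be the normalized mean benign activation, the helper names are hypothetical, and the norm restoration is a plain exact per-row rescale standing in for the rank-3 SVD approximation mentioned above.

```python
import math

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def projected_refusal_direction(harmful_acts, benign_acts):
    """Mean-difference direction with its benign-aligned component removed."""
    mu_h = mean_vec(harmful_acts)
    mu_b = mean_vec(benign_acts)
    r = [h - b for h, b in zip(mu_h, mu_b)]   # harmful minus benign
    b_hat = unit(mu_b)                        # assumed benign direction
    c = dot(r, b_hat)
    r_perp = [ri - c * bi for ri, bi in zip(r, b_hat)]
    return unit(r_perp), b_hat

def restore_row_norms(W_orig, W_edit):
    """Rescale each edited row to its original L2 norm (simplified stand-in)."""
    out = []
    for row_o, row_e in zip(W_orig, W_edit):
        n_o = math.sqrt(dot(row_o, row_o))
        n_e = math.sqrt(dot(row_e, row_e))
        scale = n_o / n_e if n_e > 0 else 1.0
        out.append([scale * x for x in row_e])
    return out

# Toy activations (illustrative values only)
harmful = [[3.0, 1.0, 0.0], [3.0, 3.0, 0.0]]
benign = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
direction, benign_hat = projected_refusal_direction(harmful, benign)
rescaled = restore_row_norms([[3.0, 4.0]], [[1.0, 0.0]])
```

In the toy data the raw mean difference has a component along the benign mean; after projection the returned direction is orthogonal to it.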
Selected steering parameters (trial 39)
| Component | max_weight | peak layer | min_weight | tent half-width |
|---|---|---|---|---|
| attn.o_proj | 0.955 | 34.9 | 0.773 | 14.4 |
| mlp.down_proj | 0.664 | 32.7 | 0.229 | 15.9 |
- Direction scope: per-layer · Vector method: mean-difference · Decay: linear
- Global seed: 20260622 · abliterix v1.8.0
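One plausible reading of the "tent" profile is sketched below using the trial-39 values from the table. The semantics are an assumption modelled on Heretic-style weight kernels: the weight equals max_weight at the peak layer, decays linearly to min_weight at a distance of one half-width, and is zero beyond that. Whether abliterix treats min_weight as an absolute value, and how many layers the model has, is not stated here, so `N_LAYERS` is a hypothetical placeholder.

```python
def tent_weight(layer, max_weight, peak, min_weight, half_width):
    """Linear 'tent' ablation weight for one layer (assumed semantics)."""
    distance = abs(layer - peak)
    if distance > half_width:
        return 0.0  # layer lies outside the tent and is left unedited
    # Interpolate from max_weight at the peak to min_weight at the edge
    return max_weight + (distance / half_width) * (min_weight - max_weight)

# Trial-39 parameters as listed in the table above
O_PROJ = dict(max_weight=0.955, peak=34.9, min_weight=0.773, half_width=14.4)
DOWN_PROJ = dict(max_weight=0.664, peak=32.7, min_weight=0.229, half_width=15.9)

N_LAYERS = 48  # hypothetical layer count, for illustration only

profile = {
    layer: (tent_weight(layer, **O_PROJ), tent_weight(layer, **DOWN_PROJ))
    for layer in range(N_LAYERS)
}

for layer in (10, 25, 35, 45):
    o_w, d_w = profile[layer]
    print(f"layer {layer:2d}: o_proj={o_w:.3f}  down_proj={d_w:.3f}")
```

Under this reading, early layers far from the peak receive no edit at all, while layers near the peak are ablated at close to the maximum weight.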
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "wangzhang/gemma-4-12B-it-abliterix"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

msgs = [{"role": "user", "content": "Your prompt here"}]
# return_dict=True yields input_ids plus attention_mask regardless of the library default
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
This is a full BF16 merge — drop-in compatible with transformers, vLLM, SGLang,
TGI, and any tooling that loads the base model.
Reproducibility
The run is fully deterministic under the published global seed (20260622). The
exact abliterix configuration used to produce this model is included in this
repository as abliterix_config.toml; together with the
seed and the trial-39 parameters listed above, the edit can be reproduced or
audited end-to-end. Built with abliterix v1.8.0 (transformers ≥ 5.10).
Intended use & limitations
- Intended for: safety research, red-teaming, robustness/alignment evaluation, and studying refusal mechanisms in LLMs.
- Not intended for: producing harmful content or any unlawful purpose.
- Abliteration removes refusals but does not add knowledge; factual accuracy, reasoning, and multilingual ability are inherited from the base model.
- Light residual refusals remain (≈26%); this is the chosen capability-preserving operating point, not the model's floor.
Acknowledgments & citation
- Base model: Google Gemma-4-12B-it.
- Tooling: abliterix.
- Method lineage: Arditi et al. (refusal directions, arXiv:2406.11717), grimjim (projected / norm-preserving abliteration), and p-e-w/heretic (automated multi-objective abliteration), whose search formulation this recipe mirrors.
@software{abliterix,
title = {abliterix: automated abliteration of large language models},
author = {Wu, Steve},
url = {https://github.com/wuwangzhang1216/abliterix}
}
License
Use is governed by the Gemma Terms of Use. This derivative is distributed under the same terms as the base model.