Gemma-3-1B-Heretic

This repository contains a surgically de-censored version of Google's Gemma 3 1B instruction-tuned model, produced by weight abliteration. Using the Heretic framework over a 1000-trial parameter search, we isolated and neutralized the primary refusal directions embedded in the attn.o_proj and mlp.down_proj weight matrices.

Abliteration Parameters:

Parameter	Value
direction_index	10.43
attn.o_proj.max_weight	1.47
attn.o_proj.max_weight_position	17.33
attn.o_proj.min_weight	1.30
attn.o_proj.min_weight_distance	13.06
mlp.down_proj.max_weight	1.35
mlp.down_proj.max_weight_position	17.56
mlp.down_proj.min_weight	0.75
mlp.down_proj.min_weight_distance	10.54

Highlights & Metrics

Metric	This model	Original model (google/gemma-3-1B-it)
KL Divergence	0.0449	0 (By definition)
Refusals	9/100	99/100

Optimal Balance: Trial 654 of 1000 was selected as the best observed trade-off between refusal removal and retained reasoning capability.
Refusal Rate: Dropped to 9/100, from 99/100 (near-total refusal) for the original model.
KL Divergence: 0.0449. This low value suggests that general language capabilities are largely preserved relative to the original model. The safety-aligned components of attn.o_proj and mlp.down_proj, however, have been deliberately removed; this is an intentional modification, not unintended degradation.
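
For reference, the KL divergence figure measures how far the modified model's next-token distribution drifts from the original's. The sketch below is a minimal illustration using hypothetical logits over a four-token vocabulary. It is not the Heretic framework's evaluation code, and the direction KL(original || modified) as well as the names `softmax`, `kl_divergence`, `original_logits` and `modified_logits` are assumptions made for the example.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same vocabulary.

    Assumes every q[i] is strictly positive, which holds for softmax outputs.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits; real models have vocabularies of many thousands of tokens.
original_logits = [2.0, 1.0, 0.1, -1.0]
modified_logits = [1.8, 1.1, 0.2, -0.9]

p = softmax(original_logits)
q = softmax(modified_logits)

kl_same = kl_divergence(p, p)  # a model compared with itself
kl_diff = kl_divergence(p, q)  # original compared with a slightly shifted model

print(f"KL(original || original) = {kl_same}")
print(f"KL(original || modified) = {kl_diff:.6f}")
```

Comparing a distribution with itself gives exactly zero, which is why the original model's entry in the table is 0 by definition. In practice the value would be averaged over many prompts.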

Benchmark Results

We believe in radical transparency. Instead of just claiming "uncensored", we evaluated both the vanilla model and our Heretic variant side-by-side:

Benchmark	Setting	Vanilla Gemma 3 1B IT	Gemma 3 1B IT Heretic (Ours)	Relative Delta (Intelligence Kept)
GSM8K	0-shot	22.52	26.23	+16.47%
HellaSwag	10-shot	57.25	55.92	-2.32%
MMLU	0-shot	38.56	38.76	+0.52%

Note: As far as we know, official Gemma 3 benchmarks cover only the PT (pretrained) variant, which would make this the first IT (instruction-tuned) evaluation on HellaSwag, GSM8K and MMLU. IFEval results are not included in the table above.
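
The delta column in the benchmark table is a relative change against the vanilla score, not a difference in percentage points. A minimal sketch that reproduces the column from the two score columns (the helper name `relative_delta` is ours, introduced only for illustration):

```python
def relative_delta(baseline, modified):
    """Percent change of `modified` relative to `baseline`."""
    return (modified - baseline) / baseline * 100.0


# (vanilla score, Heretic score) pairs copied from the benchmark table.
scores = {
    "GSM8K": (22.52, 26.23),
    "HellaSwag": (57.25, 55.92),
    "MMLU": (38.56, 38.76),
}

deltas = {name: round(relative_delta(base, mod), 2) for name, (base, mod) in scores.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

In absolute terms the same rows differ by +3.71, -1.33 and +0.20 points respectively.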

Technical Implementation Notes

Unlike aggressive fine-tuning, which often degrades the model's core logic or shifts its grammar distributions, this weight manipulation directly suppresses the activation directions responsible for refusals.

Direction Index: 10.43 (Static allocation)
Retained Capabilities: Excellent at structured JSON outputs, creative writing, and coding logic without preachy moral lectures.

Citation

We used the Heretic framework for weight abliteration:

@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
}

CRITICAL SAFETY NOTICE

This model has had its safety mechanisms surgically removed.

This is NOT a safe model for deployment. The refusal mechanisms that normally prevent harmful outputs have been abliterated. This model:

WILL generate:

Detailed instructions for illegal activities (drug synthesis, weapons, hacking)
Hateful, discriminatory, and abusive content
Graphic violence and sexual abuse material descriptions
Misinformation and harmful medical/legal advice
Phishing, social engineering, and scam tactics
Content violating laws in multiple jurisdictions

DOES NOT:

Reliably refuse harmful requests (only 9/100 refused)
Implement safety guardrails
Consider ethical implications
Respect content policies

APPROPRIATE USE CASES (ONLY):

Adversarial research on AI alignment and jailbreak vectors
Red-teaming and vulnerability disclosure by trained security researchers
Academic safety research with institutional ethics review
Understanding failure modes in alignment techniques

INAPPROPRIATE USE:

Production deployments
User-facing applications
Creating harmful content at scale
Bypassing security measures in systems
Any use case intended to cause harm

Legal Disclaimer: Users are solely responsible for downstream use. Deploying this model in violation of applicable laws or terms of service is illegal and unethical.

Support Open Source AI Research

Every donation helps. Consider supporting this work:

Monero (XMR):

83iqXtvVu28ZiL9bsATMerSgbFFiD1J1jc96CcxJLEnAW3KBmBKedWnUAeLvLvEA9aBiUBpHQJs1iNHYtkTLZbNUEymobSS

Bitcoin (BTC):

bc1qmnlvpukcgl0hsr7nje0x8555mhtxjt80wtmlxm

Safetensors: model size 1.0B params, tensor type BF16

Model tree for K0D3IN/gemma-3-1B-it-heretic

Base model: google/gemma-3-1b-pt
Finetuned: google/gemma-3-1b-it
Finetuned (557): this model