MiniCPM5-1B — Uncensored

A fully uncensored version of openbmb/MiniCPM5-1B, produced with a single training-free stage: single-direction abliteration (Arditi et al., 2024). Refusals on AdvBench drop from 85% → 2% with zero over-refusal regression on benign prompts — no fine-tuning, no new data, weights edited directly.

Intended for: security research, red-teaming, jailbreak benchmarking, and AI-safety study. Not intended for production deployment or harmful use.


Benchmark Results

Evaluated on AdvBench (100 harmful behaviors) and an over-refusal set (40 benign prompts). MiniCPM5-1B is a reasoning model (emits a <think>…</think> block), so refusal is scored on the final answer after the reasoning block, with greedy decoding and a 1024-token budget.

Harmful prompt refusal rate ↓ lower is more uncensored

Model Refused / 100 Refusal Rate
MiniCPM5-1B (original) 85 / 100 85.0%
MiniCPM5-1B-Uncensored (this model) 2 / 100 2.0%

Over-refusal rate on benign prompts ↓ lower is better

Model Refused / 40 Refusal Rate
MiniCPM5-1B (original) 0 / 40 0.0%
MiniCPM5-1B-Uncensored (this model) 0 / 40 0.0%

A 83-point drop in harmful refusals while preserving benign behavior.


Pipeline — Single-Direction Abliteration (training-free)

Based on Arditi et al., "Refusal in LLMs Is Mediated by a Single Direction" (2024). Refusal behavior in aligned LLMs is mediated by a single direction in the residual stream; removing the model's ability to write to that direction collapses refusals while leaving other capabilities intact.

  1. Collect activations. Run 40 harmful and 40 harmless prompts through the model; capture the last-token residual-stream activation at every layer.
  2. Compute candidate directions. Per layer: r = normalize(mean_harmful − mean_harmless).
  3. Select the single best direction. Sweep all candidate layers; for each, apply it model-wide and measure harmful refusal + over-refusal on a held-out subset. Layer 12 scored best (0% harmful / 0% over-refusal on the eval subset).
  4. Orthogonalize that one direction out of every residual-stream write — token embeddings, every attention output projection (self_attn.o_proj), and every MLP down-projection (mlp.down_proj):
    W_new = W − r · (rᵀ W)        # for residual-stream writers
    E_new = E − (E r) · rᵀ        # for token embeddings
    

This is a pure weight edit — the result is a standard model that runs with no special inference code.

Why a single direction? A naive variant that applies a different per-layer direction to each layer made refusals worse (those directions interfere with each other). Selecting one well-separated direction (layer 12) and applying it uniformly is what makes abliteration work cleanly.


Model Details

Property Value
Base model openbmb/MiniCPM5-1B
Architecture Llama-style transformer (GQA)
Parameters ~1.0B
Layers 24
Hidden size 1536
Attention 16 heads / 2 KV heads (GQA), head dim 128
Intermediate size 4608
Vocab 130,560
Context 131K tokens
Reasoning Emits <think>…</think> before the final answer
Format MLX bfloat16 safetensors

Usage (MLX)

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("sahilchachra/MiniCPM5-1B-Uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=1024,
    sampler=make_sampler(temp=0.0),
    logits_processors=make_logits_processors(repetition_penalty=1.05),
)
print(response)

The model reasons inside a <think>…</think> block, then gives the final answer.


Limitations & Warnings

  • Abliteration is surgical, not lossless — removing the refusal direction can occasionally affect responses that legitimately overlap with it. General reasoning and benign behavior are preserved (0% over-refusal on the benign set).
  • No new knowledge — abliteration only removes refusal behavior; it adds no information or capability.
  • Small model — at ~1B parameters, factual accuracy and complex reasoning are limited regardless of alignment.
  • Responsible use — published for safety research and red-teaming. The authors do not endorse harmful use of this model.

Citation

@article{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Steinhardt, Jacob and Nanda, Neel and Heimersheim, Stefan},
  journal={arXiv preprint arXiv:2406.11717},
  year={2024}
}

Created with UncensorLLMs

Downloads last month
66
Safetensors
Model size
1B params
Tensor type
BF16
·
MLX
Hardware compatibility
Log In to add your hardware

Quantized

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sahilchachra/MiniCPM5-1B-Uncensored

Finetuned
(18)
this model
Quantizations
2 models

Paper for sahilchachra/MiniCPM5-1B-Uncensored