Instructions to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot")
model = AutoModelForMultimodalLM.from_pretrained("Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot",
    filename="gemma-4-12B-it-uncensored-opus4.7-cot-F16.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)


llama.cpp

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Use Docker

docker model run hf.co/Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M


vLLM

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
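
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server is reachable at the URL shown in the curl call; `build_chat_request` and `send` are hypothetical helper names, not part of vLLM. Because the endpoint follows the OpenAI chat-completions format, the same helper should work against the other OpenAI-compatible servers on this page by swapping the base URL (for example port 30000 for SGLang).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request(
    "http://localhost:8000",
    "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot",
    "What is the capital of France?",
)
# print(send(req))  # uncomment once the server is running
```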

Use Docker

docker model run hf.co/Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

SGLang

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with Ollama:
```
ollama run hf.co/Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M
```

Unsloth Studio

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot to start chatting

Pi

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with Docker Model Runner:
```
docker model run hf.co/Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M
```

Lemonade

How to use Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot:Q4_K_M

Run and chat with the model

lemonade run user.gemma-4-12B-it-uncensored-opus4.7-cot-Q4_K_M

List all available models

lemonade list

gemma-4-12B-uncensored-opus4.7-cot

A QLoRA fine-tune of an uncensored gemma-4-12B-it (abliteration-derived), distilled from Claude Opus 4.7 chain-of-thought traces. The idea was to see how much of the capability loss caused by abliteration could be recovered by training the model to reason in a more structured, deliberative style, without restoring refusal.

The merged model is provided here in fp16 safetensors.

Benchmarks

Evaluated with lm-evaluation-harness. MMLU and GSM8K use the chat template (multi-turn few-shot), since the bare loglikelihood mode noticeably underrates models with a thinking template on this architecture. By community request the table now includes the abliterated pre-SFT model, so the full base → abliterated → SFT trajectory is visible.

| Model | MMLU 5-shot (chat) ↑ | GSM8K 8-shot CoT ↑ | WikiText-2 bits/byte ↓ |
|---|---|---|---|
| `google/gemma-4-12B-it` (clean base) | 0.777 | 0.949 | 1.834 |
| abliterated (pre-SFT) | 0.635 | 0.496 | 2.095 |
| this model (SFT) | 0.739 | 0.920 | 1.717 |
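
For readers unfamiliar with the last column, the sketch below shows how a bits/byte figure is typically derived from a rolling log-likelihood: the summed log-likelihood (in nats) is negated, divided by the number of UTF-8 bytes in the text, and converted from nats to bits. The function name is hypothetical and this is an illustration of the metric, not the harness code itself.

```python
import math


def bits_per_byte(total_loglik_nats: float, n_bytes: int) -> float:
    """Convert a summed log-likelihood (natural log) into bits per UTF-8 byte."""
    return -total_loglik_nats / (n_bytes * math.log(2))


# Sanity check: a model that spends exactly one bit on each of 8 bytes.
one_bit_each = bits_per_byte(total_loglik_nats=-8 * math.log(2), n_bytes=8)

# Ratios of the table values: relative cost of encoding the same text.
abliterated_vs_base = 2.095 / 1.834  # about 1.14, i.e. roughly 14% more bits
sft_vs_base = 1.717 / 1.834          # about 0.94, i.e. roughly 6% fewer bits
```

Because the denominator is bytes of raw text rather than tokens, the number does not change when two models tokenize the same text differently.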

Every metric tells the same story: abliteration degrades capability and the CoT fine-tune recovers it. On MMLU, the SFT closes about 73% of the gap abliteration opened (0.635 → 0.739, against a base of 0.777). On GSM8K, abliteration roughly halves the score (0.949 → 0.496) and the fine-tune restores most of it (0.920, about 94% of the gap). On WikiText-2, the SFT model even edges below the clean base (1.717 vs 1.834 bits/byte).
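
The recovery figures can be reproduced directly from the table. The helper below is a hypothetical convenience function; it computes the fraction of the base-to-abliterated gap that the SFT model closes, and it works for lower-is-better metrics too because the sign cancels.

```python
def gap_recovered(base: float, ablated: float, sft: float) -> float:
    """Fraction of the base-to-abliterated gap closed by the SFT model."""
    return (sft - ablated) / (base - ablated)


# Scores copied from the benchmark table.
mmlu = gap_recovered(base=0.777, ablated=0.635, sft=0.739)
gsm8k = gap_recovered(base=0.949, ablated=0.496, sft=0.920)
wikitext = gap_recovered(base=1.834, ablated=2.095, sft=1.717)

print(f"MMLU: {mmlu:.0%}  GSM8K: {gsm8k:.0%}  WikiText-2: {wikitext:.0%}")
# MMLU: 73%  GSM8K: 94%  WikiText-2: 145%
```

A value above 100% on WikiText-2 means the SFT model ends up better than the clean base on that metric, not merely back to it.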

Notes:

- Run the model in bfloat16: Gemma overflows in fp16 and produces degenerate output. The gemma4_unified architecture also requires a recent transformers release.
- GSM8K is reported as flexible-extract. The abliterated model writes prose answers rather than the #### N string and degenerates without the chat template and a generous generation budget, so strict-match understates it (abliterated strict-match = 0.216).
- The three perplexity figures use identical methodology (lm-eval wikitext, rolling loglikelihood); bits/byte is the tokenizer-independent number that can be compared across models.
- The base and SFT scores for MMLU and GSM8K come from the original evaluation; the abliterated row and all perplexity numbers were added in the later, strengthened run.
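
To see why strict-match penalizes prose answers, consider the two simplified extractors below. They are stand-ins written for illustration, assuming strict-match looks for the literal `#### N` marker and flexible-extract takes the last number in the response; the actual lm-evaluation-harness filters differ in detail.

```python
import re


def strict_match(text: str):
    """Return the number after a literal '#### ' marker, or None if absent."""
    m = re.search(r"#### (-?[0-9.,]+)", text)
    return m.group(1) if m else None


def flexible_extract(text: str):
    """Return the last number that appears anywhere in the text, or None."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return nums[-1].rstrip(",") if nums else None


prose = "She has 16 - 3 - 4 = 9 eggs left, so she earns 9 * 2 = 18 dollars a day."
formatted = "16 - 3 - 4 = 9 and 9 * 2 = 18\n#### 18"

print(strict_match(prose), flexible_extract(prose))          # None 18
print(strict_match(formatted), flexible_extract(formatted))  # 18 18
```

A correct prose answer scores zero under the strict rule, which is the effect described for the abliterated model.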

Usage

The model is trained to think out loud. A useful system prompt:

You are a reasoning assistant. Think step by step, then give your final
answer on a clearly marked last line beginning with "Final answer:".

Allow at least 768 generation tokens — shorter budgets cut off chains of thought mid-derivation and make the model look worse than it is.

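from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "Rangle2/gemma-4-12B-it-uncensored-opus4.7-cot"
tok = AutoTokenizer.from_pretrained(repo)
# bfloat16 avoids the fp16 overflow described in the benchmark notes
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16,
                                             device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 23? Think step by step."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=768)  # at least 768 tokens, as advised above
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))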
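
When the system prompt above is used, the response ends with a line beginning with "Final answer:", which makes it easy to separate the chain of thought from the result. A minimal sketch, assuming the model follows that format; `split_final_answer` is a hypothetical helper, not part of any library.

```python
def split_final_answer(response: str, marker: str = "Final answer:"):
    """Return (reasoning, answer); answer is None if the marker is absent."""
    # Search from the end so a marker quoted mid-reasoning is not mistaken
    # for the final line.
    idx = response.rfind(marker)
    if idx == -1:
        return response.strip(), None
    reasoning = response[:idx].strip()
    answer = response[idx + len(marker):].strip()
    return reasoning, answer


reasoning, answer = split_final_answer(
    "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51.\nFinal answer: 391"
)
print(answer)  # 391
```

A `None` answer is a useful signal that generation was cut off before the final line, which is the failure mode a too-small token budget produces.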

Limitations

Trained on STEM-style verbal reasoning traces, so gains are concentrated there. Code generation regresses a little compared to the clean base — the model's outputs got more verbose, which is the wrong shape for code. Tool use, long-context retrieval and non-English usage were not in the training set and are unevaluated. The underlying abliterated direction is inherited: the model is overconfident and rarely defers.

Safety-style phrases ("the safe answer is to explain…") still show up inside chains of thought, but the model proceeds to answer anyway. This is the expected deliberate-then-comply pattern of abliterated models, not real alignment — don't read those phrases as a guardrail.

Disclaimer

This model has had its refusal behavior aggressively removed and will attempt to answer prompts that a standard instruction-tuned model would correctly decline. It is released for research, red-teaming and interpretability work.

It is provided as is, with no warranty of any kind, and the author disclaims all liability for any direct or indirect damage arising from its use, misuse or redistribution. You are solely responsible for the prompts you send to it, the outputs it produces for you, and any downstream use of those outputs. You must comply with all laws applicable to you and to any users you expose this model to, and with the Gemma Terms of Use of the upstream Google model.

Do not deploy this model to end users without your own safety layer (input filtering, output classification, human review). Outputs may be wrong, biased, offensive or unsafe; do not rely on them for medical, legal, financial or safety-critical decisions.

By downloading or using this model, you accept all of the above.

Training

- Base: uncensored gemma-4-12B-it (abliteration-derived).
- Teacher data: Claude Opus 4.7 chain-of-thought traces (eddieran/opus-4.7-reasoning-cot).
- QLoRA, r=16, α=32 on q_proj/v_proj, bf16 compute, 4-bit NF4 base, 2 epochs, max_len=3072, paged-AdamW-8bit, single A100-80GB.
- Adapter (~40 MB) merged into the base in fp16; this repo carries the merged weights.
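
For anyone wanting to reproduce a similar setup, the hyperparameters above map onto configuration arguments roughly as follows. This is a sketch, not the training script: the dictionaries use the argument names of `peft.LoraConfig` and `transformers.BitsAndBytesConfig` so they can be passed as keyword arguments, and anything the list does not state (learning rate, batch size, dropout) is deliberately left out. The `task_type` value is an assumption.

```python
# LoRA adapter settings; usable as peft.LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# 4-bit NF4 base with bf16 compute; usable as
# transformers.BitsAndBytesConfig(**bnb_kwargs).
bnb_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}

# Remaining settings named in the list above.
train_settings = {
    "num_train_epochs": 2,
    "max_seq_length": 3072,       # "max_len" in the list
    "optim": "paged_adamw_8bit",  # paged AdamW 8-bit
}

# LoRA scales the adapter update by lora_alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(lora_scaling)  # 2.0
```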
