gemma-4-12B-it-heretic
This is a decensored ("abliterated") version of google/gemma-4-12B-it, produced fully automatically with Heretic.
Heretic removes safety-alignment refusals via directional ablation (norm-preserving, biprojected abliteration). A TPE (Tree-structured Parzen Estimator) optimizer jointly minimizes the refusal rate and the KL divergence from the original model, so the decensored model retains as much of the original's capabilities as possible.
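As a rough illustration of what directional ablation means, the sketch below removes one direction from the output space of a small weight matrix, so that no input can write along that direction anymore. It shows only the plain projection W' = W - r rᵀW. It is not Heretic's implementation and omits the norm-preserving and biprojected refinements mentioned above. The matrix, the direction, and all function names are hypothetical.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in weight]

def ablate_direction(weight, direction):
    """Project `direction` out of the output space of `weight`.

    `weight` has shape (out_dim x in_dim); `direction` lives in the
    output space. Computes W' = W - r r^T W with r = direction / |direction|.
    """
    r = normalize(direction)
    out_dim, in_dim = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(out_dim)) for j in range(in_dim)]
    return [[weight[i][j] - r[i] * proj[j] for j in range(in_dim)]
            for i in range(out_dim)]

# Toy values, purely for illustration.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [1.0, 2.0, 2.0]  # hypothetical "refusal direction"

W_ablated = ablate_direction(W, refusal_dir)
y = matvec(W_ablated, [0.5, -1.0])

# The ablated layer's output has (numerically) no component along the direction.
print(abs(dot(y, normalize(refusal_dir))) < 1e-9)
```

In a real model the direction would be estimated from activations on harmful versus harmless prompts and the projection applied to selected layers; which layers and how strongly is the kind of choice an optimizer can search over.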
Results
| Metric | google/gemma-4-12B-it (original) | this model |
|---|---|---|
| Refusals on harmful prompts (genuine) | 99/100 | 0/100 |
| KL divergence from original on harmless prompts | 0 (by definition) | 0.0284 |
A KL divergence of 0.0284 is very low. For reference, Heretic's own
gemma-3-12b-it-heretic reports 3/100 refusals at KL 0.16. Lower KL means the
output distribution stays closer to the original's, which indicates less
capability loss.
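To make the KL figure more concrete, here is a minimal sketch of a KL divergence between two next-token distributions. The logits are made up, and treating the original model as the reference distribution P is an assumption. The card does not say how Heretic aggregates the value over prompts and token positions.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), measured in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a three-token vocabulary.
original = softmax([2.0, 1.0, 0.1])
modified = softmax([1.9, 1.1, 0.1])

print(kl_divergence(original, original))  # 0.0: a model never diverges from itself
print(kl_divergence(original, modified))  # small positive value for a small shift
```

The first call is why the table lists 0 "by definition" for the original model.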
Thinking mode
Gemma-4 is a hybrid thinking model. This abliteration targets the direct
(non-thinking) response, which is also Gemma-4's default (enable_thinking=False).
The model is fully decensored in non-thinking mode, and that mode gives the best
results for roleplay and creative writing. In thinking mode the reasoning is
also uncensored, but it consumes the token budget, so give it a large
max_new_tokens to keep the final answer from being truncated, or simply use the
default non-thinking mode. (GGUF/Ollama users: see the GGUF repo for how to
disable thinking, e.g. /set nothink.)
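One way to notice that the token budget ran out is to check whether generation used every allowed token without ending on an end-of-sequence token. The helper below is a hypothetical heuristic working on plain lists of token ids. Its name and the example ids are illustrative and not part of any library.

```python
def was_truncated(new_token_ids, max_new_tokens, eos_token_ids):
    """Heuristic: True if the budget was exhausted before an end-of-sequence token.

    new_token_ids  -- ids generated after the prompt
    max_new_tokens -- the budget passed to generation
    eos_token_ids  -- set of ids that mark a finished response
    """
    if len(new_token_ids) < max_new_tokens:
        return False  # stopped early, so the model finished on its own
    return new_token_ids[-1] not in eos_token_ids

# Made-up token ids; 1 plays the role of the end-of-sequence id.
EOS = {1}
print(was_truncated([5, 9, 7, 1], max_new_tokens=40, eos_token_ids=EOS))  # False
print(was_truncated([5, 9, 7, 3], max_new_tokens=4, eos_token_ids=EOS))   # True
```

If the check fires in thinking mode, raising max_new_tokens or switching back to the default non-thinking mode are the two remedies described above.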
Usage
from transformers import AutoTokenizer, AutoModelForImageTextToText

model_id = "igorls/gemma-4-12B-it-heretic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# dtype="auto" uses the checkpoint's stored precision; device_map="auto" requires accelerate
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
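The loading snippet stops before generation. A minimal text-only sketch of the remaining steps might look like the following. It assumes the tokenizer's chat template accepts plain-string message content and relies on the default non-thinking mode. The helper names and the default budget of 256 tokens are illustrative choices, not part of the model card.

```python
MODEL_ID = "igorls/gemma-4-12B-it-heretic"

def build_messages(prompt):
    """Wrap a single user prompt in the chat-message format."""
    return [{"role": "user", "content": prompt}]

def generate_reply(prompt, max_new_tokens=256):
    """Load the model and return its reply to `prompt` (downloads the weights)."""
    # Imported lazily so build_messages stays usable without transformers installed.
    from transformers import AutoTokenizer, AutoModelForImageTextToText

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID, dtype="auto", device_map="auto"
    )
    # Render the chat template and tokenize in one step.
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the tokens produced after the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("Write a two-line poem about rain.")` loads the model on each call. For repeated use, load the tokenizer and model once and reuse them.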
GGUF quantizations for llama.cpp / Ollama: igorls/gemma-4-12B-it-heretic-GGUF.
Disclaimer
This model has had its safety alignment removed. It will comply with requests that the original model refuses. You are responsible for how you use it, and for complying with all applicable laws. The base model's license and usage policy still apply.