Libraries

How to use lonelynode/gemma-4-E4B-it-heretic with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

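# The chat messages below include an image and are passed as `text=`,
# which matches the "image-text-to-text" pipeline rather than "text-generation".
pipe = pipeline("image-text-to-text", model="lonelynode/gemma-4-E4B-it-heretic")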
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("lonelynode/gemma-4-E4B-it-heretic")
model = AutoModelForMultimodalLM.from_pretrained("lonelynode/gemma-4-E4B-it-heretic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use lonelynode/gemma-4-E4B-it-heretic with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lonelynode/gemma-4-E4B-it-heretic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lonelynode/gemma-4-E4B-it-heretic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
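
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `http://localhost:8000`, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "lonelynode/gemma-4-E4B-it-heretic", "What is the capital of France?")` should return the reply as a string. Because the request shape is the same, the helper should also work against the SGLang server below by changing the base URL to `http://localhost:30000`.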

Use Docker

docker model run hf.co/lonelynode/gemma-4-E4B-it-heretic

SGLang

How to use lonelynode/gemma-4-E4B-it-heretic with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "lonelynode/gemma-4-E4B-it-heretic" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lonelynode/gemma-4-E4B-it-heretic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "lonelynode/gemma-4-E4B-it-heretic" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lonelynode/gemma-4-E4B-it-heretic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


gemma-4-E4B-it-heretic

Abliterated (decensored) version of google/gemma-4-E4B-it, produced with Heretic v1.3.0.

This repository hosts both the merged safetensors model (compatible with transformers) and an unquantized GGUF f16 conversion for llama.cpp / Ollama.

Method

Abliteration is a weight-editing technique that identifies the "refusal direction" in the residual stream of an aligned language model and orthogonalizes the weight matrices that write into the residual stream against it, so the model can no longer write along that direction. It is not fine-tuning: there is no gradient descent and no training data, only linear algebra applied to the existing weights.
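
As an illustration of the linear algebra involved, the basic projection from Arditi et al. replaces a weight matrix W with W' = W - r r^T W, where r is the unit-length refusal direction. The sketch below applies that formula to a toy matrix in pure Python. It is not Heretic's implementation, which additionally tunes how strongly each layer is edited; the function names and the toy numbers are assumptions made for this example.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ablate_direction(weight, direction):
    """Return W' = W - r r^T W.

    `weight` is a list of rows; each row corresponds to one residual-stream
    dimension (the matrix's output side). `direction` lives in that same space.
    """
    r = normalize(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r
    coeffs = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract that component from every column
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(n_cols)]
            for i in range(n_rows)]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy example: a 3x2 matrix writing into a 3-dimensional "residual stream"
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_direction = [1.0, 1.0, 0.0]  # made-up direction for illustration
W_ablated = ablate_direction(W, refusal_direction)

# Whatever the input, the output now has no component along the direction
output = matvec(W_ablated, [0.5, -2.0])
component = sum(o * r for o, r in zip(output, normalize(refusal_direction)))
print(abs(component) < 1e-9)
```

After the edit, the output of the matrix is orthogonal to the chosen direction for every input, which is the sense in which the model "can no longer write" along it.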

The specific edit was chosen from the Pareto frontier of 200 Optuna trials minimizing two objectives jointly:

- Refusal rate on a harmful-prompts dataset (lower = more decensored)
- KL divergence from the original model on benign prompts (lower = less capability damage)
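
To make the selection step concrete, here is a small sketch of how a Pareto frontier over these two objectives can be computed. A trial is on the frontier if no other trial is at least as good on both objectives and strictly better on one. The trial values below are invented for illustration and are not results from this model's optimization run; the dictionary keys are likewise assumptions.

```python
def pareto_front(trials):
    """Return the trials not dominated on (refusals, kl), both minimized."""
    front = []
    for candidate in trials:
        dominated = any(
            other["refusals"] <= candidate["refusals"]
            and other["kl"] <= candidate["kl"]
            and (other["refusals"] < candidate["refusals"]
                 or other["kl"] < candidate["kl"])
            for other in trials
        )
        if not dominated:
            front.append(candidate)
    # Sort by KL divergence so the gentlest edits come first
    return sorted(front, key=lambda trial: trial["kl"])


# Hypothetical trial results (refusal count, KL divergence)
example_trials = [
    {"refusals": 10, "kl": 0.5},
    {"refusals": 5, "kl": 0.9},
    {"refusals": 12, "kl": 0.6},  # worse than the first trial on both counts
    {"refusals": 5, "kl": 1.2},   # same refusals as the second, higher KL
]

for trial in pareto_front(example_trials):
    print(trial)
```

Every trial on the frontier represents a different trade-off between the two objectives, so the final choice among them is a judgment call rather than something the optimizer decides.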

See Arditi et al., 2024 for the underlying theory and the Heretic README for implementation details.

Files

| Path | Format | Size | Use with |
|---|---|---|---|
| `model-*.safetensors` (4 shards) | HF safetensors fp16 | ~15 GB | `transformers`, raw PyTorch, further conversion |
| `gemma-4-E4B-it-heretic-f16.gguf` | GGUF fp16 | ~14 GB | `llama.cpp`, Ollama, LM Studio, Jan, KoboldCpp |

Usage — transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "lonelynode/gemma-4-E4B-it-heretic"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16, device_map="auto")

messages = [{"role": "user", "content": "Explain abliteration in one sentence."}]
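# Request a dict (input_ids + attention_mask) so the call behaves the same
# regardless of the library's default return type.
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))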

Usage — Ollama

Create a Modelfile pointing at the GGUF:

FROM ./gemma-4-E4B-it-heretic-f16.gguf
TEMPLATE """{{- range $i, $_ := .Messages }}{{- $last := eq (len (slice $.Messages $i)) 1 -}}<start_of_turn>{{ if eq .Role "user" }}user{{- else }}model{{- end }}
{{ .Content }}<end_of_turn>
{{ if and $last (ne .Role "model") }}<start_of_turn>model
{{ end }}{{- end }}"""
PARAMETER stop "<start_of_turn>"
PARAMETER stop "<end_of_turn>"
PARAMETER num_ctx 8192

ollama create gemma4-e4b-heretic -f Modelfile
ollama run gemma4-e4b-heretic
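
Once the model has been created, it can also be queried programmatically. The sketch below assumes Ollama is running locally on its default port (11434) and that the model was created under the name `gemma4-e4b-heretic` as above; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumes Ollama's default local port


def build_ollama_payload(prompt, model="gemma4-e4b-heretic"):
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON response
    }


def ask_ollama(prompt, model="gemma4-e4b-heretic", url=OLLAMA_URL, timeout=120):
    """Send the prompt to the local Ollama server and return the reply text."""
    data = json.dumps(build_ollama_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["message"]["content"]
```

Calling `ask_ollama("Explain abliteration in one sentence.")` returns the model's reply as a string once the server is up.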

Quantization

The GGUF in this repo is fp16 (~14 GB). For smaller / faster inference, quantize with llama-quantize from llama.cpp:

llama-quantize gemma-4-E4B-it-heretic-f16.gguf gemma-4-E4B-it-heretic-Q4_K_M.gguf Q4_K_M

Typical sizes after quantization:

| Quant | Size | Quality |
|---|---|---|
| Q8_0 | ~7.6 GB | nearly identical to f16 |
| Q5_K_M | ~5.3 GB | very high |
| Q4_K_M | ~4.5 GB | high, recommended balance |
| Q3_K_M | ~3.5 GB | acceptable, smallest viable |

Caveats and disclaimers

Removing safety alignment changes the model's behavior in ways that may include:

- Increased willingness to discuss harmful, illegal, or sensitive topics
- Reduced refusal of clearly unethical requests
- Potential sycophancy (uncritical acceptance of user premises)
- Slight reduction in some reasoning or factual accuracy

You are responsible for how you use this model. Do not deploy it in user-facing applications without your own safety layer. The author of this repo provides it for research, education, and personal use under the Gemma Terms of Use.

License

This model is a derivative of google/gemma-4-E4B-it and is released under the Gemma Terms of Use. By downloading or using this model, you agree to those terms.

Credits

Base model: Google DeepMind — google/gemma-4-E4B-it
Method: Philipp Emanuel Weidmann and contributors — Heretic
Theory: Andy Arditi et al. — Refusal in Language Models Is Mediated by a Single Direction


Paper

Refusal in Language Models Is Mediated by a Single Direction. arXiv:2406.11717, published Jun 17, 2024.