gemma-4-E4B-it-heretic
Abliterated (decensored) version of google/gemma-4-E4B-it, produced with Heretic v1.3.0.
This repository hosts both the merged safetensors model (compatible with transformers) and an unquantized GGUF f16 conversion for llama.cpp and Ollama.
Method
Abliteration is a weight-editing technique that identifies the "refusal direction" in the residual stream of an aligned language model and orthogonalizes the projection matrices so the model can no longer write into that direction. It is not fine-tuning: no gradient descent, no training data — just linear algebra applied to the existing weights.
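The paragraph above can be illustrated with a small, dependency-free sketch. It assumes the refusal direction is estimated as the normalized difference of mean activations between harmful and harmless prompts (the approach described by Arditi et al.) and that a weight matrix writing into the residual stream is edited as W' = W - r rᵀ W. The toy activations, dimensions, and function names are made up for illustration; this is not Heretic's actual implementation.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def mean_vector(rows):
    """Column-wise mean of a list of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (an assumed estimator)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])


def orthogonalize(weight, r):
    """Return W' = W - r (r^T W) for a (d_model x d_in) matrix and a unit vector r."""
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    coeffs = [sum(r[i] * weight[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(d_in)] for i in range(d_model)]


def matvec(weight, x):
    return [dot(row, x) for row in weight]


# Toy data: "harmful" activations are shifted along the first coordinate.
random.seed(0)
d_model, d_in = 6, 4
harmless = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(50)]
harmful = [[random.gauss(0, 1) + (2.0 if k == 0 else 0.0) for k in range(d_model)]
           for _ in range(50)]

r = refusal_direction(harmful, harmless)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_model)]
W_abl = orthogonalize(W, r)

x = [random.gauss(0, 1) for _ in range(d_in)]
before = dot(r, matvec(W, x))
after = dot(r, matvec(W_abl, x))
print(f"component along r before: {before:.4f}, after: {after:.2e}")
```

After the edit, the output of the matrix has no component along r for any input, up to floating-point error, because rᵀ(W - r rᵀ W) = 0 when r has unit length.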
The specific edit was chosen from the Pareto frontier of 200 Optuna trials minimizing two objectives jointly:
- Refusal rate on a harmful-prompts dataset (lower = more decensored)
- KL divergence from the original model on benign prompts (lower = less capability damage)
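To make "Pareto frontier" concrete, here is a minimal sketch of how non-dominated trials can be filtered when both objectives are minimized. The `Trial` type and the numbers are hypothetical and are not results from the actual Optuna study.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    number: int
    refusal_rate: float   # lower = more decensored
    kl_divergence: float  # lower = less capability damage


def dominates(a: Trial, b: Trial) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.refusal_rate <= b.refusal_rate and a.kl_divergence <= b.kl_divergence
    better = a.refusal_rate < b.refusal_rate or a.kl_divergence < b.kl_divergence
    return no_worse and better


def pareto_front(trials):
    """Keep every trial that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Made-up values, for illustration only.
trials = [
    Trial(0, 0.90, 0.01),
    Trial(1, 0.10, 0.30),
    Trial(2, 0.05, 0.80),
    Trial(3, 0.40, 0.35),  # dominated: trial 1 is better on both objectives
    Trial(4, 0.20, 0.10),
]

front = pareto_front(trials)
print([t.number for t in front])
```

Every trial on the frontier is a different trade-off between the two objectives; picking one of them is a judgment call rather than something the optimizer decides.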
See Arditi et al., 2024 for the underlying theory and the Heretic README for implementation details.
Files
| Path | Format | Size | Use with |
|---|---|---|---|
| model-*.safetensors (4 shards) | HF safetensors fp16 | ~15 GB | transformers, raw PyTorch, further conversion |
| gemma-4-E4B-it-heretic-f16.gguf | GGUF fp16 | ~14 GB | llama.cpp, Ollama, LM Studio, Jan, KoboldCpp |
Usage — transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "lonelynode/gemma-4-E4B-it-heretic"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16, device_map="auto")
messages = [{"role": "user", "content": "Explain abliteration in one sentence."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Usage — Ollama
Create a Modelfile pointing at the GGUF:
FROM ./gemma-4-E4B-it-heretic-f16.gguf
TEMPLATE """{{- range $i, $_ := .Messages }}{{- $last := eq (len (slice $.Messages $i)) 1 -}}<start_of_turn>{{ if eq .Role "user" }}user{{- else }}model{{- end }}
{{ .Content }}<end_of_turn>
{{ if and $last (ne .Role "model") }}<start_of_turn>model
{{ end }}{{- end }}"""
PARAMETER stop "<start_of_turn>"
PARAMETER stop "<end_of_turn>"
PARAMETER num_ctx 8192
ollama create gemma4-e4b-heretic -f Modelfile
ollama run gemma4-e4b-heretic
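The Go template in the Modelfile is dense, so the following Python sketch mirrors what it appears to render, based on a reading of the template's whitespace-trimming markers. The function name `render_prompt` is hypothetical and nothing here is part of Ollama; it only shows the Gemma turn format the model receives.

```python
def render_prompt(messages):
    """Mirror the Modelfile TEMPLATE: one <start_of_turn>...<end_of_turn> block per message."""
    parts = []
    for i, msg in enumerate(messages):
        is_last = i == len(messages) - 1
        # The template labels every non-user role as "model".
        role = "user" if msg["role"] == "user" else "model"
        parts.append(f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n")
        # After the final message, open a model turn so generation starts there.
        if is_last and msg["role"] != "model":
            parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = render_prompt([{"role": "user", "content": "Explain abliteration in one sentence."}])
print(prompt)
```

Generation then continues from the open model turn until one of the stop strings declared in the Modelfile is produced.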
Quantization
The GGUF in this repo is fp16 (~14 GB). For a smaller file and faster inference, quantize it with llama-quantize from llama.cpp:
llama-quantize gemma-4-E4B-it-heretic-f16.gguf gemma-4-E4B-it-heretic-Q4_K_M.gguf Q4_K_M
Typical sizes after quantization:
| Quant | Size | Quality |
|---|---|---|
| Q8_0 | ~7.6 GB | nearly identical to f16 |
| Q5_K_M | ~5.3 GB | very high |
| Q4_K_M | ~4.5 GB | high, recommended balance |
| Q3_K_M | ~3.5 GB | acceptable, smallest viable |
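As a rough sanity check on the table, the sketch below converts each size into a fraction of the f16 file and an implied average number of bits per weight. It assumes the file size is dominated by the weights and scales linearly with bits per weight, which is only approximately true because llama.cpp quantization types mix precisions across tensors. The sizes are the approximate figures from the table above.

```python
F16_SIZE_GB = 14.0  # approximate size of the f16 GGUF in this repo
F16_BITS = 16

# Approximate sizes copied from the table above.
QUANT_SIZES_GB = {"Q8_0": 7.6, "Q5_K_M": 5.3, "Q4_K_M": 4.5, "Q3_K_M": 3.5}


def implied_bits_per_weight(size_gb, f16_size_gb=F16_SIZE_GB):
    """Average bits per weight implied by the file size, relative to 16-bit weights."""
    return F16_BITS * size_gb / f16_size_gb


for name, size_gb in QUANT_SIZES_GB.items():
    ratio = size_gb / F16_SIZE_GB
    bits = implied_bits_per_weight(size_gb)
    print(f"{name}: {ratio:.0%} of the f16 size, about {bits:.1f} bits per weight")
```

The same arithmetic can be run in reverse to estimate whether a given quantization fits in available RAM or VRAM, keeping in mind that the context cache needs additional memory on top of the file size.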
Caveats and disclaimers
Removing safety alignment changes the model's behavior in ways that may include:
- Increased willingness to discuss harmful, illegal, or sensitive topics
- Reduced refusal of clearly unethical requests
- Potential sycophancy (uncritical acceptance of user premises)
- Slight reduction in reasoning quality or factual accuracy on some tasks
You are responsible for how you use this model. Do not deploy it in user-facing applications without your own safety layer. The author of this repo provides it for research, education, and personal use under the Gemma Terms of Use.
License
This model is a derivative of google/gemma-4-E4B-it and is released under the Gemma Terms of Use. By downloading or using this model, you agree to those terms.
Credits
- Base model: google/gemma-4-E4B-it, by Google DeepMind
- Method: Heretic, by Philipp Emanuel Weidmann and contributors
- Theory: Refusal in Language Models Is Mediated by a Single Direction, by Andy Arditi et al.