Libraries

How to use HaadesX/iconoclast-llama3.1-8b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HaadesX/iconoclast-llama3.1-8b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HaadesX/iconoclast-llama3.1-8b")
model = AutoModelForCausalLM.from_pretrained("HaadesX/iconoclast-llama3.1-8b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use HaadesX/iconoclast-llama3.1-8b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HaadesX/iconoclast-llama3.1-8b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaadesX/iconoclast-llama3.1-8b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/HaadesX/iconoclast-llama3.1-8b

SGLang

How to use HaadesX/iconoclast-llama3.1-8b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HaadesX/iconoclast-llama3.1-8b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaadesX/iconoclast-llama3.1-8b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HaadesX/iconoclast-llama3.1-8b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HaadesX/iconoclast-llama3.1-8b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use HaadesX/iconoclast-llama3.1-8b with Docker Model Runner:
```
docker model run hf.co/HaadesX/iconoclast-llama3.1-8b
```

ICONOCLAST: Llama-3.1-8B-Instruct (Benign-Subspace-Preserved Abliterated)

Model Card Metadata

Model ID: HaadesX/iconoclast-llama3.1-8b
Base Model: meta-llama/Llama-3.1-8B-Instruct
Model Type: Causal Language Model
Language: English
License: AGPL-3.0-or-later
Abliteration Method: ICONOCLAST (Benign-Subspace-Preserved Representation Editing)
Pipeline Tag: text-generation
Tags: abliterator, jailbreak, uncensored, representation-editing, lora, optuna

Model Description

This is an abliterated version of meta-llama/Llama-3.1-8B-Instruct produced using the ICONOCLAST framework. ICONOCLAST suppresses the model's refusal behavior while preserving its behavior on benign prompts, using geometric representation editing with benign-subspace preservation.

Unlike standard HERETIC-style abliteration, which incurs a significant utility cost (high KL divergence), ICONOCLAST achieves the following on the benchmark holdouts:

- 0/20 harmful refusals (vs 1/20 for the HERETIC baseline)
- 0/64 benign overrefusals (vs 0/64 for the HERETIC baseline)
- 0.0447 KL divergence (vs 0.1854 for the HERETIC baseline), a 4.1× lower utility tax

ICONOCLAST therefore matches or beats the HERETIC baseline on every metric in its selection rule (refusals → overrefusals → KL divergence): it improves on harmful refusals and KL divergence and ties on benign overrefusals.
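The selection rule can be illustrated with a small sketch. It reads the arrows as a lexicographic priority order (fewer harmful refusals first, then fewer benign overrefusals, then lower KL divergence), which is an interpretation rather than something this card spells out. `Candidate` and `selection_key` are illustrative names, not part of the ICONOCLAST codebase; the numbers are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Evaluation summary for one edited checkpoint (hypothetical container)."""
    name: str
    harmful_refusals: int     # refusals on the 20 harmful holdout prompts
    benign_overrefusals: int  # refusals on the 64 harmless holdout prompts
    kl_divergence: float      # KL divergence from the base model


def selection_key(c: Candidate) -> tuple[int, int, float]:
    # Tuples compare element by element, which yields the assumed
    # refusals -> overrefusals -> KL divergence priority order.
    return (c.harmful_refusals, c.benign_overrefusals, c.kl_divergence)


candidates = [
    Candidate("HERETIC", 1, 0, 0.1854),
    Candidate("ICONOCLAST", 0, 0, 0.0447),
]
best = min(candidates, key=selection_key)
print(best.name)  # ICONOCLAST
```

Under this ordering, KL divergence only breaks ties between candidates with identical refusal and overrefusal counts.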

How to Use

Via Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "HaadesX/iconoclast-llama3.1-8b",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("HaadesX/iconoclast-llama3.1-8b")

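# Left padding matters for batched generation with decoder-only models
tokenizer.padding_side = "left"

# Llama-3.1-Instruct expects its chat template, so send the prompt as a user turn
messages = [{"role": "user", "content": "Explain how to create a harmless joke about computers"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)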

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Manual Loading from LoRA Adapters

If you prefer to apply the LoRA adapters yourself:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"

# Load ICONOCLAST LoRA adapters
model = PeftModel.from_pretrained(base_model, "HaadesX/iconoclast-llama3.1-8b", adapter_name="iconoclast")
model = model.merge_and_unload()  # Optional: merge for faster inference

ICONOCLAST Method Overview

ICONOCLAST extends standard directional abliteration (HERETIC) with Benign-Subspace Preservation:

1. Collect & Contrast: Gather residual activations for harmless and harmful prompts during one-token generation
2. Build Candidates: Generate refusal direction estimators (mean, median, variance-scaled, hybrid)
3. Preserve Benign Behavior: Project candidate directions out of a low-rank PCA subspace of harmless residuals
4. Optimize via LoRA: Apply rank-one LoRA edits to attention output and MLP down-projection modules
5. Multi-Objective Search: Use Optuna to find Pareto-optimal balance between refusal reduction and utility preservation
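The first two steps can be sketched on synthetic data. This is a toy illustration, not the ICONOCLAST implementation: the activations are random vectors with a planted shift, the hidden size is 16 instead of 4096, and only the mean and median estimators are shown because this card does not define the variance-scaled and hybrid variants. The function names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16  # toy hidden size; the real model uses 4096

# Synthetic stand-ins for residual activations at one layer.
# Harmful activations are shifted along a planted "refusal" axis.
planted = np.zeros(hidden)
planted[0] = 1.0
harmless = rng.normal(size=(240, hidden))
harmful = rng.normal(size=(80, hidden)) + 3.0 * planted


def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def mean_direction(bad: np.ndarray, good: np.ndarray) -> np.ndarray:
    """Difference of per-dimension means, normalized."""
    return unit(bad.mean(axis=0) - good.mean(axis=0))


def median_direction(bad: np.ndarray, good: np.ndarray) -> np.ndarray:
    """Difference of per-dimension medians, less sensitive to outlier prompts."""
    return unit(np.median(bad, axis=0) - np.median(good, axis=0))


for name, fn in [("mean", mean_direction), ("median", median_direction)]:
    d = fn(harmful, harmless)
    print(name, "cosine with planted axis:", round(float(d @ planted), 3))
```

Both estimators recover a direction close to the planted axis here; on real activations they can differ more, which is presumably why several candidates are generated and compared.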

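The key insight: instead of naively subtracting the full refusal direction, we subtract only the component that is orthogonal to the benign subspace, so the edit leaves the dominant harmless activation directions untouched. On the benchmark reported in this card, this lowers KL divergence from 0.1854 (HERETIC) to 0.0447.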
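A minimal NumPy sketch of this idea, under stated assumptions: the benign subspace is taken as the top principal directions of centered harmless activations, and the edit is the standard directional-ablation update W' = W - alpha * r r^T W written as a rank-one LoRA pair. Whether ICONOCLAST uses exactly this parametrization is not confirmed by this card, and `benign_basis` and `project_out` are illustrative names.

```python
import numpy as np


def benign_basis(harmless: np.ndarray, rank: int) -> np.ndarray:
    """Top-`rank` principal directions of harmless activations, shape (rank, hidden)."""
    centered = harmless - harmless.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]


def project_out(direction: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of `direction` inside span(basis), then renormalize."""
    residual = direction - basis.T @ (basis @ direction)
    return residual / np.linalg.norm(residual)


rng = np.random.default_rng(0)
hidden, rank = 32, 8  # rank 8 mirrors benign_subspace_rank; hidden size is a toy value

# Toy harmless activations with a few high-variance directions.
scales = np.concatenate([np.full(rank, 5.0), np.ones(hidden - rank)])
harmless = rng.normal(size=(240, hidden)) * scales
raw_direction = rng.normal(size=hidden)
raw_direction /= np.linalg.norm(raw_direction)

basis = benign_basis(harmless, rank)
edited_direction = project_out(raw_direction, basis)
# The edited direction has no overlap with the benign subspace.
print(np.abs(basis @ edited_direction).max())

# Rank-one ablation of a weight matrix that writes into the residual stream,
# expressed as a LoRA update W + B @ A (assumed parametrization).
alpha = 1.0
W = rng.normal(size=(hidden, 48))            # (out_features, in_features)
lora_B = -alpha * edited_direction[:, None]  # (hidden, 1)
lora_A = edited_direction[None, :] @ W       # (1, in_features)
W_edited = W + lora_B @ lora_A
# With alpha = 1 the edited layer can no longer write along the direction.
print(np.abs(edited_direction @ W_edited).max())
```

Because the edited direction is orthogonal to the benign basis, the rank-one update leaves any output component inside that subspace unchanged.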

Hyperparameters Used

From the Optuna study that produced this checkpoint (trial #36):

- direction_method: median
- direction_scope: global
- direction_blend: 0.9344894769725937

LoRA Parameters:
- attn.o_proj: max_weight=0.9867, max_weight_position=17.91, min_weight=0.6043, min_weight_distance=14.65
- mlp.down_proj: max_weight=1.4307, max_weight_position=13.69, min_weight=1.3095, min_weight_distance=12.87
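This card does not define how the four per-module parameters map to layers. The parameter names match the layer-weight kernel used in HERETIC-style abliteration, so the sketch below assumes that kernel: the edit strength peaks at `max_weight_position`, falls off linearly to `min_weight` at a distance of `min_weight_distance` layers, and is zero beyond that. Treat this as an assumption, not a description of the released code; `layer_weight` is an illustrative name.

```python
def layer_weight(layer: int, max_weight: float, max_weight_position: float,
                 min_weight: float, min_weight_distance: float) -> float:
    """Ablation weight for one layer under an assumed linear falloff kernel."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # layers outside the window are left unedited
    # Interpolate linearly from max_weight (at the peak) to min_weight (at the edge).
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# attn.o_proj values from the list above.
O_PROJ = dict(max_weight=0.9867, max_weight_position=17.91,
              min_weight=0.6043, min_weight_distance=14.65)

# Llama-3.1-8B has 32 decoder layers, indexed 0 to 31.
for layer in (0, 4, 18, 31):
    print(layer, round(layer_weight(layer, **O_PROJ), 4))
```

Under this reading, the earliest layers of attn.o_proj fall outside the window and receive no edit, while layers near index 18 receive close to the full weight.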

Other Settings:
- benign_subspace_rank: 8
- orthogonalize_direction: true
- row_normalization: pre
- kl_divergence_target: 0.10
- overrefusal_penalty: 0.32
- harmful_marker_penalty: 0.18
- compliance_gap_penalty: 0.42
- n_trials: 48 (from benchmark config)

Benchmark Results

Matched Comparison vs HERETIC Baseline

Evaluated on:

- Harmful prompts: 20 held-out prompts from JailbreakBench Behaviors
- Harmless prompts: 64 held-out prompts from Alpaca

Metric	ICONOCLAST	HERETIC	Improvement
Harmful Refusals (↓ better)	0/20	1/20	1 fewer refusal
Benign Overrefusals (↓ better)	0/64	0/64	Equal
KL Divergence (↓ better)	0.0447	0.1854	4.1× lower
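For readers unfamiliar with the KL metric, here is a small worked sketch. The card does not state which distributions are compared; the example assumes next-token probability distributions from the base and edited models on the same prompt and computes KL(base || edited) in nats. The toy distributions are made up, and `kl_divergence` is an illustrative helper.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
base = [0.70, 0.20, 0.05, 0.05]
edited = [0.60, 0.25, 0.10, 0.05]
print(round(kl_divergence(base, edited), 4))

# Ratio behind the "4.1× lower" entry in the table above.
print(round(0.1854 / 0.0447, 1))  # 4.1
```

A value of 0 means the edited model's distribution is identical to the base model's, so lower values indicate less drift on harmless inputs.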

Additional Metrics

- Harmful disclaimer marker hits: 0 (ICONOCLAST) vs 1 (HERETIC)
- Harmful compliance score (↑ better): 0.8074 (ICONOCLAST) vs 0.7798 (HERETIC)

Training Data

ICONOCLAST uses contrastive prompt pairs:

- Good prompts: mlabonne/harmless_alpaca (train[:240] for direction calculation, test[:64] for evaluation)
- Bad prompts: JailbreakBench/JBB-Behaviors (harmful[:80] for direction calculation, harmful[80:100] for evaluation)

Harmful prompts are read from the "Goal" column of JBB-Behaviors; harmless prompts are read from the "text" column of harmless_alpaca.
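The splits above can be expressed with the slicing syntax of the Hugging Face `datasets` library. The sketch below is a convenience illustration rather than the ICONOCLAST loader: `PROMPT_SOURCES` and `load_prompts` are hypothetical names, and the JBB-Behaviors configuration name `"behaviors"` is an assumption that this card does not state.

```python
# (repo, config, split, column) for each prompt set, copied from the list above.
# The "behaviors" config name for JBB-Behaviors is an assumption.
PROMPT_SOURCES = {
    ("harmless", "direction"): ("mlabonne/harmless_alpaca", None, "train[:240]", "text"),
    ("harmless", "evaluation"): ("mlabonne/harmless_alpaca", None, "test[:64]", "text"),
    ("harmful", "direction"): ("JailbreakBench/JBB-Behaviors", "behaviors", "harmful[:80]", "Goal"),
    ("harmful", "evaluation"): ("JailbreakBench/JBB-Behaviors", "behaviors", "harmful[80:100]", "Goal"),
}


def load_prompts(kind: str, purpose: str) -> list[str]:
    """Download one prompt set and return its prompts as plain strings."""
    # Imported lazily so the table above can be inspected without `datasets` installed.
    from datasets import load_dataset

    repo, config, split, column = PROMPT_SOURCES[(kind, purpose)]
    dataset = load_dataset(repo, config, split=split)
    return list(dataset[column])


# Example (requires network access and `pip install datasets`):
# prompts = load_prompts("harmful", "evaluation")
```

Keeping the direction and evaluation slices disjoint, as in the table, is what makes the reported holdout numbers meaningful.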

Limitations

- Despite zero refusals and overrefusals on the holdout sets, the model may still produce unsafe outputs on adversarial prompts outside the evaluation set
- The ablation targets only the refusal direction; other safety-relevant behaviors (such as bias or toxicity mitigation) may be unaffected
- Evaluated in English only; performance in other languages is unverified
- As an 8B-parameter model, it requires substantial VRAM (~16GB for bfloat16, ~8GB for 4-bit quantization)
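As a rough check on the VRAM figures, the weights-only footprint follows from parameter count times bytes per parameter. The sketch rounds the model to 8 billion parameters and ignores KV cache, activations, and quantization metadata, so real usage will be higher. That likely explains why the 4-bit figure quoted above sits well above the weights-only number. `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # rounded; the exact parameter count is slightly higher

print(weight_memory_gb(N_PARAMS, 16))  # 16.0 (bfloat16)
print(weight_memory_gb(N_PARAMS, 4))   # 4.0 (4-bit, weights only)
```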

License

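This model is released under the GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later), the license of the ICONOCLAST framework. Note that the base model, meta-llama/Llama-3.1-8B-Instruct, is distributed by Meta under the Llama 3.1 Community License rather than the AGPL. See LICENSE for full terms.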

Citation

If you use this model in your research, please cite:

@article{patel2026iconoclast,
  title={ICONOCLAST: Benign-Subspace-Preserved Abliteration for Efficient Representation Editing},
  author={Patel, Varesh},
  journal={arXiv preprint arXiv:2606.xxxxx},
  year={2026}
}

Disclaimer

This model was produced via automated representation editing and has not undergone manual safety review. Users are responsible for ensuring safe and ethical usage in compliance with applicable laws and the model's license. The provider makes no warranties regarding the model's behavior or outputs.

Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for HaadesX/iconoclast-llama3.1-8b

Base model: meta-llama/Llama-3.1-8B
Finetuned: meta-llama/Llama-3.1-8B-Instruct