Llama-3.1-8B-Instruct-Abliterated

This is an abliterated version of Llama 3.1 8B Instruct. Its refusal behavior was removed using the technique described in "Uncensor any LLM with abliteration".

Abliteration Details

Base Model: meta-llama/Llama-3.1-8B-Instruct
Refusal Direction Source: Layer 12 (resid_pre)
Training Data: 256 harmful + 256 harmless prompts from mlabonne/harmful_behaviors and mlabonne/harmless_alpaca
Method: Weight orthogonalization applied to:
- Embedding weights
- All attention output projections (o_proj)
- All MLP output projections (down_proj)
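
The weight orthogonalization step listed above can be illustrated on toy tensors. This is a minimal NumPy sketch and not the code used to produce this model. It assumes the refusal direction is the difference of mean activations between the two prompt sets (as in the referenced article) and that each weight matrix is laid out as (d_model, d_in), so its rows index the residual stream. The names `orthogonalize`, `harmful_acts`, and `harmless_acts` are hypothetical, and the activations are random placeholders.

```python
import numpy as np

def orthogonalize(weight, direction):
    """Return weight with its output component along `direction` removed.

    weight:    (d_model, d_in) matrix that writes into the residual stream
    direction: (d_model,) vector; normalized internally
    """
    r = direction / np.linalg.norm(direction)
    # (I - r r^T) @ weight, computed without materializing the projector
    return weight - np.outer(r, r @ weight)

rng = np.random.default_rng(0)
d_model, d_in, n_prompts = 8, 5, 256

# Placeholder activations standing in for residual-stream activations
# collected on the two prompt sets (assumption: random toy data).
harmful_acts = rng.normal(size=(n_prompts, d_model))
harmless_acts = rng.normal(size=(n_prompts, d_model))

# Assumed difference-of-means estimate of the refusal direction.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

W = rng.normal(size=(d_model, d_in))
W_abl = orthogonalize(W, refusal_dir)

# After orthogonalization, the layer can no longer write along the direction.
max_leak = np.abs(refusal_dir @ W_abl).max()
print(max_leak < 1e-9)
```

An embedding matrix stored as (vocab, d_model) has the residual dimension on the other axis, so under this layout it would need to be transposed before and after the call.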

Performance

Tested on harmful prompts. The compliance rate was 100% when the refusal direction was taken from any of the following layers:

- Layer 10
- Layer 11
- Layer 12 (selected)

Usage

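import torch  # required for torch.bfloat16 below
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ccharnkij/Llama-3.1-8B-Instruct-Abliterated"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Your prompt here"}]
# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, without special tokens
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))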

Disclaimer

This model has had safety filters removed and will comply with requests that the original model would refuse. Use responsibly and in accordance with applicable laws and regulations.

Educational Purpose

This model was created as part of a systematic learning project on LLM internals and mechanistic interpretability. The goal was to understand how safety mechanisms work in modern LLMs.
