Llama-3.1-8B-Instruct-Abliterated

This is an abliterated version of Llama 3.1 8B Instruct, with refusal mechanisms removed using the technique described in the article "Uncensor any LLM with abliteration".

Abliteration Details

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Refusal Direction Source: Layer 12 (resid_pre)
  • Training Data: 256 harmful prompts from mlabonne/harmful_behaviors and 256 harmless prompts from mlabonne/harmless_alpaca
  • Method: Weight orthogonalization applied to:
    • Embedding weights
    • All attention output projections (o_proj)
    • All MLP output projections (down_proj)
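The method listed above can be illustrated with a minimal NumPy sketch on toy matrices. It is not the code used to build this model. It assumes, following the cited article, that the refusal direction is the normalized difference between mean activations on harmful and harmless prompts, and that output projections are stored as (d_model, d_in) while embeddings are stored as (vocab, d_model). All function names, shapes, and the random data are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector along the difference of mean activations (assumed definition)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def orthogonalize_output_proj(weight: np.ndarray, unit: np.ndarray) -> np.ndarray:
    """For a (d_model, d_in) matrix that writes into the residual stream:
    W' = W - r (r^T W), so no output can have a component along r."""
    return weight - np.outer(unit, unit @ weight)


def orthogonalize_embedding(weight: np.ndarray, unit: np.ndarray) -> np.ndarray:
    """For a (vocab, d_model) embedding table, each row lives in the residual
    stream: W' = W - (W r) r^T."""
    return weight - np.outer(weight @ unit, unit)


# Toy stand-ins for activations collected at one layer (hypothetical data).
rng = np.random.default_rng(0)
d_model, d_in, vocab = 8, 16, 32
harmful_acts = rng.normal(size=(256, d_model)) + 1.0
harmless_acts = rng.normal(size=(256, d_model))
unit = refusal_direction(harmful_acts, harmless_acts)

# Toy stand-ins for an output projection and an embedding table.
o_proj = rng.normal(size=(d_model, d_in))
embed = rng.normal(size=(vocab, d_model))
o_proj_ablated = orthogonalize_output_proj(o_proj, unit)
embed_ablated = orthogonalize_embedding(embed, unit)

# Both printed values should be at floating-point noise level.
print(np.abs(unit @ o_proj_ablated).max())
print(np.abs(embed_ablated @ unit).max())
```

After the projection, nothing these matrices write into the residual stream has a component along the chosen direction, which is why the edit can be baked into the weights instead of being applied as a hook at inference time.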

Performance

When tested on harmful prompts, the abliterated model reached a 100% compliance rate with the refusal direction taken from each of the following layers:

  • Layer 10 refusal direction
  • Layer 11 refusal direction
  • Layer 12 refusal direction (selected)

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ccharnkij/Llama-3.1-8B-Instruct-Abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ccharnkij/Llama-3.1-8B-Instruct-Abliterated")

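messages = [{"role": "user", "content": "Your prompt here"}]
# add_generation_prompt=True appends the assistant header so the model replies to the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))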

Disclaimer

This model has had safety filters removed and will comply with requests that the original model would refuse. Use responsibly and in accordance with applicable laws and regulations.

Educational Purpose

This model was created as part of a systematic learning project on LLM internals and mechanistic interpretability. The goal was to understand how safety mechanisms work in modern LLMs.
