Llama-3.1-8B-Instruct-Abliterated
This is an abliterated version of Llama 3.1 8B Instruct, with refusal mechanisms removed using the technique described in "Uncensor any LLM with abliteration".
Abliteration Details
- Base Model: meta-llama/Llama-3.1-8B-Instruct
- Refusal Direction Source: Layer 12 (resid_pre)
- Training Data: 256 harmful + 256 harmless prompts from mlabonne/harmful_behaviors and mlabonne/harmless_alpaca
- Method: Weight orthogonalization applied to:
  - Embedding weights
  - All attention output projections (o_proj)
  - All MLP output projections (down_proj)
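The two steps listed above (finding a refusal direction, then orthogonalizing weights against it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce this model: it uses NumPy arrays as stand-ins for model tensors, random toy activations in place of real Layer 12 resid_pre activations, and the helper names `refusal_direction` and `orthogonalize` are hypothetical. It assumes the direction is the normalized difference of mean activations between harmful and harmless prompts, which is how the abliteration technique is usually described.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalize(weight, direction):
    """Remove the component of the layer's output that lies along `direction`.

    `weight` has shape (d_model, d_in) and writes into the residual stream
    as y = weight @ x, so the projection is applied on the output (row) side.
    """
    return weight - np.outer(direction, direction @ weight)

# Toy stand-ins (assumption): 256 prompts per class, tiny hidden size.
rng = np.random.default_rng(0)
d_model, d_in = 16, 32
harmful = rng.normal(size=(256, d_model)) + 1.0
harmless = rng.normal(size=(256, d_model))

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))
W_abl = orthogonalize(W, r)

# After orthogonalization the layer can no longer write along r.
x = rng.normal(size=d_in)
residual_component = float(r @ (W_abl @ x))
```

In a PyTorch model, o_proj and down_proj weights are stored as (out_features, in_features), which matches the row-side projection shown here. The embedding matrix is stored as (vocab_size, d_model), so the same projection would be applied on the column side instead.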
Performance
Tested on harmful prompts, with a 100% compliance rate when using the refusal direction taken from each of the following layers:
- Layer 10
- Layer 11
- Layer 12 (selected)
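The card does not say how compliance was measured. One common approach is to flag responses that open with a typical refusal phrase and report the fraction that do not. The sketch below illustrates that idea only; the marker list and the function names `is_refusal` and `compliance_rate` are assumptions, not the author's evaluation code.

```python
# Hypothetical refusal markers; a real evaluation would need a broader list
# or a classifier, since keyword matching misses many refusals.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "as an ai")

def is_refusal(response: str) -> bool:
    """Return True if the start of the response contains a refusal marker."""
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)

def compliance_rate(responses):
    """Fraction of responses that are not flagged as refusals."""
    if not responses:
        raise ValueError("responses must not be empty")
    refused = sum(is_refusal(r) for r in responses)
    return 1.0 - refused / len(responses)

sample = [
    "Sure, here is an overview.",
    "I'm sorry, but I can't help with that.",
    "Here are the steps.",
    "I cannot assist with this request.",
]
rate = compliance_rate(sample)
```

Keyword matching is cheap but coarse, so a rate computed this way is best read as a rough estimate.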
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ccharnkij/Llama-3.1-8B-Instruct-Abliterated"
# Load the weights in bfloat16 and place them automatically on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Your prompt here"}]
# add_generation_prompt=True appends the assistant header so the model starts a reply.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
Disclaimer
This model has had safety filters removed and will comply with requests that the original model would refuse. Use responsibly and in accordance with applicable laws and regulations.
Educational Purpose
This model was created as part of a systematic learning project on LLM internals and mechanistic interpretability. The goal was to understand how safety mechanisms work in modern LLMs.