theo77186/Llama-3-70B-Instruct-norefusal

---
license: llama3
---
# Llama 3 70B Instruct no refusal

This model applies orthogonal feature ablation, the technique described in
[Refusal in LLMs is mediated by a single direction](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction).
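The core of the ablation is a projection: given a unit "refusal direction" `r` in the residual stream, each weight matrix that writes into the residual stream is replaced by `W' = W - r rᵀ W`, so nothing the layer outputs has a component along `r`. Below is a minimal NumPy sketch of that step on a toy matrix. It is not the code used for this model (which has not been posted yet); the function name `orthogonalize_weight` and the `(d_model, d_in)` weight layout are assumptions for illustration. Matrices stored the other way round, such as a `(vocab, d_model)` embedding table, would need the transposed form `W - W r rᵀ`.

```python
import numpy as np


def orthogonalize_weight(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project `direction` out of everything `weight` writes to the residual stream.

    Hypothetical helper. Assumes `weight` has shape (d_model, d_in) and is used as
    y = weight @ x, so its outputs live in the residual stream of size d_model.
    """
    r = direction / np.linalg.norm(direction)  # unit refusal direction, shape (d_model,)
    # W' = W - r (r^T W): subtract each output's projection onto r
    return weight - np.outer(r, r @ weight)


# Toy sizes (d_model=4, d_in=3) with random values, not real Llama 3 weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
r = rng.standard_normal(4)
W_ablated = orthogonalize_weight(W, r)

# Any output of the ablated matrix is orthogonal to r (up to float error).
x = rng.standard_normal(3)
print(np.isclose((W_ablated @ x) @ r, 0.0))
```

Because the projection is folded into the weights, the modified model needs no extra hooks at inference time.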

Calibration data:
- 256 prompts from [jondurbin/airoboros-2.2](https://huggingface.co/datasets/jondurbin/airoboros-2.2)
- 256 prompts from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv)

The refusal direction is extracted between layers 40 and 41.
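In the linked post, the refusal direction is the difference between the mean residual-stream activation on harmful prompts and on harmless prompts, normalized to unit length. A minimal sketch of that computation follows, assuming the AdvBench prompts serve as the harmful set and the airoboros prompts as the harmless set, with activations taken at the layer boundary named above. The card does not state the token position used, and extracting real activations from the 70B model is out of scope here, so the arrays below are synthetic stand-ins and `refusal_direction` is a hypothetical name.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit difference-in-means direction between two sets of activations.

    Each input has shape (n_prompts, d_model): one residual-stream activation
    per prompt, all captured at the same layer and token position.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)  # shape (d_model,)


# Synthetic activations: 256 prompts per set, toy d_model of 8.
# The "harmful" set is shifted along the first axis to mimic a refusal feature.
rng = np.random.default_rng(0)
harmless = rng.standard_normal((256, 8))
harmful = rng.standard_normal((256, 8))
harmful[:, 0] += 5.0

direction = refusal_direction(harmful, harmless)
print(np.round(direction, 2))
```

With this toy data the recovered direction points almost entirely along the first axis; the resulting unit vector is what the weight orthogonalization step would project out.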

I haven't tested this model; like the 8B version, it may still refuse some instructions.
Use this model responsibly. I decline any liability resulting from its use.

I will post the code later.