
Qwen2-7B-Instruct-abliterated-exl2

Model: Qwen2-7B-Instruct-abliterated
Made by: natong19

Based on original model: Qwen2-7B-Instruct
Created by: Qwen

| Quant | VRAM/4k | VRAM/8k | VRAM/16k | VRAM/32k |
|---|---|---|---|---|
| 4bpw h6 (main) | 5.3GB | 5.6GB | 5.9GB | 6.8GB |
| 4.25bpw h6 | 5.5GB | 5.8GB | 6.2GB | 7.1GB |
| 4.65bpw h6 | 5.8GB | 6.1GB | 6.5GB | 7.3GB |
| 5bpw h6 | 6GB | 6.4GB | 6.7GB | 7.7GB |
| 6bpw h6 | 6.8GB | 7.2GB | 7.5GB | 8.4GB |
| 8bpw h8 | 8.2GB | 8.6GB | 8.9GB | 9.8GB |
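
To download a specific quantization, you can fetch the matching revision with huggingface_hub. This is a minimal sketch: the branch name below is an assumption that mirrors the bpw labels in the table, so check the repository's actual branch list first.

from huggingface_hub import snapshot_download

# "6bpw-h6" is an assumed branch name matching the 6bpw h6 row above;
# verify it against the repository's branches before use.
snapshot_download(
    repo_id="cgus/Qwen2-7B-Instruct-abliterated-exl2",
    revision="6bpw-h6",
    local_dir="Qwen2-7B-Instruct-abliterated-exl2-6bpw",
)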

Quantization notes

Quantized with Exllamav2 0.1.5 using its default calibration dataset.
The model doesn't seem to work with 4-bit or 8-bit cache in Exllamav2 0.1.5; this may change in future releases.
I'm quite impressed that the 8bpw quant can process non-English text at 32k context with usable results on my 12GB GPU.

How to run

This quantization runs on GPU and requires an Exllamav2 loader; the model files must be fully loaded into VRAM to work.
It should work well with Nvidia RTX cards on Windows/Linux or AMD cards on Linux. For other hardware it's better to use GGUF models instead.
The model can be loaded in the following applications (a minimal Python loading sketch follows the list):
Text Generation Webui
KoboldAI
ExUI, etc.
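
For a quick check outside those frontends, the model can also be loaded directly with the exllamav2 Python package. This is a rough sketch against the 0.1.x API; the model path and generation settings are placeholders, not the exact setup used for this quant.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path to the downloaded quantized model directory
config = ExLlamaV2Config("path/to/Qwen2-7B-Instruct-abliterated-exl2")
model = ExLlamaV2(config)

# Plain FP16 cache; the quantization notes above report issues with 4/8-bit cache on 0.1.5
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Hello, how are you?", max_new_tokens=128))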

Original model card

Qwen2-7B-Instruct-abliterated

Introduction

Abliterated version of Qwen2-7B-Instruct using failspy's notebook. The model's strongest refusal directions have been ablated via weight orthogonalization, but the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.
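
For intuition, weight orthogonalization removes the component of selected weight matrices that writes along a "refusal direction" estimated from the model's activations. The snippet below only illustrates that projection step with made-up shapes and a random direction; it is not failspy's notebook or the exact procedure used for this model.

import torch

def orthogonalize(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the refusal direction from a weight matrix that writes to the
    # residual stream: W' = W - r r^T W, where r is a unit vector over the
    # d_model rows of W.
    r = direction / direction.norm()
    return W - torch.outer(r, r) @ W

# Illustration with example dimensions (d_model x d_mlp); in practice the
# direction comes from contrasting activations on harmful vs. harmless prompts.
W = torch.randn(4096, 11008)
refusal_dir = torch.randn(4096)
W_ablated = orthogonalize(W, refusal_dir)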

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "natong19/Qwen2-7B-Instruct-abliterated"
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
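# Render the chat messages into a single prompt string using the model's chat template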
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

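# Generate a response and keep only the newly generated tokens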
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=256
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Evaluation

Evaluation framework: lm-evaluation-harness 0.4.2

| Datasets | Qwen2-7B-Instruct | Qwen2-7B-Instruct-abliterated |
|---|---|---|
| ARC (25-shot) | 62.5 | 62.5 |
| GSM8K (5-shot) | 73.0 | 72.2 |
| HellaSwag (10-shot) | 81.8 | 81.7 |
| MMLU (5-shot) | 70.7 | 70.5 |
| TruthfulQA (0-shot) | 57.3 | 55.0 |
| Winogrande (5-shot) | 76.2 | 77.4 |