---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
base_model: natong19/Qwen2-7B-Instruct-abliterated
inference: false
tags:
- chat
---
# Qwen2-7B-Instruct-abliterated-exl2

- Model: Qwen2-7B-Instruct-abliterated
- Made by: natong19
- Based on original model: Qwen2-7B-Instruct
- Created by: Qwen
| Quant | VRAM/4k | VRAM/8k | VRAM/16k | VRAM/32k |
|---|---|---|---|---|
| 4bpw h6 (main) | 5.3GB | 5.6GB | 5.9GB | 6.8GB |
| 4.25bpw h6 | 5.5GB | 5.8GB | 6.2GB | 7.1GB |
| 4.65bpw h6 | 5.8GB | 6.1GB | 6.5GB | 7.3GB |
| 5bpw h6 | 6GB | 6.4GB | 6.7GB | 7.7GB |
| 6bpw h6 | 6.8GB | 7.2GB | 7.5GB | 8.4GB |
| 8bpw h8 | 8.2GB | 8.6GB | 8.9GB | 9.8GB |
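As a rough sanity check on the table, the weight footprint of an exl2 quant scales linearly with bits per weight. A minimal sketch, assuming roughly 7.6B parameters for Qwen2-7B (`weight_vram_gb` is a hypothetical helper, not part of any library); real usage adds the context cache and loader overhead on top of these figures:

```python
# Rough weight-only VRAM estimate for an exl2 quant.
# Assumes ~7.6B parameters; context cache and overhead come on top.
def weight_vram_gb(bpw: float, n_params: float = 7.6e9) -> float:
    bytes_total = n_params * bpw / 8  # bits per weight -> bytes
    return bytes_total / 1024**3      # bytes -> GiB

for bpw in (4.0, 5.0, 6.0, 8.0):
    print(f"{bpw}bpw ~ {weight_vram_gb(bpw):.1f}GB for weights alone")
```

At 4bpw this gives about 3.5GB for the weights, which is consistent with the 5.3GB total at 4k context once cache and overhead are included.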
## Quantization notes

Made with Exllamav2 0.1.5 and the default calibration dataset.
The quants don't seem to work with the 4-bit or 8-bit cache in Exllamav2 0.1.5; this may change in a future release.
I'm quite impressed that, on my 12GB GPU, the 8bpw quant can process non-English text at 32k context with usable results.
## How to run

This quantization runs on GPU and requires the Exllamav2 loader; the model files must be fully loaded into VRAM to work.
It should work well with Nvidia RTX cards on Windows/Linux or AMD cards on Linux. For other hardware, GGUF models are a better choice.
This model can be loaded in the following applications:

- Text Generation Webui
- KoboldAI
- ExUI, etc.
# Original model card: Qwen2-7B-Instruct-abliterated

## Introduction
Abliterated version of Qwen2-7B-Instruct using failspy's notebook. The model's strongest refusal directions have been ablated via weight orthogonalization, but the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.
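To illustrate the idea behind weight orthogonalization: a refusal direction `v` is projected out of a weight matrix `W`, i.e. `W' = (I - v vᵀ) W`, so the layer's output has no component along `v`. The pure-Python sketch below is illustrative only; failspy's notebook applies this to real transformer weight matrices, not toy arrays:

```python
# Toy sketch of weight orthogonalization: W' = (I - v v^T) W,
# so that W' @ x has zero component along the unit direction v.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def ablate(W, v):
    d = len(v)
    vTW = [dot(v, [W[k][j] for k in range(d)]) for j in range(d)]  # v^T W
    return [[W[i][j] - v[i] * vTW[j] for j in range(d)] for i in range(d)]

v = [0.6, 0.8, 0.0]            # unit-norm "refusal direction" (toy)
W = [[1.0, 2.0, 0.5],
     [0.0, 1.0, 3.0],
     [2.0, 0.5, 1.0]]
W_ablated = ablate(W, v)

x = [1.0, -2.0, 0.5]
print(dot(v, matvec(W_ablated, x)))  # ~0: output has no component along v
```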
## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "natong19/Qwen2-7B-Instruct-abliterated"
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=256
)
# Strip the prompt tokens so only the newly generated text remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
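The slicing step above exists because `generate` returns the prompt tokens followed by the continuation; the list comprehension keeps only the new tokens. A toy illustration with made-up token ids, no model required:

```python
# generate() returns prompt + new tokens; keep only the continuation.
input_ids = [[101, 7592, 102]]              # one prompt of 3 token ids (toy)
generated = [[101, 7592, 102, 2003, 2307]]  # prompt + 2 newly generated ids

trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated)]
print(trimmed)  # [[2003, 2307]]
```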
## Evaluation

Evaluation framework: lm-evaluation-harness 0.4.2
| Datasets | Qwen2-7B-Instruct | Qwen2-7B-Instruct-abliterated |
|---|---|---|
| ARC (25-shot) | 62.5 | 62.5 |
| GSM8K (5-shot) | 73.0 | 72.2 |
| HellaSwag (10-shot) | 81.8 | 81.7 |
| MMLU (5-shot) | 70.7 | 70.5 |
| TruthfulQA (0-shot) | 57.3 | 55.0 |
| Winogrande (5-shot) | 76.2 | 77.4 |
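As a quick summary of the table, the unweighted mean across the six benchmarks barely moves after abliteration. A small check using the values copied from the table above:

```python
# Scores copied from the evaluation table above.
base = {"ARC": 62.5, "GSM8K": 73.0, "HellaSwag": 81.8,
        "MMLU": 70.7, "TruthfulQA": 57.3, "Winogrande": 76.2}
ablit = {"ARC": 62.5, "GSM8K": 72.2, "HellaSwag": 81.7,
         "MMLU": 70.5, "TruthfulQA": 55.0, "Winogrande": 77.4}

def mean(scores):
    return sum(scores.values()) / len(scores)

print(f"base: {mean(base):.2f}, abliterated: {mean(ablit):.2f}")
# base: 70.25, abliterated: 69.88
```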