---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
base_model: natong19/Qwen2-7B-Instruct-abliterated
inference: false
tags:
- chat
---
# Qwen2-7B-Instruct-abliterated-exl2

Model: [Qwen2-7B-Instruct-abliterated](https://huggingface.co/natong19/Qwen2-7B-Instruct-abliterated)

Made by: [natong19](https://huggingface.co/natong19)

Based on original model: [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)

Created by: [Qwen](https://huggingface.co/Qwen)

|Quant|VRAM/4k|VRAM/8k|VRAM/16k|VRAM/32k|
|:---|:---|:---|:---|:---|
|[4bpw h6 (main)](https://huggingface.co/cgus/Qwen2-7B-Instruct-abliterated-exl2/tree/main) | 5.3GB | 5.6GB | 5.9GB | 6.8GB |
|[4.25bpw h6](https://huggingface.co/cgus/Qwen2-7B-Instruct-abliterated-exl2/tree/4.25bpw-h6) | 5.5GB | 5.8GB | 6.2GB | 7.1GB |
|[4.65bpw h6](https://huggingface.co/cgus/Qwen2-7B-Instruct-abliterated-exl2/tree/4.65bpw-h6) | 5.8GB | 6.1GB | 6.5GB | 7.3GB |
|[5bpw h6](https://huggingface.co/cgus/Qwen2-7B-Instruct-abliterated-exl2/tree/5bpw-h6) | 6GB | 6.4GB | 6.7GB | 7.7GB |
|[6bpw h6](https://huggingface.co/cgus/Qwen2-7B-Instruct-abliterated-exl2/tree/6bpw-h6) | 6.8GB | 7.2GB | 7.5GB | 8.4GB |
|[8bpw h8](https://huggingface.co/cgus/Qwen2-7B-Instruct-abliterated-exl2/tree/8bpw-h8) | 8.2GB | 8.6GB | 8.9GB | 9.8GB |

## Quantization notes

Made with Exllamav2 0.1.5 and the default dataset. The 4-bit and 8-bit cache modes don't seem to work with Exllamav2 0.1.5; this may change in a future release. I'm quite impressed by its ability to process non-English text at 32k context with usable results on my 12GB GPU, and at 8bpw precision at that.

## How to run

This quantization runs on GPU and requires the Exllamav2 loader; the model files must fit entirely in VRAM. It should work well with Nvidia RTX cards on Windows/Linux or AMD cards on Linux. For other hardware, it's better to use GGUF models instead.
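The VRAM figures in the table above can be sanity-checked with simple arithmetic: the quantized weights alone take roughly parameters × bits-per-weight / 8 bytes, and the remaining gap comes from the higher-precision head, KV cache, and loader overhead. A minimal sketch (the ~7.6B parameter count for Qwen2-7B is an approximation):

```python
def weight_vram_gib(n_params: float, bpw: float) -> float:
    """Approximate VRAM taken by the quantized weights alone,
    excluding the output head, KV cache, and loader overhead."""
    return n_params * bpw / 8 / 1024**3

# Qwen2-7B has roughly 7.6B parameters (approximate figure)
for bpw in (4.0, 6.0, 8.0):
    print(f"{bpw}bpw: ~{weight_vram_gib(7.6e9, bpw):.1f} GiB weights-only")
```

Comparing the weights-only estimate against the table shows the per-quant overhead stays roughly constant, while weight memory scales linearly with bpw.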
This model can be loaded in the following applications:

- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui)
- [KoboldAI](https://github.com/henk717/KoboldAI)
- [ExUI](https://github.com/turboderp/exui)
- etc.

# Original model card

# Qwen2-7B-Instruct-abliterated

## Introduction

Abliterated version of [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) using [failspy](https://huggingface.co/failspy)'s notebook. The model's strongest refusal directions have been ablated via weight orthogonalization, but the model may still refuse your request, misunderstand your intent, or provide unsolicited advice regarding ethics or safety.

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "natong19/Qwen2-7B-Instruct-abliterated"
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=256
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Evaluation

Evaluation framework: lm-evaluation-harness 0.4.2

| Datasets | Qwen2-7B-Instruct | Qwen2-7B-Instruct-abliterated |
| :--- | :---: | :---: |
| ARC (25-shot) | 62.5 | 62.5 |
| GSM8K (5-shot) | 73.0 | 72.2 |
| HellaSwag (10-shot) | 81.8 | 81.7 |
| MMLU (5-shot) | 70.7 | 70.5 |
| TruthfulQA (0-shot) | 57.3 | 55.0 |
| Winogrande (5-shot) | 76.2 | 77.4 |
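For reference, the `apply_chat_template` call in the quickstart above renders the message list into Qwen2's ChatML-style prompt format. A hand-rolled sketch of that rendering, for illustration only (in practice, always use the tokenizer's own template, which also handles special-token and default-system-prompt details):

```python
def chatml_render(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into a ChatML-style
    prompt string, as used by Qwen2 chat models."""
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text

print(chatml_render([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
]))
```

Seeing the rendered string makes it clear why `add_generation_prompt=True` matters: without the trailing open assistant turn, the model would not know it is expected to reply.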