---
license: llama2
train: false
inference: false
pipeline_tag: text-generation
---

This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> 4-bit quantized <a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">Llama2-7B-chat</a> model <i>without grouping</i> that uses a low-rank adapter to improve performance (an approach referred to as <a href="https://mobiusml.github.io/1bit_blog/">HQQ+</a>).  
Grouping is disabled so that the model is compatible with the fast <a href="https://github.com/IST-DASLab/marlin/tree/master/marlin">Marlin</a> inference kernel.

Running quantized models efficiently at inference time requires fused matrix-vector multiplication kernels. The kernels currently available constrain the choice of group size and the axis along which quantization is performed.
Because this model is quantized along `axis=1` without grouping, it is compatible with all the kernels that operate along `axis=1`.
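
To make "no grouping along `axis=1`" concrete, here is a minimal sketch in plain Python. It assumes that `axis=1` means statistics are reduced over the input dimension, so every row of the weight matrix shares a single scale and zero-point. It uses simple round-to-nearest min/max quantization for illustration only; it is not the HQQ algorithm, and the helper names `quantize_rows` and `dequantize_rows` are hypothetical.

```python
import random

def quantize_rows(weights, nbits=4):
    """Round-to-nearest quantization with one (scale, zero) pair per row."""
    qmax = 2 ** nbits - 1
    q_rows, scales, zeros = [], [], []
    for row in weights:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant rows
        # Map each weight to an integer code in [0, qmax]
        q_rows.append([min(qmax, max(0, round((w - lo) / scale))) for w in row])
        scales.append(scale)
        zeros.append(lo)
    return q_rows, scales, zeros

def dequantize_rows(q_rows, scales, zeros):
    """Reconstruct approximate weights from the integer codes."""
    return [[q * s + z for q in row] for row, s, z in zip(q_rows, scales, zeros)]

random.seed(0)
# Toy weight matrix: 8 output rows, 64 input columns
W = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(8)]

Wq, scales, zeros = quantize_rows(W, nbits=4)
W_hat = dequantize_rows(Wq, scales, zeros)

# Largest absolute reconstruction error over the whole matrix
max_err = max(abs(a - b) for r, rh in zip(W, W_hat) for a, b in zip(r, rh))
print("scales per matrix:", len(scales), "max abs error:", max_err)
```

With grouping, each row would instead be split into chunks of `group_size` columns, each with its own scale and zero-point. That usually lowers the reconstruction error, but it is a layout that not every fused kernel supports.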

## Performance

| Benchmark           | Llama2-7B-chat (fp16) | Llama2-7B-chat (HQQ+ 4-bit/no-gs) |
|---------------------|-----------------------|-----------------------------------|
| ARC (25-shot)       | 53.67                 | 48.46                             |
| HellaSwag (10-shot) | 78.56                 | 73.33                             |
| MMLU (5-shot)       | 48.16                 | 44.87                             |
| TruthfulQA-MC2      | 45.32                 | 43.27                             |
| Winogrande (5-shot) | 72.53                 | 71.67                             |
| GSM8K (5-shot)      | 23.12                 | 27.82                             |
| Average             | 53.56                 | 51.57                             |
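
The "Average" row is the unweighted mean of the six benchmark scores. The short check below recomputes it from the values in the table; the dictionary layout and variable names are only for illustration.

```python
# Scores copied from the table above: (fp16, HQQ+ 4-bit without grouping)
scores = {
    "ARC (25-shot)":       (53.67, 48.46),
    "HellaSwag (10-shot)": (78.56, 73.33),
    "MMLU (5-shot)":       (48.16, 44.87),
    "TruthfulQA-MC2":      (45.32, 43.27),
    "Winogrande (5-shot)": (72.53, 71.67),
    "GSM8K (5-shot)":      (23.12, 27.82),
}

# Unweighted mean over the six benchmarks, rounded like the table
fp16_avg = round(sum(fp for fp, _ in scores.values()) / len(scores), 2)
hqq_avg = round(sum(q for _, q in scores.values()) / len(scores), 2)

print(fp16_avg, hqq_avg)
```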

## Usage
First, install the latest version of <a href="https://github.com/mobiusml/hqq/">HQQ</a>:
```
pip install git+https://github.com/mobiusml/hqq.git
pip install git+https://github.com/IST-DASLab/marlin.git #to use the marlin backend
```
Make sure you are using `transformers` version 4.39.0, which you can install with `pip install transformers==4.39.0`.

Then you can use the sample code below:
```python
import os
import torch

#Enable tokenizer parallelism and allow TF32 matmuls
os.environ["TOKENIZERS_PARALLELISM"]  = "1"
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32       = True

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *
from hqq.utils.patching import *

#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq' 
model     = HQQModelForCausalLM.from_quantized(model_id, cache_dir='.', compute_dtype=torch.float16, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

#Attach the quantization config (4-bit, no grouping, axis=1) to the linear layers
patch_linearlayers(model, patch_add_quant_config, 
                          BaseQuantizeConfig(nbits=4, group_size=None, quant_scale=False, quant_zero=False, axis=1))

HQQLinear.set_backend(HQQBackend.PYTORCH)
model.eval()

#Use optimized inference kernels
from hqq.utils.patching import prepare_for_inference
#prepare_for_inference(model) #default
#prepare_for_inference(model, backend="torchao_int4") #use bfloat16
prepare_for_inference(model, backend="marlin", allow_merge=True) #use float16

#Generate
from hqq.utils.generation_hf import HFGenerator
#For longer context, make sure to allocate enough cache via the cache_size= parameter 
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial") 

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)

```