---
license: llama3.1
train: false
inference: false
pipeline_tag: text-generation
---
This is a Llama3.1-70B-Instruct model quantized with [HQQ](https://github.com/mobiusml/hqq), with all linear layers at 4-bit (group-size=64).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)
## Model Size
| Models | fp16| HQQ 4-bit/gs-64 |
|:-------------------:|:--------:|:----------------:|
| Bitrate (Linear layers) | 16 | 4.5 |
| VRAM (GB) | 140 | 42.7 |
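The 4.5-bit figure for the linear layers can be checked with a little arithmetic. The sketch below assumes each group of 64 weights stores one scale and one zero-point at 16-bit precision, which is consistent with the `quant_scale=False, quant_zero=False` settings in the Usage section but is still an assumption about the storage layout. The helper name `bits_per_weight` is hypothetical and not part of HQQ.

```python
def bits_per_weight(nbits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective bits per weight: quantized payload plus per-group metadata."""
    # Assumption: every group carries one scale and one zero-point,
    # both kept unquantized at `meta_bits` precision.
    overhead = 2 * meta_bits / group_size
    return nbits + overhead


print(bits_per_weight(4, 64))  # 4 + 32/64 = 4.5
```

The measured VRAM also includes layers that are not quantized and runtime buffers, so it is not expected to equal the parameter count times 4.5 bits.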
## Model Decoding Speed
| Models | fp16| HQQ 4-bit/gs-64|
|:-------------------:|:--------:|:----------------:|
| Decoding - short seq (tokens/sec)| 10.5** | 23* |
| Decoding - long seq (tokens/sec)| 9.5** | 19* |

\*\*: 2xA100 80GB

\*: 1xA100 80GB
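For a quick read of the table, the relative decoding speedup can be computed directly from the reported numbers. This is plain arithmetic on the figures above, not a new measurement, and it does not normalize for the different GPU counts (the fp16 run used two GPUs, the HQQ run one).

```python
# Decoding throughput in tokens/sec, copied from the table above.
fp16 = {"short": 10.5, "long": 9.5}  # measured on 2x A100 80GB
hqq = {"short": 23.0, "long": 19.0}  # measured on 1x A100 80GB

# Ratio of HQQ throughput to fp16 throughput for each sequence length.
speedup = {seq: hqq[seq] / fp16[seq] for seq in fp16}

for seq, ratio in speedup.items():
    print(f"{seq} seq: {ratio:.2f}x")
```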
## Performance
| Models | fp16 | HQQ 4-bit/gs-64 |
|:-------------------:|:--------:|:--------:|
| ARC (25-shot) | 70.31 | 70.22 |
| HellaSwag (10-shot)| 86.40 | 86.39 |
| MMLU (5-shot) | 81.84 | 81.04 |
| TruthfulQA-MC2 | 59.83 | 60.39 |
| Winogrande (5-shot)| 84.85 | 84.53 |
| GSM8K (5-shot) | 88.25 | 89.92 |
| Average | 78.58 | 78.75 |

You can reproduce the results above with `lm-eval` version 0.4.3 (`pip install lm-eval==0.4.3`).
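A minimal sketch of how such an evaluation run might look with the `lm-eval` Python API is shown below. The task identifiers are assumed to be the lm-eval 0.4.x names for the benchmarks in the table, the 0-shot setting for TruthfulQA-MC2 is an assumption (the table does not state it), and the `evaluate` helper and its `batch_size` default are hypothetical. It expects `model` and `tokenizer` loaded as in the Usage section; the exact settings behind the reported numbers may differ.

```python
# Few-shot counts from the table above, keyed by assumed lm-eval task names.
FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,  # assumption: shot count is not given in the table
    "winogrande": 5,
    "gsm8k": 5,
}


def evaluate(model, tokenizer, batch_size=8):
    """Run each benchmark with its few-shot setting and collect the results."""
    # Imported here so the settings above can be inspected without lm-eval.
    import lm_eval
    from lm_eval.models.huggingface import HFLM

    # Wrap the already-loaded model so lm-eval can drive it.
    lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=batch_size)
    results = {}
    for task, shots in FEWSHOT.items():
        out = lm_eval.simple_evaluate(model=lm, tasks=[task], num_fewshot=shots)
        results[task] = out["results"]
    return results
```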
## Usage
First, install the dependencies:
```
pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas
```
Also, make sure you use at least torch `2.4.0` or the nightly build.
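If you want to verify the torch requirement programmatically, a small check along these lines could be used. The helper `meets_minimum` is hypothetical; it only compares the leading numeric components of a version string, so nightly builds such as `2.5.0.dev20240801+cu121` are handled as well.

```python
import re


def meets_minimum(version: str, minimum: tuple = (2, 4, 0)) -> bool:
    """Return True if a torch-style version string is at least `minimum`."""
    core = version.split("+")[0]  # drop local build tags such as "+cu121"
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits, so "0a0" -> 0
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as "2.4"
        parts.append(0)
    return tuple(parts) >= minimum


print(meets_minimum("2.4.0+cu121"))
print(meets_minimum("2.3.1"))
```

In practice you would pass `torch.__version__` to the helper after `import torch`.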
Then you can use the sample code below:
``` Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
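from hqq.utils.patching import prepare_for_inference, patch_linearlayers, patch_add_quant_config
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend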
from hqq.utils.generation_hf import HFGenerator
#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-70b-instruct_4bitgs64_hqq'
compute_dtype = torch.bfloat16 #bfloat16 for torchao, float16 for bitblas
cache_dir = '.'
model = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)
#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="torchao_int4")
#prepare_for_inference(model, backend="bitblas") #takes a while to init...
#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while
gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```