---
license: llama3.1
train: false
inference: false
pipeline_tag: text-generation
---
This is an <a href="https://github.com/mobiusml/hqq/">HQQ</a> all-4-bit (group-size=64) quantized version of the <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct">Llama3.1-70B-Instruct</a> model.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)


## Model Size
| Models            | fp16| HQQ 4-bit/gs-64 | 
|:-------------------:|:--------:|:----------------:|
| Bitrate (Linear layers)    |   16         |  4.5 | 
| VRAM (GB)                  |   140       |  42.7 | 
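
The 4.5 bits/weight figure comes from storing a 4-bit weight plus a scale and zero-point for each group of 64 weights. As a rough back-of-the-envelope sketch (assuming fp16 scales/zero-points and ~70B parameters, almost all in linear layers; embeddings, the LM head and other non-quantized tensors add a few extra GB on top):

``` Python
# Rough estimate only; exact figures depend on which layers are quantized.
n_params   = 70e9
group_size = 64
bits_per_w = 4 + 2 * 16 / group_size          # 4-bit weight + fp16 scale and zero per group = 4.5
vram_gb    = n_params * bits_per_w / 8 / 1e9  # ~39.4 GB for the quantized linear layers
print(f"{bits_per_w} bits/weight -> ~{vram_gb:.1f} GB (plus non-quantized tensors)")
```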

## Model Decoding Speed
| Models            | fp16| HQQ 4-bit/gs-64| 
|:-------------------:|:--------:|:----------------:|
| Decoding - short seq (tokens/sec)| 10.5**  | 23* |  
| Decoding - long  seq (tokens/sec)| 9.5**   | 19*| 

**: 2xA100 80GB<br>
*:  1xA100 80GB
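
If you want to sanity-check throughput on your own hardware, a minimal timing sketch along these lines can be used once `model` and `tokenizer` are loaded as in the Usage section below. It assumes the quantized model still exposes the standard `transformers` `generate()` interface; actual numbers depend heavily on the GPU, backend and compilation settings.

``` Python
import time
import torch

# Rough decoding-speed check; `model` and `tokenizer` as in the Usage section below.
inputs = tokenizer("Write an essay about large language models", return_tensors="pt").to(model.device)

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=16, do_sample=False)  # warm-up
    torch.cuda.synchronize()
    t0 = time.time()
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - t0

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"~{new_tokens / elapsed:.1f} tokens/sec")
```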

## Performance

| Models            | fp16 | HQQ 4-bit/gs-64 |
|:-------------------:|:--------:|:--------:|
| ARC (25-shot)      | 70.31 | 70.22 | 
| HellaSwag (10-shot)| 86.40 | 86.39 | 
| MMLU (5-shot)      | 81.84 | 81.04 |
| TruthfulQA-MC2     | 59.83 | 60.39 | 
| Winogrande (5-shot)| 84.85 | 84.53 | 
| GSM8K (5-shot)     | 88.25 | 89.92 | 
| Average            | 78.58 | 78.75 |

You can reproduce the results above with `lm-eval` version `0.4.3` (`pip install lm-eval==0.4.3`).
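
For reference, a minimal sketch of running one of these evaluations through the `lm-eval` Python API is shown below. It assumes `model` and `tokenizer` are loaded and patched as in the Usage section further down, and that the HQQ-quantized model works with lm-eval's standard `HFLM` wrapper; task names and shot counts follow the table above.

``` Python
# Sketch only: evaluate the already-loaded quantized model with lm-eval==0.4.3.
import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=1)

# GSM8K with 5 shots, as in the table; other rows use different shot counts
# (e.g. 25 for ARC, 10 for HellaSwag).
results = lm_eval.simple_evaluate(model=lm, tasks=["gsm8k"], num_fewshot=5)
print(results["results"]["gsm8k"])
```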

## Usage
First, install the dependencies:
```
pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas 
```
Also, make sure you are using at least torch `2.4.0` or a nightly build. 

Then you can use the sample code below:
``` Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import *
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-70b-instruct_4bitgs64_hqq'

compute_dtype = torch.bfloat16 #bfloat16 for torchao, float16 for bitblas
cache_dir = '.'
model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)

#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="torchao_int4") 
#prepare_for_inference(model, backend="bitblas") #takes a while to init...

#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter 
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)

```
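
If you prefer to skip `HFGenerator` (for example, to avoid the compilation warm-up), the quantized model should also work with the plain `transformers` generation API. Below is a minimal sketch, assuming the model still behaves like a regular `transformers` model after patching and that the tokenizer ships the Llama-3.1 chat template:

``` Python
# Plain transformers generation with the chat template; `model` and `tokenizer` from the snippet above.
messages  = [{"role": "user", "content": "Tell me a funny joke!"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)

print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```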