---
license: llama3
train: false
inference: false
pipeline_tag: text-generation
---
This is an experimental <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 2-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct"> Llama3-8B-Instruct</a> model.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)

Llama3-8B is known to be relatively difficult to quantize, especially at lower bit-widths, as pointed out in https://arxiv.org/abs/2404.14047.<br>
This 2-bit model has been calibrated with a low-rank adapter (HQQ+) to significantly improve the output quality, since one-shot 2-bit quantization results in a significant quality loss.
Moreover, this model is fully compatible with <a href="https://github.com/microsoft/BitBLAS"> BitBLAS </a> and `torch.compile` for fast inference.

![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq/resolve/main/llama3-2bit.gif)

## Model Size
| Models            | fp16| HQQ+ 2-bit/gs-64| 
|:-------------------:|:--------:|:----------------:|
| Bitrate (linear layers, bits/weight)    |   16           |  2.63           |
| VRAM (GB)    |   15.7       |  4.3          |
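
For intuition, the effective bitrate can be roughly estimated from the quantization settings: 2-bit weights plus per-group metadata (a scale and a zero-point for every group of 64 weights). The snippet below is a back-of-the-envelope sketch; it assumes fp16 group metadata and does not count the low-rank (HQQ+) adapter parameters, which is why the table above reports a slightly higher 2.63 bits.

```Python
# Rough estimate of the effective bitrate for 2-bit HQQ with group_size=64.
# Assumption: the per-group scale and zero-point are kept in fp16
# (quant_scale=False, quant_zero=False, as in the quant config in the Usage
# section below). The HQQ+ adapter adds extra parameters on top of this.
nbits      = 2        # bits per weight
group_size = 64       # weights per quantization group
meta_bits  = 16 + 16  # fp16 scale + fp16 zero-point per group

effective_bits = nbits + meta_bits / group_size
print(f"~{effective_bits:.2f} bits/weight before adapter overhead")  # ~2.50
```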

## Model Decoding Speed
| Models            | fp16| HQQ+ 2-bit/gs-64| 
|:-------------------:|:--------:|:----------------:|
| Decoding* - short seq (tokens/sec)|  53            |    120     |
| Decoding* - long seq (tokens/sec) |  50            |    95      |

*: measured on an RTX 3090

## Performance

| Models            | fp16             | HQQ+ 2-bit/gs-64 |
|:-------------------:|:--------:|:----------------:|
| ARC (25-shot)     |   62.2           |  38.82           |
| HellaSwag (10-shot)|  78.78          |  61.09           |
| MMLU (5-shot)     |   67.06          |  38.02           |
| TruthfulQA-MC2    |   51.65          |  50.08           |
| Winogrande (5-shot)|  75.85          |  63.22           |
| GSM8K (5-shot)    |   75.97          |  26.31           |
| Average           |   68.59          |  46.26           |

While this is significantly better than the best 2-bit Llama3-8B results reported in https://arxiv.org/abs/2404.14047 (DB-LLM: 42.1 on HellaSwag and 60.4 on Winogrande), in practice it appears better to simply use a <a href="https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq">4-bit Llama2-7B-chat</a> instead.

## Usage
First, install the dependencies:
```
pip install hqq==0.1.8
pip install bitblas
```

Then you can use the sample code below:
```Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *
from hqq.utils.patching import *
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq' 
model     = HQQModelForCausalLM.from_quantized(model_id, cache_dir='.', compute_dtype=torch.float16, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

patch_linearlayers(model, patch_add_quant_config, 
                          BaseQuantizeConfig(nbits=2, group_size=64, quant_scale=False, quant_zero=False, axis=1))

model.eval();
cleanup()

#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="bitblas", allow_merge=False) #It takes a while...

#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter 
#gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile=None) #Slower generation but no warm-up 
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Faster generation, but warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)

```
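
If you prefer not to use `HFGenerator`, a plain `transformers`-style generation call should also work, since the HQQ engine wraps a regular causal-LM model. The following is a minimal sketch under that assumption, reusing the `model` and `tokenizer` objects created above; double-check the exact behavior against the `hqq` version you installed.

```Python
# Minimal sketch: transformers-style generation without HFGenerator.
# Assumes `model` and `tokenizer` are the objects created above and that the
# quantized model exposes the usual .generate() API (not verified here).
import torch

messages  = [{"role": "user", "content": "How to make a yummy chocolate cake?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```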