---
license: llama3
train: false
inference: false
pipeline_tag: text-generation
---
This is an experimental <a href="https://github.com/mobiusml/hqq/">HQQ</a> all 2-bit (group-size=64) quantized <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct">Llama3-8B-Instruct</a> model.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)
Llama3-8B is known to be relatively difficult to quantize, especially at lower bit-widths, as pointed out in https://arxiv.org/abs/2404.14047.<br>
This 2-bit model was calibrated with a low-rank adapter (HQQ+) to significantly improve quality, since one-shot 2-bit quantization results in a significant quality loss.
Moreover, this model is fully compatible with <a href="https://github.com/microsoft/BitBLAS">BitBLAS</a> and `torch.compile` for fast inference.
![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq/resolve/main/llama3-2bit.gif)
## Model Size
| Models | fp16| HQQ+ 2-bit/gs-64|
|:-------------------:|:--------:|:----------------:|
| Bitrate (linear layers) | 16 | 2.63 |
| VRAM (GB) | 15.7 | 4.3 |
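The effective bitrate is easy to sanity-check. Below is a rough back-of-the-envelope sketch, assuming an fp16 scale and fp16 zero-point are stored per group of 64 weights (consistent with `quant_scale=False` and `quant_zero=False` in the usage example below); the small gap up to the reported 2.63 presumably comes from the HQQ+ low-rank adapter and other metadata:
``` Python
# Rough bitrate estimate for the quantized linear layers (a sketch, not the exact
# packing used by HQQ). Assumes an fp16 scale + fp16 zero-point per group of 64 weights.
nbits, group_size = 2, 64
meta_bits         = 16 + 16                    # fp16 scale + fp16 zero-point per group
bits_per_weight   = nbits + meta_bits / group_size
print(bits_per_weight)                         # 2.5 -- the gap to the reported 2.63
                                               # presumably comes from the HQQ+ adapter
```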
## Model Decoding Speed
| Models | fp16| HQQ+ 2-bit/gs-64|
|:-------------------:|:--------:|:----------------:|
| Decoding* - short seq (tokens/sec)| 53 | 120 |
| Decoding* - long seq (tokens/sec)| 50 | 95 |
*: measured on a single RTX 3090
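Decoding throughput can be checked with a simple timing loop. This is a minimal sketch, assuming `model` and `tokenizer` have been loaded on a CUDA device as in the usage example below; numbers will vary with prompt length, sampling settings, and hardware:
``` Python
# Minimal decoding-throughput measurement (a sketch; assumes `model` and `tokenizer`
# are loaded as in the Usage section below, on a CUDA device).
import time
import torch

prompt = "Write an essay about large language models"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs['input_ids'].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```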
## Performance
| Models | fp16 | HQQ+ 2-bit/gs-64 |
|:-------------------:|:--------:|:----------------:|
| ARC (25-shot) | 62.2 | 38.82 |
| HellaSwag (10-shot)| 78.78 | 61.09 |
| MMLU (5-shot) | 67.06 | 38.02 |
| TruthfulQA-MC2 | 51.65 | 50.08 |
| Winogrande (5-shot)| 75.85 | 63.22 |
| GSM8K (5-shot) | 75.97 | 26.31 |
| Average | 68.59 | 46.26 |
While this is significantly better than the best 2-bit Llama3-8B results reported in https://arxiv.org/abs/2404.14047 (DB-LLM: 42.1 on HellaSwag and 60.4 on Winogrande), in practice it is likely better to simply use a <a href="https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq">4-bit Llama2-7B-chat</a> model instead.
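The table above can be approximately reproduced with EleutherAI's lm-evaluation-harness (`pip install lm-eval`). The snippet below is a sketch, assuming the quantized model and tokenizer have already been loaded as in the usage example below and that the model behaves like a standard `transformers` model; scores may differ slightly depending on harness version and settings:
``` Python
# Sketch: score one benchmark row (ARC, 25-shot) with lm-evaluation-harness.
# Assumes `model` and `tokenizer` are loaded as in the Usage section below.
import lm_eval
from lm_eval.models.huggingface import HFLM

lm      = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = lm_eval.simple_evaluate(model=lm, tasks=["arc_challenge"], num_fewshot=25)
print(results["results"]["arc_challenge"])
```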
## Usage
First, install the dependencies:
```
pip install hqq==0.1.8
pip install bitblas
```
Then you can use the sample code below:
``` Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *
from hqq.utils.patching import *
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id  = 'mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq'
model     = HQQModelForCausalLM.from_quantized(model_id, cache_dir='.', compute_dtype=torch.float16, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

patch_linearlayers(model, patch_add_quant_config,
                   BaseQuantizeConfig(nbits=2, group_size=64, quant_scale=False, quant_zero=False, axis=1))

model.eval();
cleanup()

#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="bitblas", allow_merge=False) #It takes a while...

#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter
#gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile=None)               #Slower generation but no warm-up
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup()  #Faster generation, but warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```
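If `HFGenerator` is not available in your `hqq` version, plain `transformers`-style generation should also work. This is a sketch, assuming the quantized model exposes the standard `generate()` API and uses the tokenizer's built-in Llama-3 chat template:
``` Python
# Fallback generation without HFGenerator (a sketch; assumes the quantized model
# exposes the standard transformers generate() API).
from transformers import TextStreamer

messages = [{"role": "user", "content": "Tell me a funny joke!"}]
inputs   = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to('cuda')

streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(inputs, max_new_tokens=256, do_sample=True, streamer=streamer)
```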