---
license: llama3
train: false
inference: false
pipeline_tag: text-generation
---
This is an experimental HQQ all 2-bit (group-size=64) quantized Llama3-8B-Instruct model.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/i0vpy66jdz3IlGQcbKqHe.png)

Llama3-8B is known to be relatively difficult to quantize, especially at lower bit widths, as pointed out by https://arxiv.org/abs/2404.14047.
This 2-bit model has been calibrated with a low-rank adapter (HQQ+) to significantly improve quality, since one-shot 2-bit quantization results in significant quality loss. Moreover, this model is fully compatible with BitBLAS and `torch.compile` for fast inference.

![image/gif](https://huggingface.co/mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq/resolve/main/llama3-2bit.gif)

## Model Size
| Models | fp16 | HQQ+ 2-bit/gs-64 |
|:-------------------:|:--------:|:----------------:|
| Bitrate (Linear layers) | 16 | 2.63 |
| VRAM (GB) | 15.7 | 4.3 |

## Model Decoding Speed
| Models | fp16 | HQQ+ 2-bit/gs-64 |
|:-------------------:|:--------:|:----------------:|
| Decoding* - short seq (tokens/sec) | 53 | 120 |
| Decoding* - long seq (tokens/sec) | 50 | 95 |

*: RTX 3090

## Performance
| Models | fp16 | HQQ+ 2-bit/gs-64 |
|:-------------------:|:--------:|:----------------:|
| ARC (25-shot) | 62.2 | 38.82 |
| HellaSwag (10-shot) | 78.78 | 61.09 |
| MMLU (5-shot) | 67.06 | 38.02 |
| TruthfulQA-MC2 | 51.65 | 50.08 |
| Winogrande (5-shot) | 75.85 | 63.22 |
| GSM8K (5-shot) | 75.97 | 26.31 |
| Average | 68.59 | 46.26 |

While this is significantly better than the best 2-bit Llama3-8B model reported in https://arxiv.org/abs/2404.14047 (DB-LLM: 42.1 for HellaSwag and 60.4 for Winogrande), it looks like it's actually better to just use a 4-bit Llama2-7B-chat instead.
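
To put the tables above in perspective, the short sketch below recomputes the averages and derives a few ratios (score retained per benchmark, VRAM reduction, decoding speed-up). All numbers are copied from the tables; the variable and function names are illustrative and not part of any library.

```python
# Scores copied from the Performance table: (fp16, HQQ+ 2-bit/gs-64).
scores = {
    "ARC (25-shot)": (62.2, 38.82),
    "HellaSwag (10-shot)": (78.78, 61.09),
    "MMLU (5-shot)": (67.06, 38.02),
    "TruthfulQA-MC2": (51.65, 50.08),
    "Winogrande (5-shot)": (75.85, 63.22),
    "GSM8K (5-shot)": (75.97, 26.31),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Averages over the six benchmarks (the "Average" row of the table).
fp16_avg = mean(fp16 for fp16, _ in scores.values())
hqq_avg = mean(hqq for _, hqq in scores.values())

# Fraction of the fp16 score that the 2-bit model retains on each benchmark.
retained = {name: hqq / fp16 for name, (fp16, hqq) in scores.items()}

# Ratios from the Model Size and Model Decoding Speed tables.
vram_ratio = 15.7 / 4.3      # fp16 VRAM divided by 2-bit VRAM
speedup_short = 120 / 53     # short-sequence decoding speed-up
speedup_long = 95 / 50       # long-sequence decoding speed-up

print(f"average: fp16={fp16_avg:.2f}, 2-bit={hqq_avg:.2f}")
for name, frac in sorted(retained.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} retains {frac:.0%} of the fp16 score")
print(f"VRAM reduction: {vram_ratio:.1f}x")
print(f"decoding speed-up: {speedup_short:.1f}x (short), {speedup_long:.1f}x (long)")
```

Sorting by retained fraction shows that the loss is uneven: GSM8K and MMLU drop the most, while TruthfulQA-MC2 is nearly unchanged.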
## Usage
First, install the dependencies:
```
pip install hqq==0.1.8
pip install bitblas
```

Then you can use the sample code below:
``` Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *
from hqq.utils.patching import *
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, cache_dir='.', compute_dtype=torch.float16, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

patch_linearlayers(model, patch_add_quant_config,
                   BaseQuantizeConfig(nbits=2, group_size=64, quant_scale=False, quant_zero=False, axis=1))

model.eval()
cleanup()

#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="bitblas", allow_merge=False) #It takes a while...

#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter
#gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile=None) #Slower generation but no warm-up
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Faster generation, but warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```
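
The quantization settings used in the sample code (`nbits=2`, `group_size=64`, with `quant_scale=False` and `quant_zero=False`) imply a per-weight storage cost that can be estimated with simple arithmetic. The sketch below is only an estimate, not HQQ code: the helper `bits_per_weight` is hypothetical, and it assumes each group stores one scale and one zero-point at 16 bits each, which this model card does not state explicitly.

```python
def bits_per_weight(nbits, group_size, scale_bits=16, zero_bits=16):
    """Estimate storage per weight for group-wise quantization.

    Assumption (not confirmed by the model card): every group of
    `group_size` weights shares one scale and one zero-point, stored
    at `scale_bits` and `zero_bits` respectively.
    """
    metadata_bits = (scale_bits + zero_bits) / group_size
    return nbits + metadata_bits


# Settings from the BaseQuantizeConfig call in the sample code.
estimate = bits_per_weight(nbits=2, group_size=64)
print(estimate)  # 2.5
```

Under these assumptions the estimate is 2.5 bits per weight, a little below the 2.63 reported in the Model Size table. The sketch does not model whatever accounts for the gap; the low-rank adapter is one possible contributor, but the model card gives no breakdown.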