---
license: llama2
train: false
inference: false
pipeline_tag: text-generation
---
This is an HQQ 4-bit quantized Llama2-7B-chat model without grouping, combined with a low-rank adapter to improve performance (referred to as HQQ+).

Running quantized models efficiently for inference requires fused matrix-vector multiplication kernels, and the kernels currently available place constraints on the choice of group-size and the axis along which quantization is performed. This model skips grouping so that it is compatible with the fast Marlin inference kernel and, more generally, with any kernel that operates along `axis=1`.

## Performance
| Models              | Llama2-7B-chat (fp16) | Llama2-7B-chat (HQQ+ 4-bit/no-gs) |
|---------------------|-----------------------|-----------------------------------|
| ARC (25-shot)       | 53.67                 | 48.46                             |
| HellaSwag (10-shot) | 78.56                 | 73.33                             |
| MMLU (5-shot)       | 48.16                 | 44.87                             |
| TruthfulQA-MC2      | 45.32                 | 43.27                             |
| Winogrande (5-shot) | 72.53                 | 71.67                             |
| GSM8K (5-shot)      | 23.12                 | 27.82                             |
| Average             | 53.56                 | 51.57                             |

## Usage
First, install the latest version of HQQ:
```
pip install git+https://github.com/mobiusml/hqq.git
pip install git+https://github.com/IST-DASLab/marlin.git # to use the Marlin backend
```
Make sure you use `pip install transformers==4.39.0`.

Then you can use the sample code below:
``` Python
import torch, os
os.environ["TOKENIZERS_PARALLELISM"] = "1"
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *
from hqq.utils.patching import *

# Load the model and its low-rank adapter
model_id  = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq'
model     = HQQModelForCausalLM.from_quantized(model_id, cache_dir='.', compute_dtype=torch.float16, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

patch_linearlayers(model, patch_add_quant_config,
                   BaseQuantizeConfig(nbits=4, group_size=None, quant_scale=False, quant_zero=False, axis=1))

HQQLinear.set_backend(HQQBackend.PYTORCH)
model.eval();

# Use optimized inference kernels
from hqq.utils.patching import prepare_for_inference
#prepare_for_inference(model) #default
#prepare_for_inference(model, backend="torchao_int4") #uses bfloat16
prepare_for_inference(model, backend="marlin", allow_merge=True) #uses float16

# Generate
from hqq.utils.generation_hf import HFGenerator
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial")

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
```
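
If you prefer the standard Transformers generation path over `HFGenerator`, the following is a minimal sketch that reuses the `model` and `tokenizer` objects from the snippet above. It assumes the returned model exposes the usual `generate()` API and that the tokenizer includes the Llama-2 chat template; the prompt, device placement, and sampling parameters are only illustrative.
``` Python
# Minimal optional sketch: assumes model.generate() is available and the
# tokenizer ships the Llama-2 chat template; adjust device/prompt as needed.
import torch

messages  = [{"role": "user", "content": "Explain 4-bit quantization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to('cuda')

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)

# Strip the prompt tokens and decode only the newly generated text
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```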