---
license: llama2
train: false
inference: false
pipeline_tag: text-generation
---

This is an experimental HQQ 2-bit quantized Llama2-7B-chat model that uses a low-rank adapter to improve performance (referred to as HQQ+).

Quantizing small models at such extreme low bit-widths is a challenging task. The purpose of this model is to show the community what to expect when fine-tuning such models. We notice that, when given more specialized data, the low-bit model can even outperform the full-precision model on some tasks.

This version offloads the metadata to the CPU, so only the 2-bit weights and the low-rank adapters are stored in GPU memory.

## Datasets
The adapter was trained via SFT on random subsets of the following:

### Base Model
* wikitext-2-raw-v1 (full)

### Chat Model
* timdettmers/openassistant-guanaco (full)
* microsoft/orca-math-word-problems-200k (10K)
* meta-math/MetaMathQA (10K)
* HuggingFaceH4/ultrafeedback_binarized (10K - chosen answers only)

## Performance
| Models             | Llama2-7B (fp16) | Llama2-7B (HQQ 2-bit) | Llama2-7B (HQQ+ 2-bit) | Quip# (2-bit) |
|--------------------|------------------|-----------------------|------------------------|---------------|
| Wiki Perplexity    | 5.18             | 6.06                  | 5.14                   | 8.54          |
| VRAM (GB)          | 13.5             | 2.6                   | 2.69                   | 2.72          |
| Forward time (sec) | 0.1              | 0.221                 | 0.27                   | 0.353         |

| Models              | Llama2-7B-chat (fp16) | Llama2-7B-chat (HQQ 2-bit) | Llama2-7B-chat (HQQ+ 2-bit) |
|---------------------|-----------------------|----------------------------|-----------------------------|
| ARC (25-shot)       | 53.67                 | 45.56                      | 47.01                       |
| HellaSwag (10-shot) | 78.56                 | 73.59                      | 73.74                       |
| MMLU (5-shot)       | 48.16                 | 43.18                      | 43.33                       |
| TruthfulQA-MC2      | 45.32                 | 43.1                       | 42.66                       |
| Winogrande (5-shot) | 72.53                 | 67.32                      | 71.51                       |
| GSM8K (5-shot)      | 23.12                 | 9.7                        | 28.43                       |
| Average             | 53.56                 | 47.08                      | 51.11                       |

## Usage
First, install the latest version of HQQ:
```
pip install git+https://github.com/mobiusml/hqq.git
```

Then you can use the sample code below:
``` Python
import torch, transformers
from threading import Thread
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# Load the quantized model and its low-rank adapter
model_id  = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq'
model     = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Setup inference mode
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if not tokenizer.pad_token:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.use_cache = True
model.eval();

# Optional: torch compile for faster inference
# model = torch.compile(model)

# Streaming inference
def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        tokenizer(" [INST] " + chat + " [/INST] ", return_tensors="pt").to(device),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.pad_token_id,
        top_p=0.90 if do_sample else None,
        top_k=50 if do_sample else None,
        temperature=0.6 if do_sample else None,
        num_beams=1,
        repetition_penalty=1.2,
    )

    # Run generation in a background thread and stream tokens as they arrive
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()

    print("User: ", chat)
    print("Assistant: ")
    outputs = ""
    for text in streamer:
        outputs += text
        print(text, end="", flush=True)

    torch.cuda.empty_cache()

    return outputs
```

### Example
``` Python
outputs = chat_processor("If you had 5 apples yesterday and you ate 2 today morning, how many apples do you have this evening?", max_new_tokens=1000, do_sample=False)
```
```
User:  If you had 5 apples yesterday and you ate 2 today morning, how many apples do you have this evening?
Assistant:  You started with 5 apples. You ate 2 of them so now you have 5-2=3 apples left. So by the evening you will still have 3 apples.
```
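
The `chat_processor` helper above streams tokens as they are generated. If you only need a single non-streaming completion, a plain `model.generate` call on the already-loaded model works as well. The snippet below is a minimal sketch under that assumption; the prompt and generation settings are illustrative and are not taken from the example above.

``` Python
# Minimal non-streaming sketch: reuses the `model` and `tokenizer` loaded in the Usage section.
# The prompt and settings here are illustrative, not part of the original example.
prompt = " [INST] What is the capital of France? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

with torch.no_grad():
    out = model.generate(**inputs,
                         max_new_tokens=64,
                         do_sample=False,
                         pad_token_id=tokenizer.pad_token_id)

# Decode only the newly generated tokens (everything after the prompt)
print(tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))
```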