---
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
---
## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ
This is a version of the Mixtral-8x7B-Instruct-v0.1 model quantized with a mix of 4-bit and 3-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 3-bit.

Unlike the 2bitgs8 version, which was designed to use less GPU memory, this one uses about 22 GB and targets users who want better quality and can use most of the VRAM available on a 24 GB GPU.
It reaches an impressive average score of 71.10 on the Open LLM Leaderboard benchmarks, not far from the original model's 72.62.

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)

----------------------------------------------------------------------------------------------------------------------------------
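If you want to check up front whether your GPU has enough free memory for the roughly 22 GB runtime footprint reported below, you can query the device before loading anything. This is only a minimal sketch using PyTorch's `torch.cuda.mem_get_info`; the 22 GB threshold is an illustrative value taken from the table below, not an exact requirement:
``` Python
import torch

# Free and total device memory (in bytes) on the current CUDA device
free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gb = free_bytes / 1024**3

# ~22 GB is the approximate runtime footprint of this quantized model (illustrative check)
if free_gb < 22:
    print(f"Only {free_gb:.1f} GB free: the model may not fit without further offloading.")
else:
    print(f"{free_gb:.1f} GB free: the model should fit.")
```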

## Performance

| Metric              | Mixtral Original | HQQ quantized |
|---------------------|------------------|---------------|
| Runtime VRAM        | 94 GB            | 22.3 GB       |
| ARC (25-shot)       | 70.22            | 69.62         |
| HellaSwag (10-shot) | 87.63            | 86.05         |
| MMLU (5-shot)       | 71.16            | 69.46         |
| TruthfulQA-MC2      | 64.58            | 62.63         |
| Winogrande (5-shot) | 81.37            | 81.06         |
| GSM8K (5-shot)      | 60.73            | 57.77         |
| Average             | 72.62            | 71.10         |

### Basic Usage
To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
``` Python
import transformers
from threading import Thread

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'

# Load the quantized model and its tokenizer
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

# Optional: set the backend. The CUDA kernels need to be installed beforehand:
#  git clone https://github.com/mobiusml/hqq/
#  cd hqq/kernels && python setup_cuda.py install
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)

def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        tokenizer(" [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=0.90,
        top_k=50,
        temperature=0.6,
        num_beams=1,
        repetition_penalty=1.2,
    )

    # Run generation in a background thread and stream the tokens as they are produced
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()

    outputs = []
    for text in streamer:
        outputs.append(text)
        print(text, end="", flush=True)

    return outputs

################################################################################################
# Generation
outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
```

### Quantization
You can reproduce the model with the following quantization configs:
``` Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# hf_auth is your Hugging Face access token, cache_path is your local cache directory
model    = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

# Quantization params: 4-bit attention layers, 3-bit experts, meta-data offloaded to the CPU
from hqq.core.quantize import *
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=3, group_size=64, offload_meta=True)

# Quantize the scales/zero-points with a group-size of 128
zero_scale_group_size                              = 128
attn_params['scale_quant_params']['group_size']    = zero_scale_group_size
attn_params['zero_quant_params']['group_size']     = zero_scale_group_size
experts_params['scale_quant_params']['group_size'] = zero_scale_group_size
experts_params['zero_quant_params']['group_size']  = zero_scale_group_size

quant_config = {}
# Attention
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
# Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

# Quantize
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
model.eval()
```
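After reproducing the quantization, you may want to store the result locally so it does not have to be recomputed. The following is a minimal sketch, assuming the `save_quantized`/`from_quantized` helpers of the `hqq.engine.hf` wrapper used above are available in your hqq version; the save directory name is just a placeholder:
``` Python
# Save the quantized weights to a local folder (placeholder path)
save_dir = 'Mixtral-8x7B-attn-4bit-moe-3bit-HQQ-local'
model.save_quantized(save_dir)

# Reload later directly from the quantized checkpoint, without touching the full-precision weights
model = HQQModelForCausalLM.from_quantized(save_dir)
```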