README.md · mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ at cbbc082fb760cb4e557b1722c3cc2a68a5280dfc

metadata

license: apache-2.0
tags:
  - moe
train: false
inference: false
pipeline_tag: text-generation

Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ

This is a version of the Mixtral-8x7B-Instruct-v0.1 model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ).

More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.

The difference between this model and https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!

Performance

Models	Mixtral Original	HQQ quantized
Runtime VRAM	90 GB	13 GB
ARC (25-shot)	70.22	66.47
Hellaswag (10-shot)	87.63	84.78
MMLU (5-shot)	71.16	67.35
TruthfulQA-MC2	64.58	62.85
Winogrande (5-shot)	81.37	79.40
GSM8K (5-shot)	60.73	45.86
Average	72.62	67.79

Screencast

Here is a small screencast of the model running on RTX 4090

Basic Usage

To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:

import transformers 
from threading import Thread

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'
#Load the model
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

#Optional: set backend/compile
#You will need to install CUDA kernels apriori
# git clone https://github.com/mobiusml/hqq/
# cd hqq/kernels && python setup_cuda.py install
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)


def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=0.90,
        top_k=50,
        temperature= 0.6,
        num_beams=1,
        repetition_penalty=1.2,
    )

    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()
    outputs = []
    for text in streamer:
        outputs.append(text)
        print(text, end="", flush=True)

    return outputs

################################################################################################
#Generation
outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)

Quantization

You can reproduce the model using the following quant configs:

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id  = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model     = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

#Quantize params
from hqq.core.quantize import *
attn_prams     = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True) 
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True) 
attn_prams['scale_quant_params']['group_size'] = 256
attn_prams['zero_quant_params']['group_size']  = 256

quant_config = {}
#Attention
quant_config['self_attn.q_proj'] = attn_prams
quant_config['self_attn.k_proj'] = attn_prams
quant_config['self_attn.v_proj'] = attn_prams
quant_config['self_attn.o_proj'] = attn_prams
#Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

#Quantize
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16);
model.eval();

The code in github at https://github.com/mobiusml/hqq/blob/master/examples/hf/mixtral_13GB_example.py