	---
	license: apache-2.0
	tags:
	- moe
	train: false
	inference: false
	pipeline_tag: text-generation
	---
	## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ
	This is a version of the
	<a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct-v0.1 model</a>, quantized with a mix of 4-bit and 3-bit precision via Half-Quadratic Quantization (HQQ).

	More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 3-bit.

	Unlike the <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ">2bitgs8 model</a>, which was designed to use less GPU memory, this one uses about 22 GB of VRAM. It is aimed at users who want better quality and are willing to use most of the memory available on a 24 GB GPU.
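
As a rough sanity check on the 22 GB figure, the sketch below estimates the size of the quantized weights alone. The layer shapes are an assumption taken from the published Mixtral-8x7B configuration, not from this card, and the estimate ignores quantization metadata, packing overhead, embeddings, activations and the KV cache.

```python
# Rough weight-only memory estimate for the 4-bit attention / 3-bit expert split.
# Shapes are assumed from the published Mixtral-8x7B config, not from this card.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of each expert MLP
LAYERS = 32            # num_hidden_layers
EXPERTS = 8            # num_local_experts
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)

# q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x KV_DIM
attn_weights = LAYERS * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM)
# each expert has w1, w2 and w3, all HIDDEN x INTERMEDIATE up to transposition
expert_weights = LAYERS * EXPERTS * 3 * HIDDEN * INTERMEDIATE


def packed_gb(n_weights, nbits):
    """Size in GB (10**9 bytes) of n_weights values stored at nbits each."""
    return n_weights * nbits / 8 / 1e9


attn_gb = packed_gb(attn_weights, 4)
experts_gb = packed_gb(expert_weights, 3)
print(f"attention: {attn_gb:.2f} GB, experts: {experts_gb:.2f} GB, "
      f"total: {attn_gb + experts_gb:.2f} GB")
```

Under these assumptions the experts account for nearly all of the quantized weights, which is why lowering their precision has the largest effect on VRAM use.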

	![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)

	----------------------------------------------------------------------------------------------------------------------------------

	## Performance
	| Models | Mixtral Original | HQQ quantized |
	|---------------------|------------------|------------------|
	| Runtime VRAM | 94 GB | <b>22.3 GB</b> |
	| ARC (25-shot) | 70.22 | 69.62 |
	| Hellaswag (10-shot) | 87.63 | |
	| MMLU (5-shot) | 71.16 | |
	| TruthfulQA-MC2 | 64.58 | 62.63 |
	| Winogrande (5-shot) | 81.37 | 81.06 |
	| GSM8K (5-shot) | 60.73 | |
	| Average | 72.62 | |
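
To make the trade-off in the table easier to read, the following sketch computes the score change for the benchmarks that report both values, along with the VRAM reduction factor. The numbers are copied from the table. Rows without a quantized score are left out rather than guessed.

```python
# Scores copied from the table above: (Mixtral original, HQQ quantized).
# Only benchmarks with both values reported are included.
scores = {
    "ARC (25-shot)": (70.22, 69.62),
    "TruthfulQA-MC2": (64.58, 62.63),
    "Winogrande (5-shot)": (81.37, 81.06),
}

deltas = {name: quantized - original for name, (original, quantized) in scores.items()}
for name, delta in deltas.items():
    original = scores[name][0]
    print(f"{name}: {delta:+.2f} points ({100 * delta / original:+.2f}%)")

vram_original_gb, vram_quantized_gb = 94.0, 22.3
vram_ratio = vram_original_gb / vram_quantized_gb
print(f"VRAM reduction: {vram_ratio:.1f}x")
```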

	### Basic Usage
	To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
	``` Python
	import transformers
	from threading import Thread

	model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'
	#Load the model
	from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = HQQModelForCausalLM.from_quantized(model_id)

	#Optional: set backend/compile
	#The CUDA kernels must be installed beforehand:
	# git clone https://github.com/mobiusml/hqq/
	# cd hqq/kernels && python setup_cuda.py install
	from hqq.core.quantize import *
	HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)


	def chat_processor(chat, max_new_tokens=100, do_sample=True):
	    """Stream the reply to a single-turn prompt and return the text chunks."""
	    tokenizer.use_default_system_prompt = False
	    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

	    #Tokenized prompt (input_ids, attention_mask) merged with the generation settings
	    generate_params = dict(
	        tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
	        streamer=streamer,
	        max_new_tokens=max_new_tokens,
	        do_sample=do_sample,
	        top_p=0.90,
	        top_k=50,
	        temperature=0.6,
	        num_beams=1,
	        repetition_penalty=1.2,
	    )

	    #Run generation in a background thread so the streamer can be consumed here
	    t = Thread(target=model.generate, kwargs=generate_params)
	    t.start()
	    outputs = []
	    for text in streamer:
	        outputs.append(text)
	        print(text, end="", flush=True)

	    return outputs

	################################################################################################
	#Generation
	outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
	```
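
The `chat_processor` function above wraps a single user message in `[INST]` tags. For conversations with several turns, a small helper along the following lines could assemble the prompt. `build_prompt` is a hypothetical name, not part of HQQ or transformers, and the multi-turn layout is an assumption based on Mistral's published instruction format.

```python
def build_prompt(turns):
    """Build an instruct prompt from (user, assistant) message pairs.

    Use None as the assistant reply in the last pair to request a new answer.
    """
    prompt = "<s>"
    for user, assistant in turns:
        prompt += f" [INST] {user} [/INST] "
        if assistant is not None:
            # A completed assistant turn is closed with the end-of-sequence token
            prompt += f"{assistant}</s>"
    return prompt


# A single open turn reproduces the string built inside chat_processor
print(repr(build_prompt([("How do I build a car?", None)])))

# A two-turn conversation ending with an open user request
print(repr(build_prompt([("Hi", "Hello!"), ("How do I build a car?", None)])))
```

The resulting string would be passed to the tokenizer in place of the hard-coded concatenation in `chat_processor`.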


	### Quantization

	You can reproduce the model using the following quant configs:

	``` Python
	import torch
	from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

	#Placeholders: set these for your environment
	hf_auth = None    #Hugging Face access token, if the base model requires one
	cache_path = None #local cache directory (None uses the default)

	model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
	model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

	#Quantize params: 4-bit attention, 3-bit experts, metadata offloaded
	from hqq.core.quantize import *
	attn_params = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
	experts_params = BaseQuantizeConfig(nbits=3, group_size=64, offload_meta=True)

	#Group size used when quantizing the scales and zero-points themselves
	zero_scale_group_size = 128
	attn_params['scale_quant_params']['group_size'] = zero_scale_group_size
	attn_params['zero_quant_params']['group_size'] = zero_scale_group_size
	experts_params['scale_quant_params']['group_size'] = zero_scale_group_size
	experts_params['zero_quant_params']['group_size'] = zero_scale_group_size

	quant_config = {}
	#Attention
	quant_config['self_attn.q_proj'] = attn_params
	quant_config['self_attn.k_proj'] = attn_params
	quant_config['self_attn.v_proj'] = attn_params
	quant_config['self_attn.o_proj'] = attn_params
	#Experts
	quant_config['block_sparse_moe.experts.w1'] = experts_params
	quant_config['block_sparse_moe.experts.w2'] = experts_params
	quant_config['block_sparse_moe.experts.w3'] = experts_params

	#Quantize
	model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
	model.eval()
	```
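
To illustrate what `nbits` and `group_size` control in the configs above, here is a minimal pure-Python sketch of group-wise affine quantization. It uses plain min-max rounding as a stand-in and is not HQQ's actual algorithm, which additionally optimizes the quantization parameters. All function names here are hypothetical.

```python
import random


def quantize_group(weights, nbits):
    """Min-max affine quantization of one group to integers in [0, 2**nbits - 1]."""
    levels = 2 ** nbits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for constant groups
    zero = -w_min / scale
    q = [min(levels, max(0, round(w / scale + zero))) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate floating-point weights."""
    return [(v - zero) * scale for v in q]


def quantize(weights, nbits, group_size):
    """Split weights into groups; each group gets its own scale and zero-point."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, nbits) for g in groups]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]
for nbits in (4, 3):
    packed = quantize(weights, nbits, group_size=64)
    restored = [w for q, s, z in packed for w in dequantize_group(q, s, z)]
    mean_abs_err = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
    print(f"{nbits}-bit, group_size=64: mean abs error = {mean_abs_err:.4f}")
```

A smaller `group_size` gives each scale and zero-point fewer weights to cover, which lowers the reconstruction error at the cost of more metadata. The `offload_meta=True` setting in the configs above relates to where that metadata is stored.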