license: apache-2.0
tags:
  - moe
train: false
inference: false
pipeline_tag: text-generation

Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ

This is a version of the Mixtral-8x7B-Instruct-v0.1 model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) quantized with a mix of 4-bit and 2-bit precision using Half-Quadratic Quantization (HQQ).

More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit. This model should perform considerably better than the all-2-bit model, at the cost of a slight increase in model size (18.2 GB vs. 18 GB).
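A rough parameter count helps explain why the size penalty is so small. The sketch below assumes the layer dimensions from the public Mixtral-8x7B configuration (hidden size 4096, expert intermediate size 14336, 32 layers, 8 experts per layer, 32 query heads and 8 key/value heads of dimension 128). None of these numbers appear in this card, so treat them as assumptions.

```python
# Assumed Mixtral-8x7B dimensions (from the public model config, not from this card).
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
EXPERTS = 8
Q_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128

def attention_params_per_layer() -> int:
    """Weights in q_proj, k_proj, v_proj and o_proj of one layer."""
    q = HIDDEN * Q_HEADS * HEAD_DIM
    k = HIDDEN * KV_HEADS * HEAD_DIM
    v = HIDDEN * KV_HEADS * HEAD_DIM
    o = Q_HEADS * HEAD_DIM * HIDDEN
    return q + k + v + o

def expert_params_per_layer() -> int:
    """Weights in w1, w2 and w3 across all experts of one layer."""
    return EXPERTS * 3 * HIDDEN * INTERMEDIATE

attn_total = LAYERS * attention_params_per_layer()
expert_total = LAYERS * expert_params_per_layer()
attn_share = attn_total / (attn_total + expert_total)

print(f"attention weights: {attn_total / 1e9:.2f}B")
print(f"expert weights:    {expert_total / 1e9:.2f}B")
print(f"attention share:   {attn_share:.1%}")
```

Under these assumptions the attention projections hold roughly 3% of the quantized linear weights, so doubling their bit width adds little to the total size.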

This idea was suggested by Artem Eliseev (@lavawolfiee) and Denis Mazur (@dvmazur) in this GitHub discussion.

Basic Usage

To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ'
# Load the tokenizer and the pre-quantized model
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)
# Optional: switch to the compiled PyTorch backend
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
# Text generation; the prompt already contains <s>, so special tokens are not added again
prompt = "<s> [INST] How do I build a car? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**(inputs.to('cuda')), max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

Design the Car:
- Determine the type of car you want to build (e.g., sedan, SUV, sports car) and its specifications (e.g., size, weight, horsepower, fuel efficiency).
- Create detailed sketches and 3D models of the car's exterior and interior.
- Design the car's technical components, including the engine, transmission, brakes, and suspension system.
Acquire Necessary Materials and Parts:
- Purchase or manufacture the necessary materials, such as steel, aluminum, and plastics.
- Obtain or manufacture the required parts, such as the engine, transmission, brakes, suspension system, and electrical components.
Set Up a Production Facility:
- Establish a manufacturing facility with the necessary equipment, such as assembly lines, paint booths, and welding machines.
- Hire a skilled workforce to oversee production and ensure quality control.
Manufacture the Car:
- Follow the design specifications to assemble the car's components.
- Perform rigorous testing to ensure the car meets safety and performance standards.
Market and Sell the Car:
- Develop a marketing strategy to promote the car to potential buyers.
- Establish a distribution network to sell the car through dealerships or online platforms.
Provide After-Sales Support:
- Offer maintenance and repair services to ensure customer satisfaction and loyalty.
- Continuously improve the car's design and performance based on customer feedback and market trends.

Please note, building a car requires significant expertise, resources, and adherence to strict safety and regulatory standards. It is not a project that can be undertaken without extensive knowledge and experience in automotive engineering, manufacturing, and business management.
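The prompt in the usage example wraps the user message in [INST] ... [/INST] after the <s> beginning-of-sequence token. As a minimal sketch, a helper like the hypothetical `build_prompt` below (not part of HQQ or transformers) reproduces that string for a single-turn request. Because the string already contains `<s>`, the tokenizer should be called with `add_special_tokens=False`, as in the usage example.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user message in the instruct format used in this card.

    Hypothetical helper for illustration; it only handles one user turn.
    """
    return f"<s> [INST] {user_message.strip()} [/INST] "

prompt = build_prompt("How do I build a car?")
print(prompt)
```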

Quantization

You can reproduce the model using the following quantization configs:

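from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model_id   = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# hf_auth and cache_path are not defined in the original snippet; the values below are placeholders
hf_auth    = None  # your Hugging Face access token, if required
cache_path = '.'   # directory used to cache the downloaded weights
model      = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)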

# Quantization parameters
from hqq.core.quantize import *
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
attn_params['scale_quant_params']['group_size'] = 256
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

quant_config = {}
# Attention projections: 4-bit
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
#Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

#Quantize
model.quantize_model(quant_config=quant_config)
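To get a feel for what these settings cost in storage, the following back-of-the-envelope sketch estimates the effective bits per weight for each config. It assumes that every group of `group_size` weights carries one scale and one zero-point, that `quant_scale=True` and `quant_zero=True` store each of them in 8 bits, and that the second-level metadata is small enough to ignore. These storage details are assumptions about HQQ rather than something stated in this card, and the function name `effective_bits` is hypothetical.

```python
def effective_bits(nbits: int, group_size: int, meta_bits: int = 8) -> float:
    """Approximate bits per weight: payload plus one scale and one zero-point per group.

    meta_bits=8 is an assumption about how the quantized scale/zero are stored.
    """
    return nbits + 2 * meta_bits / group_size

attn_bits = effective_bits(nbits=4, group_size=64)
expert_bits = effective_bits(nbits=2, group_size=16)

print(f"attention: ~{attn_bits:.2f} bits/weight")
print(f"experts:   ~{expert_bits:.2f} bits/weight")
```

Under these assumptions, the small expert group size of 16 means the per-group metadata adds about one extra bit per weight, which is the price of finer-grained quantization at 2 bits.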