mobiuslabsgmbh/Llama-3.1-70b-instruct_4bitgs64_hqq

This is the Llama3.1-70B-Instruct model quantized with HQQ, with all linear layers stored in 4-bit using a group size of 64.
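To make "4-bit, group-size 64" concrete, here is a minimal sketch of group-wise affine quantization in plain Python. It uses simple min/max round-to-nearest, which is not HQQ's actual solver (HQQ optimizes the quantization parameters rather than taking them straight from min/max). The function names and the dequantization form `(code - zero) * scale` are assumptions chosen for illustration.

```python
import random

def quantize_group(values, nbits=4):
    """Round-to-nearest affine quantization of one group (illustrative, not HQQ's solver)."""
    lo, hi = min(values), max(values)
    levels = 2 ** nbits - 1                      # 15 for 4-bit codes
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = -lo / scale                           # maps lo to code 0
    codes = [min(levels, max(0, round(v / scale + zero))) for v in values]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Reconstruct approximate weights from integer codes."""
    return [(q - zero) * scale for q in codes]

def quantize_row(row, group_size=64, nbits=4):
    """Split a row into groups of `group_size`, each with its own scale and zero."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g, nbits) for g in groups]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]   # hypothetical weight row
packed = quantize_row(row)
restored = [w for codes, scale, zero in packed
            for w in dequantize_group(codes, scale, zero)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(packed)} groups, max reconstruction error {max_err:.6f}")
```

Each group of 64 weights carries its own scale and zero-point, so a smaller group size lowers the reconstruction error at the cost of more metadata per weight.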

Model Size

Models	fp16	HQQ 4-bit/gs-64
Bitrate (Linear layers)	16	4.5
VRAM (GB)	140	42.7
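The 4.5 bitrate can be reproduced with a short calculation, sketched below under two assumptions: each group of 64 weights stores one 16-bit scale and one 16-bit zero-point (consistent with `quant_scale=False, quant_zero=False` in the usage code), and the model has roughly 70e9 parameters. The helper names are hypothetical.

```python
def bits_per_weight(nbits, group_size, scale_bits=16, zero_bits=16):
    """Effective bits per weight: the codes plus per-group metadata (assumed 16-bit each)."""
    return nbits + (scale_bits + zero_bits) / group_size

def weight_memory_gb(num_params, bits):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

NUM_PARAMS = 70e9  # assumption: round parameter count, not taken from the table

hqq_bits = bits_per_weight(nbits=4, group_size=64)
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
hqq_gb = weight_memory_gb(NUM_PARAMS, hqq_bits)

print(f"HQQ bitrate: {hqq_bits} bits/weight")
print(f"fp16 weights: {fp16_gb:.1f} GB, 4.5-bit weights: {hqq_gb:.1f} GB")
```

The weights-only estimate for 4.5 bits comes out below the 42.7 GB reported in the table. The difference is presumably tensors that are not 4-bit linear weights plus runtime overhead, though that attribution is a guess rather than something the table states.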

Model Decoding Speed

Models	fp16	HQQ 4-bit/gs-64
Decoding - short seq (tokens/sec)	10.5**	23*
Decoding - long seq (tokens/sec)	9.5**	19*

**: 2xA100 80GB
*: 1xA100 80GB

Performance

Models	fp16	HQQ 4-bit/gs-64
ARC (25-shot)	70.31	70.22
HellaSwag (10-shot)	86.40	86.39
MMLU (5-shot)	81.84	81.04
TruthfulQA-MC2	59.83	60.39
Winogrande (5-shot)	84.85	84.53
GSM8K (5-shot)	88.25	89.92
Average	78.58	78.75
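The Average row is consistent with an unweighted mean of the six benchmark scores. The following sketch recomputes it from the values in the table; the dictionary layout is only for illustration.

```python
# (fp16, HQQ 4-bit/gs-64) scores copied from the table above
scores = {
    "ARC (25-shot)": (70.31, 70.22),
    "HellaSwag (10-shot)": (86.40, 86.39),
    "MMLU (5-shot)": (81.84, 81.04),
    "TruthfulQA-MC2": (59.83, 60.39),
    "Winogrande (5-shot)": (84.85, 84.53),
    "GSM8K (5-shot)": (88.25, 89.92),
}

# Unweighted mean per column, rounded to two decimals like the table
fp16_avg = round(sum(fp16 for fp16, _ in scores.values()) / len(scores), 2)
hqq_avg = round(sum(hqq for _, hqq in scores.values()) / len(scores), 2)

print(f"fp16 average: {fp16_avg}, HQQ average: {hqq_avg}")
```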

To reproduce the results above, install lm-eval 0.4.3 via pip install lm-eval==0.4.3

Usage

First, install the dependencies:

pip install git+https://github.com/mobiusml/hqq.git #master branch fix
pip install bitblas

Also, make sure you use at least torch 2.4.0 or the nightly build.
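A quick way to check the torch requirement before loading the model is sketched below. The helper names are hypothetical. The check compares only the leading `major.minor.patch` numbers, so nightly version strings such as `2.5.0.dev20240801+cu121` are treated as their base release.

```python
import re

MIN_TORCH = (2, 4, 0)  # minimum version stated above

def parse_release(version_string):
    """Return (major, minor, patch) from strings like '2.5.0.dev20240801+cu121'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())

def torch_is_recent_enough(version_string, minimum=MIN_TORCH):
    """True if the release part of the version is at least `minimum`."""
    return parse_release(version_string) >= minimum

try:
    import torch
except ImportError:
    print("torch is not installed")
else:
    if torch_is_recent_enough(torch.__version__):
        print(f"torch {torch.__version__} meets the requirement")
    else:
        print(f"torch {torch.__version__} is older than 2.4.0; please upgrade")
```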

Then you can use the sample code below:

import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference, patch_linearlayers, patch_add_quant_config
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3.1-70b-instruct_4bitgs64_hqq'

compute_dtype = torch.bfloat16 #bfloat16 for torchao, float16 for bitblas
cache_dir = '.'
model     = AutoHQQHFModel.from_quantized(model_id, cache_dir=cache_dir, compute_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1)
patch_linearlayers(model, patch_add_quant_config, quant_config)

#Use optimized inference kernels
###################################################
HQQLinear.set_backend(HQQBackend.PYTORCH)
#prepare_for_inference(model) #default backend
prepare_for_inference(model, backend="torchao_int4") 
#prepare_for_inference(model, backend="bitblas") #takes a while to init...

#Generate
###################################################
#For longer context, make sure to allocate enough cache via the cache_size= parameter 
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Warm-up takes a while

gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)
