Devstral-Small-2-24B-Instruct — GLQ 4bpw

GLQ (E8 Lattice Quantization) compressed version of mistralai/Devstral-Small-2-24B-Instruct-2512 at 4 bits per weight.

Metric         Original (FP8)          GLQ 4bpw
Size           ~48 GB (bf16 equiv.)    20.5 GB
Bits/weight    8 (FP8)                 4.0
Avg SQNR       -                       22.34 dB
GPU VRAM       ~48 GB                  ~22 GB
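
For readers unfamiliar with the metric, SQNR (signal-to-quantization-noise ratio) is conventionally defined as 10·log10(‖W‖² / ‖W − Ŵ‖²), where W is the original weight tensor and Ŵ its dequantized reconstruction. The sketch below only illustrates that textbook definition. The function name sqnr_db is hypothetical, and how GLQ averages the figure across layers is not specified in this card.

```python
import math

def sqnr_db(original, quantized):
    """Signal-to-quantization-noise ratio in dB for two equal-length sequences."""
    if len(original) != len(quantized):
        raise ValueError("inputs must have the same length")
    signal = sum(w * w for w in original)                        # ||W||^2
    noise = sum((w - q) ** 2 for w, q in zip(original, quantized))  # ||W - W_hat||^2
    if noise == 0.0:
        return math.inf  # lossless reconstruction
    return 10.0 * math.log10(signal / noise)
```

Higher is better: each additional 10 dB corresponds to a tenfold reduction in noise power relative to the signal.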

How GLQ works

GLQ uses the E8 lattice codebook (65,536 vectors in 8 dimensions) combined with:

  • Randomized Hadamard Transform (RHT) for weight incoherence
  • LDLQ (Lattice Decoding with LDL Quantization) for Hessian-aware rounding
  • Two-stage RVQ for 3/4bpw: primary E8 codebook + secondary residual codebook
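
To give an intuition for the RHT step, here is a minimal pure-Python sketch of one common construction: random sign flips followed by an orthonormal Walsh-Hadamard transform. Whether the GLQ kernels use exactly this form is an assumption, and the helper names (fwht, random_signs, rht, inverse_rht) are hypothetical and not part of the glq API.

```python
import math
import random

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def random_signs(n, seed):
    """Reproducible vector of +1/-1 entries."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def rht(values, signs):
    """Forward transform: flip signs, then apply the Hadamard transform."""
    return fwht([v * s for v, s in zip(values, signs)])

def inverse_rht(values, signs):
    """Inverse: the normalized Hadamard matrix is its own inverse, then undo the signs."""
    return [v * s for v, s in zip(fwht(values), signs)]

# A single large outlier is spread evenly across all eight coordinates.
weights = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
signs = random_signs(len(weights), seed=0)
rotated = rht(weights, signs)
restored = inverse_rht(rotated, signs)
```

Because the transform is orthogonal, it preserves the vector norm and can be undone exactly at inference time. What changes is that no single coordinate dominates, which is what "incoherence" refers to here.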

Each 8-weight block is encoded as a 16-bit index into the codebook, achieving exactly 2.0 bits per weight at the base level, or 4.0 bpw with residual quantization.
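
The bit budget follows directly from the codebook size and block length. The short sketch below works through the arithmetic. It assumes that the 4bpw residual stage also stores a 16-bit index per block and that the 3bpw variant uses an 8-bit residual index. Both assumptions are consistent with the totals above but are not confirmed by this card.

```python
import math

CODEBOOK_SIZE = 65_536  # E8 codebook entries
BLOCK_SIZE = 8          # weights encoded per codebook index

def bits_per_weight(index_bits_per_stage):
    """Total index bits across all RVQ stages, divided by the block length."""
    return sum(index_bits_per_stage) / BLOCK_SIZE

primary_bits = int(math.log2(CODEBOOK_SIZE))         # 16-bit index
base_bpw = bits_per_weight([primary_bits])           # primary stage only
# Assumption: the 4bpw residual stage also uses a 16-bit index.
two_stage_bpw = bits_per_weight([primary_bits, 16])
# Assumption: the 3bpw variant uses an 8-bit residual index.
three_bpw = bits_per_weight([primary_bits, 8])

print(base_bpw, three_bpw, two_stage_bpw)
```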

Installation

This model requires the glq runtime package (it supplies the E8 codebook, the HF Transformers integration, and the fused CUDA kernels):

pip install "glq>=0.2.8"

glq also registers a "glq" quantization method with both HuggingFace Transformers and (via entry_points) vLLM, so no separate plugin install is needed.

A CUDA GPU and the NVIDIA toolchain (nvcc and ninja) are required because the GLQ CUDA kernels are JIT-compiled via torch.utils.cpp_extension on first import. Devstral-24B at 4bpw uses ~22 GB of GPU memory (bf16 would need ~48 GB), so it fits on a single L40S (48 GB) or A100 40 GB.
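
A quick preflight check for the build tools can save a failed first import. The sketch below only looks for the executables on PATH. The helper name missing_build_tools is hypothetical, and PyTorch may still locate nvcc through CUDA_HOME even when it is not on PATH, so treat a miss as a hint, not a verdict.

```python
import shutil

def missing_build_tools(tools=("nvcc", "ninja")):
    """Return the subset of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_build_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("nvcc and ninja found")
```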

Usage with HuggingFace Transformers

transformers 5.x auto-routes Mistral/Devstral models through mistral_common, which rejects the standard tokenizer.json shipped in this repo. Use PreTrainedTokenizerFast directly:

import glq.hf_integration  # registers GLQ quant method with transformers
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model_id = "xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw"
path = snapshot_download(model_id)

# Loading from tokenizer_file alone does not set the special tokens, so assign them explicitly.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{path}/tokenizer.json")
tokenizer.pad_token = "<pad>"
tokenizer.eos_token = "</s>"
tokenizer.bos_token = "<s>"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    dtype=torch.float16,
)

inputs = tokenizer("Write a Python function that computes the Fibonacci sequence", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The import glq.hf_integration line registers "glq" as a quantization method with HuggingFace Transformers so that from_pretrained reads quantization_config.quant_method = "glq" from config.json, swaps every nn.Linear for E8RHTLinear, and wires up the fused CUDA path automatically.
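
To confirm that a downloaded snapshot is actually tagged for GLQ before loading it, the config can be inspected directly. This is a minimal sketch: the keys quantization_config and quant_method are the ones named above, while the helper names quant_method_from_config and read_quant_method are hypothetical.

```python
import json
from pathlib import Path

def quant_method_from_config(config):
    """Return quantization_config.quant_method from a parsed config, or None."""
    return config.get("quantization_config", {}).get("quant_method")

def read_quant_method(config_path):
    """Load a config.json file and return its quant_method, or None."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return quant_method_from_config(config)
```

With the snapshot path from the example above, read_quant_method(f"{path}/config.json") should return "glq" for this repo.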

For a ready-to-run version with the tokenizer fallback handled automatically, see examples/inference_hf.py in the GLQ repo:

python examples/inference_hf.py \
    --model xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw \
    --prompt "Write a Python function that computes the Fibonacci sequence" \
    --max-tokens 100

Usage with vLLM

import glq_vllm  # or just: import glq  (registers vLLM plugin via entry_points)
from vllm import LLM, SamplingParams

llm = LLM(
    model="xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw",
    tokenizer="mistralai/Devstral-Small-2-24B-Instruct-2512",
    quantization="glq",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=4096,
    enforce_eager=True,
)

sp = SamplingParams(max_tokens=200, temperature=0.7)
output = llm.generate(["Write a Python fibonacci function:"], sp)
print(output[0].outputs[0].text)

Quantization details

  • Base model: mistralai/Devstral-Small-2-24B-Instruct-2512 (FP8 weights, dequantized to bf16 during quantization)
  • Method: E8 Shell codebook + RHT + LDLQ, 4bpw (two-stage RVQ)
  • Calibration: 128 samples × 2048 tokens from WikiText-2
  • Layers: 40 transformer layers, 280 sublayers quantized
  • Time: ~31 minutes on NVIDIA L40S (streaming mode)
  • Architecture: Ministral3 (text backbone of Mistral3 multimodal)
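
The figures above can be cross-checked with a few lines of arithmetic. Reading 7 sublayers per layer as the four attention projections plus the three MLP projections is an assumption. The card only states the totals.

```python
num_layers = 40
quantized_sublayers = 280
calib_samples = 128
calib_seq_len = 2048

# 280 / 40 = 7 quantized linear sublayers per transformer layer.
# Assumption: q, k, v, o projections plus gate, up, down projections.
sublayers_per_layer = quantized_sublayers // num_layers

# Total calibration tokens seen when collecting Hessian statistics.
calib_tokens = calib_samples * calib_seq_len

print(sublayers_per_layer, calib_tokens)
```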

License

Apache 2.0 — same as the base model.
