Devstral-Small-2-24B-Instruct — GLQ 4bpw
GLQ (E8 Lattice Quantization) compressed version of mistralai/Devstral-Small-2-24B-Instruct-2512 at 4 bits per weight.
| | Original (FP8) | GLQ 4bpw |
|---|---|---|
| Size | ~48 GB (bf16 equiv) | 20.5 GB |
| Bits/weight | 8 (FP8) | 4.0 |
| Avg SQNR | — | 22.34 dB |
| GPU VRAM | ~48 GB | ~22 GB |
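The SQNR row can be read with the usual definition of signal-to-quantization-noise ratio, 10·log10(‖W‖² / ‖W − Ŵ‖²). The sketch below is a minimal illustration of that formula only. How the reported average is taken across layers is not stated here, and the toy uniform quantizer is a stand-in, not GLQ.

```python
import numpy as np

def sqnr_db(original: np.ndarray, quantized: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in decibels."""
    original = original.astype(np.float64)
    noise = original - quantized.astype(np.float64)
    signal_power = float(np.sum(original ** 2))
    noise_power = float(np.sum(noise ** 2))
    if noise_power == 0.0:
        return float("inf")  # lossless reconstruction
    return float(10.0 * np.log10(signal_power / noise_power))

# Toy example: a random "weight matrix" and a uniform rounding quantizer
# with step 1/8. This is NOT the GLQ quantizer, only a placeholder.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
w_hat = np.round(w * 8) / 8
print(f"toy SQNR: {sqnr_db(w, w_hat):.2f} dB")
```

Every 10 dB corresponds to a tenfold reduction in noise power relative to signal power, so 20 dB means the quantization error carries about 1% of the weight energy.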
How GLQ works
GLQ uses the E8 lattice codebook (65,536 vectors in 8 dimensions) combined with:
- Randomized Hadamard Transform (RHT) for weight incoherence
- LDLQ (Lattice Decoding with LDL Quantization) for Hessian-aware rounding
- Two-stage RVQ for 3/4bpw: primary E8 codebook + secondary residual codebook
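To make the RHT item in the list above concrete, here is a minimal NumPy sketch of a randomized Hadamard transform: random sign flips followed by an orthonormal Hadamard matrix. The function names, the Sylvester construction, and the dense matrix multiply are illustrative assumptions; the glq package's own kernels are not shown here and likely use a fast transform instead.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # scale so that h @ h.T == I

def rht(x: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Randomized Hadamard transform: flip signs, then rotate."""
    return hadamard(x.shape[0]) @ (signs * x)

def inverse_rht(y: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Undo rht(); the transform is orthogonal, so the inverse is exact."""
    return signs * (hadamard(y.shape[0]).T @ y)

rng = np.random.default_rng(0)
n = 64
signs = rng.choice([-1.0, 1.0], size=n)

x = np.zeros(n)
x[3] = 8.0  # a single large outlier weight
y = rht(x, signs)

print("max |x| before:", np.abs(x).max())
print("max |y| after: ", np.abs(y).max())
print("norm preserved:", np.isclose(np.linalg.norm(x), np.linalg.norm(y)))
```

The outlier's energy is spread evenly over all 64 coordinates while the vector norm is unchanged, which is the "incoherence" property that makes a fixed codebook a better fit for the transformed weights.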
Each 8-weight block is encoded as a 16-bit index into the codebook, achieving exactly 2.0 bits per weight at the base level, or 4.0 bpw with residual quantization.
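The bit budget above can be checked with a few lines of arithmetic, and the residual stage can be illustrated with a toy two-stage vector quantizer. This is a sketch under assumptions: the secondary codebook sizes (65,536 entries for 4 bpw, 256 entries for 3 bpw) are inferred from the bit rates rather than stated in this card, and the random codebooks below stand in for the real E8 codebook.

```python
import math
import numpy as np

def bits_per_weight(codebook_sizes, block_size=8):
    """Index bits per weight when each block stores one index per codebook."""
    return sum(math.log2(size) for size in codebook_sizes) / block_size

print(bits_per_weight([65536]))         # 2.0  (one 16-bit index per 8 weights)
print(bits_per_weight([65536, 65536]))  # 4.0  (assumed 16-bit residual index)
print(bits_per_weight([65536, 256]))    # 3.0  (assumed 8-bit residual index)

def nearest(codebook: np.ndarray, v: np.ndarray) -> int:
    """Index of the codebook row closest to v in Euclidean distance."""
    return int(np.argmin(np.sum((codebook - v) ** 2, axis=1)))

def rvq_encode(v, primary, secondary):
    """Two-stage residual VQ: quantize v, then quantize what is left over."""
    i = nearest(primary, v)
    j = nearest(secondary, v - primary[i])
    return i, j

def rvq_decode(i, j, primary, secondary):
    return primary[i] + secondary[j]

# Toy random codebooks (hypothetical; NOT the E8 lattice codebook).
rng = np.random.default_rng(0)
primary = rng.standard_normal((256, 8))
secondary = 0.3 * rng.standard_normal((256, 8))
secondary[0] = 0.0  # a zero entry guarantees stage two never hurts

v = rng.standard_normal(8)
i, j = rvq_encode(v, primary, secondary)
err_one = float(np.linalg.norm(v - primary[i]))
err_two = float(np.linalg.norm(v - rvq_decode(i, j, primary, secondary)))
print(f"one-stage error {err_one:.3f}, two-stage error {err_two:.3f}")
```

Because the secondary codebook only has to cover the residual, it can use a much smaller scale than the primary one, which is why a second index of the same width buys a large accuracy gain.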
Installation
This model requires the glq runtime package (it supplies the E8 codebook, the HF Transformers integration, and the fused CUDA kernels):
pip install "glq>=0.2.8"
glq also registers a "glq" quantization method with both HuggingFace Transformers and (via entry_points) vLLM, so no separate plugin install is needed.
A CUDA GPU and the NVIDIA toolchain (nvcc + ninja) are required on first import because the GLQ CUDA kernels are JIT-compiled via torch.utils.cpp_extension. Devstral-24B at 4bpw uses ~22 GB of GPU memory (bf16 would need ~48 GB), so it fits on a single L40S (48 GB) or A100 40 GB.
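Before the first import, it can help to confirm that the build tools are visible. The snippet below is a rough pre-flight check, not part of glq: the helper name is made up for illustration, and it only inspects PATH, whereas PyTorch can also locate nvcc through CUDA_HOME, so a "missing" result is a hint rather than proof.

```python
import shutil

def missing_build_tools(tools=("nvcc", "ninja")):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_build_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("nvcc and ninja found on PATH")

# Optional: report whether PyTorch can see a CUDA device.
try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed in this environment")
```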
Usage with HuggingFace Transformers
transformers 5.x auto-routes Mistral/Devstral models through mistral_common, which rejects the standard tokenizer.json shipped in this repo. Use PreTrainedTokenizerFast directly:
import glq.hf_integration # registers GLQ quant method with transformers
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast
model_id = "xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw"
path = snapshot_download(model_id)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{path}/tokenizer.json")
tokenizer.pad_token = "<pad>"
tokenizer.eos_token = "</s>"
tokenizer.bos_token = "<s>"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    dtype=torch.float16,
)
inputs = tokenizer("Write a Python function that computes the Fibonacci sequence", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
The import glq.hf_integration line registers "glq" as a quantization method with HuggingFace Transformers so that from_pretrained reads quantization_config.quant_method = "glq" from config.json, swaps every nn.Linear for E8RHTLinear, and wires up the fused CUDA path automatically.
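If loading fails with an unknown quantization method, one quick sanity check is to read the field that from_pretrained dispatches on. The helper below is a hypothetical convenience, not a glq or Transformers API; it assumes only the quantization_config.quant_method key named above and makes no claim about other fields in the config.

```python
import json

def quant_method_of(config: dict):
    """Return quantization_config.quant_method, or None if it is absent."""
    return (config.get("quantization_config") or {}).get("quant_method")

# Stand-in for the parsed config.json. With a local snapshot you would
# instead load it with: json.load(open(f"{path}/config.json"))
example_config = json.loads('{"quantization_config": {"quant_method": "glq"}}')
print(quant_method_of(example_config))  # glq
```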
For a ready-to-run version with the tokenizer fallback handled automatically, see examples/inference_hf.py in the GLQ repo:
python examples/inference_hf.py \
--model xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw \
--prompt "Write a Python function that computes the Fibonacci sequence" \
--max-tokens 100
Usage with vLLM
import glq_vllm # or just: import glq (registers vLLM plugin via entry_points)
from vllm import LLM, SamplingParams
llm = LLM(
    model="xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw",
    tokenizer="mistralai/Devstral-Small-2-24B-Instruct-2512",
    quantization="glq",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=4096,
    enforce_eager=True,
)
sp = SamplingParams(max_tokens=200, temperature=0.7)
output = llm.generate(["Write a Python fibonacci function:"], sp)
print(output[0].outputs[0].text)
Quantization details
- Base model: mistralai/Devstral-Small-2-24B-Instruct-2512 (FP8 weights, dequantized to bf16 during quantization)
- Method: E8 Shell codebook + RHT + LDLQ, 4bpw (two-stage RVQ)
- Calibration: 128 samples × 2048 tokens from WikiText-2
- Layers: 40 transformer layers, 280 sublayers quantized
- Time: ~31 minutes on NVIDIA L40S (streaming mode)
- Architecture: Ministral3 (text backbone of Mistral3 multimodal)
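The calibration entry in the list above works out to 128 × 2048 = 262,144 tokens. As a rough sketch of what such a calibration set looks like, the function below cuts a flat token stream into fixed-length samples. Whether GLQ uses contiguous chunks or random windows of WikiText-2 is not stated here, so the contiguous split is an assumption, as is the function name.

```python
def chunk_tokens(token_ids, n_samples=128, seq_len=2048):
    """Split a flat token list into n_samples contiguous chunks of seq_len."""
    needed = n_samples * seq_len
    if len(token_ids) < needed:
        raise ValueError(f"need {needed} tokens, got {len(token_ids)}")
    return [token_ids[k * seq_len:(k + 1) * seq_len] for k in range(n_samples)]

# Dummy token ids stand in for a tokenized corpus.
tokens = list(range(128 * 2048))
samples = chunk_tokens(tokens)
print(len(samples), len(samples[0]), len(tokens))  # 128 2048 262144
```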
License
Apache 2.0 — same as the base model.
Model tree for xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw
Base model
mistralai/Mistral-Small-3.1-24B-Base-2503