Devstral-Small-2-24B-Instruct — GLQ 4bpw

GLQ (E8 Lattice Quantization) compressed version of mistralai/Devstral-Small-2-24B-Instruct-2512 at 4 bits per weight.

| | Original (FP8) | GLQ 4bpw |
|---|---|---|
| Size | ~48 GB (bf16 equiv) | 20.5 GB |
| Bits/weight | 8 (FP8) | 4.0 |
| Avg SQNR | n/a | 22.34 dB |
| GPU VRAM | ~48 GB | ~22 GB |
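
For readers unfamiliar with the metric, SQNR (signal-to-quantization-noise ratio) is conventionally the ratio of signal power to quantization-error power, expressed in decibels. The sketch below applies that standard definition to flat lists of weights. How this card averages SQNR across layers is not stated, so the helper name `sqnr_db` and the per-tensor treatment are assumptions.

```python
import math

def sqnr_db(original, quantized):
    """SQNR in dB: 10 * log10(signal power / quantization-noise power)."""
    signal = sum(w * w for w in original)
    noise = sum((w - q) ** 2 for w, q in zip(original, quantized))
    if noise == 0.0:
        return math.inf  # lossless reconstruction
    return 10.0 * math.log10(signal / noise)

# Toy check: signal power 25, noise power 0.25, so the ratio is 100.
toy_db = sqnr_db([3.0, 4.0], [3.0, 4.5])

# Relative RMS error implied by a given SQNR value (here the table's 22.34 dB).
relative_rms_error = 10.0 ** (-22.34 / 20.0)

print(f"toy SQNR: {toy_db:.2f} dB")
print(f"relative RMS error at 22.34 dB: {relative_rms_error:.4f}")
```

Under this definition, 22.34 dB corresponds to a relative RMS error of roughly 7.6% per tensor.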

How GLQ works

GLQ uses the E8 lattice codebook (65,536 vectors in 8 dimensions) combined with:

Randomized Hadamard Transform (RHT) for weight incoherence
LDLQ (Lattice Decoding with LDL Quantization) for Hessian-aware rounding
Two-stage RVQ for 3/4bpw: primary E8 codebook + secondary residual codebook
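
To illustrate the first item, here is a minimal pure-Python sketch of one common form of the randomized Hadamard transform: flip signs at random, apply a Walsh-Hadamard transform, and scale by 1/sqrt(n). It is an assumption that GLQ uses this exact form; the function names `fwht`, `rht`, and `inverse_rht` are illustrative and are not part of the glq package.

```python
import math
import random

def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

def rht(values, signs):
    """Apply random sign flips, then an orthonormal Hadamard rotation."""
    n = len(values)
    flipped = [s * v for s, v in zip(signs, values)]
    return [y / math.sqrt(n) for y in fwht(flipped)]

def inverse_rht(values, signs):
    """Undo rht: the Hadamard matrix is symmetric, so rotate again and re-flip."""
    n = len(values)
    return [s * y / math.sqrt(n) for s, y in zip(signs, fwht(values))]

rng = random.Random(0)
n = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

weights = [0.0] * n
weights[3] = 4.0  # one large outlier, all other entries zero

rotated = rht(weights, signs)
restored = inverse_rht(rotated, signs)

print(max(abs(r) for r in rotated))  # outlier energy is spread evenly
```

The single outlier of magnitude 4 becomes sixteen entries of magnitude 1, with the vector norm unchanged. Flattening outliers this way is what makes a fixed codebook a better fit for the rotated weights.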

Each 8-weight block is encoded as a 16-bit index into the codebook, achieving exactly 2.0 bits per weight at the base level, or 4.0 bpw with residual quantization.
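
The bit accounting and the two-stage idea can be sketched as follows. The block size and index width come from the paragraph above. The codebooks here are small random stand-ins (256 entries rather than 65,536), not the E8 codebook, and every function name is hypothetical rather than taken from the glq package.

```python
import random

BLOCK_SIZE = 8    # weights per block (from the text above)
INDEX_BITS = 16   # bits per codebook index (from the text above)

def bits_per_weight(num_stages):
    """Each stage stores one INDEX_BITS-wide index per block."""
    return num_stages * INDEX_BITS / BLOCK_SIZE

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_index(codebook, vector):
    """Brute-force nearest codeword; real lattice decoders are far faster."""
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i], vector))

def rvq_encode(vector, primary, secondary):
    """Stage one quantizes the block, stage two quantizes what is left over."""
    i = nearest_index(primary, vector)
    residual = [v - c for v, c in zip(vector, primary[i])]
    j = nearest_index(secondary, residual)
    return i, j

def rvq_decode(i, j, primary, secondary):
    return [p + s for p, s in zip(primary[i], secondary[j])]

rng = random.Random(0)
# Toy random codebooks for illustration only (assumption, not the E8 codebook).
primary = [[rng.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)] for _ in range(256)]
# Including the zero vector guarantees stage two never increases the error.
secondary = [[0.0] * BLOCK_SIZE] + [
    [rng.gauss(0.0, 0.5) for _ in range(BLOCK_SIZE)] for _ in range(255)
]

block = [rng.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)]
i, j = rvq_encode(block, primary, secondary)
stage_one_error = squared_error(block, primary[i])
stage_two_error = squared_error(block, rvq_decode(i, j, primary, secondary))

print(bits_per_weight(1), bits_per_weight(2))
print(stage_one_error, stage_two_error)
```

One 16-bit index per 8-weight block gives 2.0 bpw, and a second 16-bit residual index doubles that to 4.0 bpw.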

Installation

This model requires the glq runtime package (it supplies the E8 codebook, the HF Transformers integration, and the fused CUDA kernels):

pip install "glq>=0.2.8"

glq also registers a "glq" quantization method with both HuggingFace Transformers and (via entry_points) vLLM, so no separate plugin install is needed.

A CUDA GPU, nvcc (from the CUDA toolkit), and ninja are required on first import because the GLQ CUDA kernels are JIT-compiled via torch.utils.cpp_extension. Devstral-24B at 4bpw uses ~22 GB of GPU memory (bf16 would need ~48 GB), so it fits on a single L40S (48 GB) or A100 40 GB.
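
Because the kernels are compiled on first import, it can help to confirm that the build tools are on PATH beforehand. The following is a small standard-library sketch; `missing_build_tools` is a hypothetical helper, not part of glq, and it only checks that the executables can be found, not that their versions are compatible.

```python
import shutil

def missing_build_tools(tools=("nvcc", "ninja")):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_build_tools()
if missing:
    print("Missing build tools:", ", ".join(missing))
else:
    print("nvcc and ninja found on PATH")
```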

Usage with HuggingFace Transformers

transformers 5.x auto-routes Mistral/Devstral models through mistral_common, which rejects the standard tokenizer.json shipped in this repo. Use PreTrainedTokenizerFast directly:

import glq.hf_integration  # registers GLQ quant method with transformers
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model_id = "xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw"
path = snapshot_download(model_id)

tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{path}/tokenizer.json")
tokenizer.pad_token = "<pad>"
tokenizer.eos_token = "</s>"
tokenizer.bos_token = "<s>"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    dtype=torch.float16,
)

inputs = tokenizer("Write a Python function that computes the Fibonacci sequence", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The import glq.hf_integration line registers "glq" as a quantization method with HuggingFace Transformers so that from_pretrained reads quantization_config.quant_method = "glq" from config.json, swaps every nn.Linear for E8RHTLinear, and wires up the fused CUDA path automatically.
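
To confirm that a downloaded checkpoint carries the marker described above, the relevant field can be read straight from config.json. This is a minimal sketch: only the key names quantization_config and quant_method come from the text, `read_quant_method` is a hypothetical helper, and the demo runs on a stand-in file rather than the real config.

```python
import json
import tempfile
from pathlib import Path

def read_quant_method(config_path):
    """Return quantization_config.quant_method from a config.json, or None."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return (config.get("quantization_config") or {}).get("quant_method")

# Demo on a stand-in config file; a real one lives in the downloaded snapshot.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"quantization_config": {"quant_method": "glq"}}),
        encoding="utf-8",
    )
    method = read_quant_method(demo_path)

print(method)
```

With the snapshot downloaded in the example above, the call would be `read_quant_method(f"{path}/config.json")`.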

For a ready-to-run version with the tokenizer fallback handled automatically, see examples/inference_hf.py in the GLQ repo:

python examples/inference_hf.py \
    --model xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw \
    --prompt "Write a Python function that computes the Fibonacci sequence" \
    --max-tokens 100

Usage with vLLM

import glq_vllm  # or just: import glq  (registers vLLM plugin via entry_points)
from vllm import LLM, SamplingParams

llm = LLM(
    model="xv0y5ncu/Devstral-Small-2-24B-Instruct-GLQ-4bpw",
    tokenizer="mistralai/Devstral-Small-2-24B-Instruct-2512",
    quantization="glq",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=4096,
    enforce_eager=True,
)

sp = SamplingParams(max_tokens=200, temperature=0.7)
output = llm.generate(["Write a Python fibonacci function:"], sp)
print(output[0].outputs[0].text)

Quantization details

Base model: mistralai/Devstral-Small-2-24B-Instruct-2512 (FP8 weights, dequantized to bf16 during quantization)
Method: E8 Shell codebook + RHT + LDLQ, 4bpw (two-stage RVQ)
Calibration: 128 samples × 2048 tokens from WikiText-2
Layers: 40 transformer layers, 280 sublayers quantized
Time: ~31 minutes on NVIDIA L40S (streaming mode)
Architecture: Ministral3 (text backbone of Mistral3 multimodal)
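
A few quantities follow directly from the figures listed above. The arithmetic is spelled out here as a quick sanity check and uses only numbers from this list; the per-sublayer time is an average, not a measured value.

```python
calib_samples = 128
tokens_per_sample = 2048
layers = 40
sublayers = 280
minutes = 31

calib_tokens = calib_samples * tokens_per_sample   # total calibration tokens
sublayers_per_layer = sublayers // layers          # quantized sublayers per layer
seconds_per_sublayer = minutes * 60 / sublayers    # average wall time per sublayer

print(calib_tokens, sublayers_per_layer, round(seconds_per_sublayer, 2))
```

That is 262,144 calibration tokens, 7 quantized sublayers per transformer layer, and roughly 6.6 seconds per sublayer on average. Seven per layer would be consistent with four attention and three MLP projections, although the card does not enumerate them.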

License

Apache 2.0, the same as the base model.
