Laguna-XS-2.1 (MLX, 5bit)

Converted from poolside/Laguna-XS-2.1 to MLX format, quantized to 5 bits (group size 64, 5.502 bpw effective).
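
The effective bits-per-weight figure can be roughly reproduced from the quantization layout. The sketch below assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias, which is the usual layout for MLX affine quantization but is not stated in this repo. Under that assumption the metadata adds 32/64 = 0.5 bpw; the remaining ~0.002 bpw presumably comes from tensors left unquantized, which is also an assumption. The function name is illustrative only.

```python
def effective_bpw(bits: int, group_size: int = 64,
                  scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight for grouped affine quantization.

    Assumes one scale and one bias per group (assumed 16 bits each),
    so the per-weight overhead is (scale_bits + bias_bits) / group_size.
    """
    return bits + (scale_bits + bias_bits) / group_size


# Compare against the bpw column of the variants listed in this card.
for bits in (3, 4, 5, 6, 8):
    print(f"{bits}-bit: ~{effective_bpw(bits):.3f} bpw")
```

For the 5-bit variant this gives 5.5 bpw, close to the reported 5.502.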

Notes

  • Works with mlx-vlm, and with oMLX when the model is forced into VLM mode. mlx-lm does not support the laguna architecture yet; there is an open PR: mlx-lm#1223.
  • Occasionally a response starts with an empty </think> tag. This is rare and does not affect the rest of the output.
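
If the stray tag matters for a downstream consumer (for example, when showing raw output to users), it can be removed in post-processing. This is a minimal sketch; the helper name is hypothetical and not part of mlx-vlm or oMLX, and it only handles a single closing tag at the very start of the text.

```python
THINK_CLOSE = "</think>"


def strip_leading_think_close(text: str) -> str:
    """Drop one stray closing think tag at the start of a response.

    Leading whitespace before the tag and whitespace right after it
    are removed too. Text that does not start with the tag is
    returned unchanged.
    """
    stripped = text.lstrip()
    if stripped.startswith(THINK_CLOSE):
        return stripped[len(THINK_CLOSE):].lstrip()
    return text


print(strip_leading_think_close("</think>\n\nHello there."))
```

Tags that appear later in the response are left alone on purpose, since only the leading one was observed.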

Performance

Measured with oMLX's benchmark harness on a MacBook Pro (M5 Max, 128 GB, 40-core GPU), single request, 128 generated tokens:

Prompt (tokens)   Gen (tok/s)   Prefill (tok/s)   TTFT (ms)   Peak memory (GB)
1k                115.9         2461              416         22.1
4k                111.5         3820              1073        22.7
8k                106.9         3461              2367        22.7
16k               100.9         2991              5478        23.1
32k               87.7          2381              13764       23.7
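
The TTFT column is consistent with prompt length divided by prefill throughput. The sketch below checks that relationship against the measurements above, assuming "1k" means 1024 tokens (the card does not say whether it is 1000 or 1024). The function name is illustrative.

```python
def estimate_ttft_ms(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Estimate time to first token as prompt length / prefill throughput."""
    return prompt_tokens / prefill_tok_per_s * 1000.0


# (prompt tokens, prefill tok/s, measured TTFT in ms) from the table above.
# Token counts assume 1k = 1024.
measurements = [
    (1024, 2461, 416),
    (4096, 3820, 1073),
    (8192, 3461, 2367),
    (16384, 2991, 5478),
    (32768, 2381, 13764),
]

for tokens, prefill, measured in measurements:
    estimate = estimate_ttft_ms(tokens, prefill)
    print(f"{tokens:>6} tokens: estimated {estimate:8.0f} ms, measured {measured} ms")
```

The same function gives a rough TTFT for other prompt lengths if a prefill rate is interpolated from the neighbouring rows.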

Variants

Variant            bpw     Disk    Gen tok/s (1k → 32k prompt)
bf16               16      62 GB   70.6 → 58.7
8bit               8.500   33 GB   95.4 → 76.7
6bit               6.501   25 GB   102.9 → 80.9
5bit (this repo)   5.502   21 GB   115.9 → 87.7
4bit               4.503   18 GB   126.0 → 91.3
3bit               3.503   14 GB   137.2 → 98.8
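
The disk sizes scale roughly linearly with bpw. As a rough sketch, the size of a variant can be estimated by scaling the bf16 size by bpw / 16. This ignores file overhead and rounding in the listed sizes, so it is an approximation, and the function name is illustrative.

```python
BF16_DISK_GB = 62.0  # bf16 size as listed above
BF16_BPW = 16.0


def estimate_disk_gb(bpw: float) -> float:
    """Scale the bf16 disk size by the ratio of bits per weight."""
    return BF16_DISK_GB * bpw / BF16_BPW


# (variant, bpw, listed disk size in GB) as listed above.
variants = [
    ("8bit", 8.500, 33),
    ("6bit", 6.501, 25),
    ("5bit", 5.502, 21),
    ("4bit", 4.503, 18),
    ("3bit", 3.503, 14),
]

for name, bpw, listed in variants:
    print(f"{name}: estimated {estimate_disk_gb(bpw):.1f} GB, listed {listed} GB")
```

The estimates land within about 1 GB of the listed sizes.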

Usage

uvx --from mlx-vlm mlx_vlm.generate --model mlx-community/Laguna-XS-2.1-5bit --prompt "..." --max-tokens 300

License

OpenMDW-1.1, inherited from the base model.
