Laguna-XS-2.1 (MLX, 3bit)

Converted from poolside/Laguna-XS-2.1 to MLX format and quantized to 3 bits (group size 64, 3.503 effective bits per weight, bpw).

Notes

  • Works with mlx-vlm and with oMLX (when forcing the model's VLM mode). mlx-lm doesn't support the laguna architecture yet; there's an open PR: mlx-lm#1223.
  • Occasionally I got an empty </think> tag at the start of a response. It's uncommon and doesn't affect anything else in the output.
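
If the stray tag ever gets in the way of downstream parsing, it can be stripped after generation. This is a minimal sketch, assuming the artifact is a bare </think> (optionally surrounded by whitespace) at the very start of the text. `strip_leading_think_close` is a hypothetical helper name, not part of mlx-vlm or oMLX.

```python
import re

# Matches a closing </think> tag at the very start of the text,
# together with any whitespace around it.
_LEADING_THINK_CLOSE = re.compile(r"^\s*</think>\s*")


def strip_leading_think_close(text: str) -> str:
    """Remove one stray </think> from the start of a response, if present."""
    return _LEADING_THINK_CLOSE.sub("", text, count=1)


# Responses without the leading tag are returned unchanged.
print(strip_leading_think_close("</think>\n\nHello!"))
```

Only a tag at the very beginning is removed, so a </think> that appears later in a response is left alone.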

Performance

Measured with oMLX's benchmark harness on a MacBook Pro (M5 Max, 128 GB, 40-core GPU), single request, 128 generated tokens:

Prompt   Gen (tok/s)   Prefill (tok/s)   TTFT (ms)   Peak memory (GB)
1k       137.2         3959              259         14.3
4k       128.8         4003              1023        14.9
8k       124.4         3807              2152        15.0
16k      114.6         3214              5098        15.3
32k      98.8          2612              12546       15.9
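
The TTFT column is close to the prompt length divided by the prefill rate, which makes it easy to estimate latency for other settings. The sketch below is a rough back-of-the-envelope check and not part of the oMLX harness. It assumes that "1k" means 1024 prompt tokens (and so on), and that total request time is roughly TTFT plus generated tokens divided by the generation rate.

```python
# Figures copied from the table above:
# (prompt label, gen tok/s, prefill tok/s, measured TTFT in ms).
ROWS = [
    ("1k", 137.2, 3959, 259),
    ("4k", 128.8, 4003, 1023),
    ("8k", 124.4, 3807, 2152),
    ("16k", 114.6, 3214, 5098),
    ("32k", 98.8, 2612, 12546),
]


def prompt_tokens(label: str) -> int:
    # Assumption: "1k" is 1024 tokens, "4k" is 4096, and so on.
    return int(label.rstrip("k")) * 1024


def estimated_ttft_ms(label: str, prefill_tps: float) -> float:
    # Time to first token is approximated as pure prefill time.
    return prompt_tokens(label) / prefill_tps * 1000


def estimated_total_s(label: str, gen_tps: float, prefill_tps: float,
                      new_tokens: int = 128) -> float:
    # Rough total: prefill time plus decode time for new_tokens tokens.
    return estimated_ttft_ms(label, prefill_tps) / 1000 + new_tokens / gen_tps


for label, gen_tps, prefill_tps, measured_ms in ROWS:
    est_ms = estimated_ttft_ms(label, prefill_tps)
    total_s = estimated_total_s(label, gen_tps, prefill_tps)
    print(f"{label:>3}: TTFT ~{est_ms:.0f} ms (measured {measured_ms} ms), "
          f"~{total_s:.2f} s for 128 tokens")
```

With these numbers the estimate lands within about a millisecond of every measured TTFT, so prefill dominates the time to first token at all tested prompt lengths.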

Variants

Variant            bpw     Disk    Gen tok/s (1k → 32k)
bf16               16      62 GB   70.6 → 58.7
8bit               8.500   33 GB   95.4 → 76.7
6bit               6.501   25 GB   102.9 → 80.9
5bit               5.502   21 GB   115.9 → 87.7
4bit               4.503   18 GB   126.0 → 91.3
3bit (this repo)   3.503   14 GB   137.2 → 98.8
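
The bpw values sit about half a bit above the nominal bit width. That is what you would expect if each group of 64 weights also stores a 16-bit scale and a 16-bit bias. The sketch below works through that arithmetic. The per-group overhead is an assumption about the quantization layout, and group size 64 is only stated for this repo, so treat the numbers for the other variants as illustrative.

```python
def nominal_bpw(bits: int, group_size: int = 64,
                scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight for group-wise quantization.

    Assumes one scale and one bias per group (16 bits each by default),
    amortized over the group_size weights in that group.
    """
    return bits + (scale_bits + bias_bits) / group_size


for bits in (8, 6, 5, 4, 3):
    print(f"{bits}bit: {nominal_bpw(bits):.3f} bpw")
```

The small gap between these nominal values and the table (for example 3.500 vs 3.503) presumably comes from a few tensors kept at higher precision, though that is a guess and not something I verified.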

Usage

uvx --from mlx-vlm mlx_vlm.generate --model mlx-community/Laguna-XS-2.1-3bit --prompt "..." --max-tokens 300

License

OpenMDW-1.1, inherited from the base model.
