Laguna-XS-2.1 (MLX, bf16)

Converted from poolside/Laguna-XS-2.1 to MLX format in bfloat16 (unquantized).

Notes

  • Works with mlx-vlm and oMLX (forcing the model's vlm mode). mlx-lm doesn't support the laguna architecture yet; there's an open PR: mlx-lm#1223.
  • Occasionally I got an empty </think> tag at the start of a response. It's uncommon and doesn't affect the rest of the output.
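
If the stray tag gets in the way of downstream parsing, it can be stripped before use. A minimal sketch, assuming the response is already a Python string; `strip_leading_think_close` is a hypothetical helper, not part of mlx-vlm or oMLX:

```python
def strip_leading_think_close(text: str) -> str:
    """Remove one stray '</think>' at the start of a response, if present."""
    tag = "</think>"
    stripped = text.lstrip()
    if stripped.startswith(tag):
        # Drop the tag and any whitespace that follows it.
        return stripped[len(tag):].lstrip()
    # No stray tag at the start: return the response unchanged.
    return text
```

Only a tag at the very start is removed; a `</think>` later in the text is left alone.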

Performance

Measured with oMLX's benchmark harness on a MacBook Pro (M5 Max, 128 GB, 40-core GPU), single request, 128 generated tokens:

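| Prompt | Gen (tok/s) | Prefill (tok/s) | TTFT (ms) | Peak memory (GB) |
|---|---|---|---|---|
| 1k | 70.6 | 1104 | 929 | 63.0 |
| 4k | 69.2 | 3138 | 1306 | 63.4 |
| 8k | 67.0 | 3507 | 2336 | 63.6 |
| 16k | 63.8 | 3020 | 5426 | 63.9 |
| 32k | 58.7 | 2499 | 13114 | 64.5 |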
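
The TTFT column is roughly the prompt length divided by the prefill rate. A quick check of that relationship, assuming "1k" means 1024 tokens, "4k" means 4096, and so on (the table doesn't say); `estimate_ttft_ms` is a name introduced here for illustration:

```python
# (prompt tokens, prefill tok/s, measured TTFT ms) from the table above.
# Assumption: "1k" means 1024 tokens, "4k" means 4096, and so on.
ROWS = [
    (1024, 1104, 929),
    (4096, 3138, 1306),
    (8192, 3507, 2336),
    (16384, 3020, 5426),
    (32768, 2499, 13114),
]

def estimate_ttft_ms(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Estimate time to first token as prompt length / prefill rate."""
    return prompt_tokens / prefill_tok_s * 1000.0

for tokens, prefill, measured in ROWS:
    est = estimate_ttft_ms(tokens, prefill)
    print(f"{tokens:>6} tokens: estimated {est:7.0f} ms, measured {measured} ms")
```

Under that assumption the estimates land within about 1% of the measured values, so prefill dominates TTFT at every prompt length tested.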

Variants

| Variant | bpw | Disk | Gen tok/s (1k → 32k) |
|---|---|---|---|
| bf16 (this repo) | 16 | 62 GB | 70.6 → 58.7 |
| 8bit | 8.500 | 33 GB | 95.4 → 76.7 |
| 6bit | 6.501 | 25 GB | 102.9 → 80.9 |
| 5bit | 5.502 | 21 GB | 115.9 → 87.7 |
| 4bit | 4.503 | 18 GB | 126.0 → 91.3 |
| 3bit | 3.503 | 14 GB | 137.2 → 98.8 |
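
To read the table as speedups over bf16, divide each variant's generation rate by the bf16 rate at the same prompt length. A small sketch using only the numbers above; `speedup_vs_bf16` is a name introduced here for illustration:

```python
# Generation tok/s at 1k and 32k prompts, copied from the table above.
GEN_TOK_S = {
    "bf16": (70.6, 58.7),
    "8bit": (95.4, 76.7),
    "6bit": (102.9, 80.9),
    "5bit": (115.9, 87.7),
    "4bit": (126.0, 91.3),
    "3bit": (137.2, 98.8),
}

def speedup_vs_bf16(variant: str) -> tuple[float, float]:
    """Return (speedup at 1k, speedup at 32k) relative to bf16."""
    base_1k, base_32k = GEN_TOK_S["bf16"]
    gen_1k, gen_32k = GEN_TOK_S[variant]
    return gen_1k / base_1k, gen_32k / base_32k

for name in GEN_TOK_S:
    s1k, s32k = speedup_vs_bf16(name)
    print(f"{name:>5}: {s1k:.2f}x at 1k, {s32k:.2f}x at 32k")
```

The gain shrinks at longer prompts: the 4bit variant comes out about 1.78x faster than bf16 at 1k but about 1.56x at 32k.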

Usage

uvx --from mlx-vlm mlx_vlm.generate --model mlx-community/Laguna-XS-2.1-bf16 --prompt "..." --max-tokens 300
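
To call the same command from a script, one option is to wrap it with `subprocess`. This is a sketch, not an mlx-vlm API: `build_generate_command` and `run_generate` are hypothetical helpers, they pass only the flags shown above, and they assume `uvx` is on your PATH.

```python
import subprocess

MODEL = "mlx-community/Laguna-XS-2.1-bf16"

def build_generate_command(prompt: str, max_tokens: int = 300) -> list[str]:
    """Build the argv for the mlx_vlm.generate command shown above."""
    return [
        "uvx", "--from", "mlx-vlm", "mlx_vlm.generate",
        "--model", MODEL,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]

def run_generate(prompt: str, max_tokens: int = 300) -> str:
    """Run the command and return its stdout (assumes uvx is installed)."""
    result = subprocess.run(
        build_generate_command(prompt, max_tokens),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or spaces.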

License

OpenMDW-1.1, inherited from the base model.
