This model is an NVFP4‑quantized MLX variant of LiquidAI/LFM2.5‑8B‑A1B‑MLX‑bf16, a model created by LiquidAI. The original model is licensed under the LiquidAI Model License.

NVFP4 MLX Quantization — Performance & Quality

This model is a 4‑bit NVFP4 variant of the original BF16 LFM2.5‑8B‑A1B model, quantized with MLX. NVFP4 is a 4‑bit floating‑point format introduced by NVIDIA; MLX supports it as a quantization mode, which allows efficient inference on Apple Silicon GPUs.
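To make the format more concrete, here is a minimal sketch of NVFP4‑style block quantization in plain Python. It assumes the commonly described layout: blocks of 16 weights, each weight stored as a 4‑bit E2M1 value, with one scale per block. It is a simplification and not MLX's implementation: the block scale is kept as an ordinary float rather than rounded to FP8, and the names quantize_block, dequantize_block, and quantize are illustrative only.

```python
# Sketch of NVFP4-style block quantization (simplified; not MLX's code).
# Assumptions: 16-element blocks, E2M1 4-bit values, one scale per block.
# Real NVFP4 stores the block scale in FP8; here it stays a Python float.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable |values|
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes); each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from a block's scale and codes."""
    return [c * scale for c in codes]


def quantize(weights):
    """Quantize a flat list whose length is a multiple of BLOCK_SIZE."""
    assert len(weights) % BLOCK_SIZE == 0
    return [quantize_block(weights[i:i + BLOCK_SIZE])
            for i in range(0, len(weights), BLOCK_SIZE)]


weights = [((i * 37) % 101 - 50) / 50.0 for i in range(32)]  # toy data in [-1, 1]
blocks = quantize(weights)
restored = [v for scale, codes in blocks for v in dequantize_block(scale, codes)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks: {len(blocks)}, max abs error: {max_error:.4f}")
```

Because the representable values are spaced unevenly (steps of 0.5 near zero, steps of 2 near the maximum), small weights are stored more precisely than large ones within each block.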

Why NVFP4?

Relative to BF16, NVFP4 reduces memory usage by about 65% and increases generation speed by about 1.6–1.8× on M‑series chips, while preserving most of the model’s quality.

Performance Comparison (Representative MLX Benchmarks)

Metric	BF16	NVFP4	Notes
Memory usage	~15 GB	~5 GB	Fits on 16 GB Macs
Token speed (M5 Max)	~41 tok/s	~72 tok/s	~1.75× faster
Perplexity (relative to BF16)	1.00×	1.02–1.03×	~2–3% higher (worse)
Output quality	Baseline	~95–98% identical	Minor reasoning loss
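
The ratios in the table can be rechecked with a few lines of arithmetic. The sketch below uses the table's approximate figures as given. The bits‑per‑weight estimate is an assumption based on the usual NVFP4 layout (4‑bit values plus one 8‑bit scale per 16‑weight block); it is not a measurement of this model.

```python
# Figures copied from the table above (approximate, as reported).
bf16_gb, nvfp4_gb = 15.0, 5.0
bf16_tps, nvfp4_tps = 41.0, 72.0

memory_reduction = 1 - nvfp4_gb / bf16_gb
speedup = nvfp4_tps / bf16_tps

# Assumption: 4-bit values plus one 8-bit scale shared by 16 weights.
bits_per_weight = 4 + 8 / 16
ideal_ratio = bits_per_weight / 16  # size relative to 16-bit BF16

print(f"memory reduction: {memory_reduction:.0%}")
print(f"speedup: {speedup:.2f}x")
print(f"bits per weight: {bits_per_weight}, ideal size ratio: {ideal_ratio:.3f}")
```

With the table's figures this gives a memory reduction of about 67% and a speedup of about 1.76×. The 4.5 bits per weight is an idealized figure; the on‑disk and in‑memory size also depends on any layers left unquantized.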

Pros

Much lower memory footprint
Faster inference on macOS
Lower power usage
Ideal for laptops and smaller RAM configs

Cons

Slight quality degradation (~2–3% higher perplexity than BF16)
Not suitable for fine‑tuning
Slightly more drift in very long generations
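
The quality degradation listed above is stated relative to BF16. As a rough illustration of how a relative perplexity figure is obtained, the sketch below computes perplexity from per‑token log‑probabilities of the same text under each model. The log‑probability values are hypothetical and chosen only to show the calculation; they are not measurements of this model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical natural-log probabilities assigned to the same five tokens
# by each model. Illustrative values, not measured results.
bf16_logprobs = [-1.20, -0.35, -2.10, -0.80, -0.05]
nvfp4_logprobs = [-1.23, -0.37, -2.14, -0.83, -0.06]

ratio = perplexity(nvfp4_logprobs) / perplexity(bf16_logprobs)
print(f"relative perplexity: {ratio:.3f}x")
```

A ratio slightly above 1.00× means the quantized model is marginally less certain about the reference text than the BF16 baseline.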

Practical Impact

For chat, summarization, and coding, NVFP4 behaves almost identically to the BF16 model.
For math/logic‑heavy tasks, BF16 remains slightly more accurate.