Qwen2.5-Coder-3B-MLX-nvfp4

This repository contains 4-bit NVFP4-quantized weights for Qwen/Qwen2.5-Coder-3B, packaged for low-latency inference on Apple Silicon with the oMLX framework.

Qwen2.5-Coder-3B is one of the smaller models in the Qwen2.5-Coder series of code-specialized models. With roughly 3 billion parameters, it shares the architecture and training improvements of the rest of the Qwen2.5-Coder family, which makes it well suited to fast on-device autocomplete, inline code generation, and low-resource deployments.


🚀 Efficiency & Performance Advantages

Combining the compact 3B-parameter base model with 4-bit NVFP4 quantization gives this variant:

  • ⚡ Fast Generation (TPS): High token-generation and prefill throughput, which keeps IDE code completions responsive.
  • 📉 Small Memory Footprint: Low unified-memory usage, leaving room for the model to run alongside heavy local developer environments.
  • ⚙️ Mac Optimization: Native acceleration on Apple Silicon when run through an MLX-based execution layer such as oMLX.
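
The memory claim above can be made concrete with a rough back-of-the-envelope estimate. The sketch below assumes NVFP4 stores each weight in 4 bits plus one 8-bit scale per group of 16 weights (about 4.5 bits per weight), and uses a nominal 3,000,000,000 parameters. It ignores any layers left unquantized, the KV cache, and activations, so treat the result as a lower bound rather than a measured figure. The function name is illustrative only.

```python
def estimate_weight_bytes(num_params: int,
                          bits_per_weight: float = 4.0,
                          group_size: int = 16,
                          scale_bits: float = 8.0) -> float:
    """Approximate storage for block-quantized weights, including per-group scales.

    Assumption: one scale of `scale_bits` bits is stored per `group_size` weights.
    """
    effective_bits = bits_per_weight + scale_bits / group_size
    return num_params * effective_bits / 8


if __name__ == "__main__":
    params = 3_000_000_000  # nominal "3B"; the exact parameter count differs
    quantized = estimate_weight_bytes(params)
    # BF16 baseline: 16 bits per weight, no scales.
    bf16 = estimate_weight_bytes(params, bits_per_weight=16.0, scale_bits=0.0)
    print(f"NVFP4 weights: ~{quantized / 1e9:.2f} GB")
    print(f"BF16 weights:  ~{bf16 / 1e9:.2f} GB")
    print(f"Reduction:     ~{bf16 / quantized:.1f}x")
```

Under these assumptions the quantized weights need a little under 1.7 GB, compared with about 6 GB for BF16 weights.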

🛠️ Deployment & Execution Quickstart

To use this model on macOS, run it with an inference runtime that can read the nvfp4 quantization metadata stored with the weights.
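
One way to check what a runtime will see is to inspect the quantization metadata before loading the weights. The sketch below assumes the repository's config.json carries a `quantization` block with `mode`, `bits`, and `group_size` keys, as MLX conversions commonly write. Those key names and the sample values are assumptions, not something confirmed for this repository, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def quantization_summary(config: dict) -> dict:
    """Extract the quantization fields from a parsed config.json.

    Assumption: the fields live under a top-level "quantization" key.
    Missing fields are returned as None.
    """
    quant = config.get("quantization") or {}
    return {
        "mode": quant.get("mode"),
        "bits": quant.get("bits"),
        "group_size": quant.get("group_size"),
    }


def is_nvfp4(config: dict) -> bool:
    """Return True when the config declares 4-bit nvfp4 quantization."""
    summary = quantization_summary(config)
    return summary["mode"] == "nvfp4" and summary["bits"] == 4


def load_config(path: str) -> dict:
    """Read a local config.json (for example, from a downloaded snapshot)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Illustrative values only; read the real file with load_config().
    sample = {"quantization": {"mode": "nvfp4", "bits": 4, "group_size": 16}}
    print(quantization_summary(sample))
    print("nvfp4:", is_nvfp4(sample))
```

If the check fails, the runtime in use may fall back to a different dequantization path or refuse to load the weights, so it is worth confirming before benchmarking.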

Running with oMLX

Run a local benchmark from the terminal, replacing your-hf-username with the account that hosts this repository:

    omlx bench --model your-hf-username/Qwen2.5-Coder-3B-MLX-nvfp4 --prompt "Write a Python function to clear a list."
