Nex-N2-mini-FP8-RTN
FP8 block-quantized version of Nex-N2-mini using per-expert RTN (round-to-nearest) quantization.
Quantized with RTN in 128×128 blocks using per-expert weight format. MoE expert weights are stored as float8_e4m3fn; all other layers (attention, layernorms, router gates, vision encoder, lm_head, embeddings) remain in BF16.
Format
- Quant method:
fp8(native vLLM format, notcompressed-tensors) - Weight block size:
[128, 128] - Scale type:
bfloat16per-blockweight_scale_inv - Checkpoint size: ~36 GB (vs ~70 GB BF16 original)
- Serving: Requires vLLM with
--linear-backend tritonon Blackwell GPUs
Inference
docker run --rm -it --gpus all --shm-size=32g \
-v /path/to/Nex-N2-mini-FP8-RTN:/models/model \
-p 8000:8000 \
vllm/vllm-openai:latest \
/models/model \
--host 0.0.0.0 --port 8000 \
--served-model-name Nex-N2-mini-FP8-RTN \
--max-model-len 262144 \
--gpu-memory-utilization 0.80 \
--trust-remote-code \
--linear-backend triton
For tool calling and reasoning, add:
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
Technical notes
This is a hand-rolled RTN quantization (no calibration data, no llm-compressor). Each expert weight is independently block-scaled using per-block absmax / 448.0 scaling. See fp8-moe-rtn-quant skill for the full methodology and pitfalls.
- Downloads last month
- 4,218
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support
Model tree for vio1ator/Nex-N2-mini-FP8-RTN
Base model
nex-agi/Nex-N2-mini