Nex-N2-mini-FP8-RTN

FP8 block-quantized version of Nex-N2-mini using per-expert RTN (round-to-nearest) quantization.

Quantized with RTN in 128×128 blocks using per-expert weight format. MoE expert weights are stored as float8_e4m3fn; all other layers (attention, layernorms, router gates, vision encoder, lm_head, embeddings) remain in BF16.

Format

  • Quant method: fp8 (native vLLM format, not compressed-tensors)
  • Weight block size: [128, 128]
  • Scale type: bfloat16 per-block weight_scale_inv
  • Checkpoint size: ~36 GB (vs ~70 GB BF16 original)
  • Serving: Requires vLLM with --linear-backend triton on Blackwell GPUs

Inference

docker run --rm -it --gpus all --shm-size=32g \
  -v /path/to/Nex-N2-mini-FP8-RTN:/models/model \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  /models/model \
    --host 0.0.0.0 --port 8000 \
    --served-model-name Nex-N2-mini-FP8-RTN \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.80 \
    --trust-remote-code \
    --linear-backend triton

For tool calling and reasoning, add:

    --enable-auto-tool-choice \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder

Technical notes

This is a hand-rolled RTN quantization (no calibration data, no llm-compressor). Each expert weight is independently block-scaled using per-block absmax / 448.0 scaling. See fp8-moe-rtn-quant skill for the full methodology and pitfalls.

Downloads last month
4,218
Safetensors
Model size
35B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for vio1ator/Nex-N2-mini-FP8-RTN

Quantized
(51)
this model