Quantized using llmcompressor 0.9.0 with the following recipe:

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', lm_head]
      scheme: FP8_DYNAMIC
```
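For intuition about what the `FP8_DYNAMIC` scheme does to each `Linear` layer, here is a minimal pure-Python sketch of dynamic FP8 (E4M3) fake-quantization: the scale is computed at runtime from the observed activation max (no calibration pass), and values are rounded to a 3-bit mantissa. This is illustrative only, not the llmcompressor implementation, which works on tensors and handles subnormals, NaN encodings, and per-channel weight scales.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the OCP float8 E4M3 format

def dynamic_scale(values):
    # "Dynamic" scale: derived from the runtime max of the data itself,
    # so no offline calibration dataset is needed.
    return max(abs(v) for v in values) / E4M3_MAX

def fake_quantize(values):
    # Scale into the FP8 range, round the mantissa to 3 bits, scale back.
    s = dynamic_scale(values)
    out = []
    for v in values:
        x = v / s
        if x == 0.0:
            out.append(0.0)
            continue
        e = math.floor(math.log2(abs(x)))  # exponent of the value's binade
        step = 2.0 ** (e - 3)              # 3 mantissa bits -> 8 steps per binade
        out.append(round(x / step) * step * s)
    return out
```

Powers of two survive the round-trip essentially unchanged, while in-between values pick up a small rounding error; that error is the accuracy cost the FP8 checkpoint trades for roughly half the memory of BF16.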

You can deploy the model with sglang using the following command:

```shell
python3 -m sglang.launch_server \
        --model-path CHNtentes/Nanbeige4-3B-Thinking-2511-FP8-DYNAMIC \
        --host 0.0.0.0 \
        --port 9999 \
        --trust-remote-code \
        --mem-fraction-static 0.9 \
        --context-length 16384
```

Adjust `--context-length` to fit your available VRAM.
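Once the server is up, sglang exposes an OpenAI-compatible API under `/v1`. Below is a minimal standard-library sketch of a chat request against it; the model name and port come from the launch command above, while the `localhost` default assumes you query from the same machine the server binds on (the `urlopen` call is commented out because it needs the server running):

```python
import json
import urllib.request

def build_chat_request(prompt, host="localhost", port=9999):
    """Build an OpenAI-compatible chat request for the sglang server above."""
    payload = {
        "model": "CHNtentes/Nanbeige4-3B-Thinking-2511-FP8-DYNAMIC",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = json.load(urllib.request.urlopen(build_chat_request("Hello")))
# print(resp["choices"][0]["message"]["content"])
```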
