VibeThinker-3B-FP8-block

FP8 (block-wise) quantization of a VibeThinker-3B / Qwen2.5-3B reasoning model, ready for inference with vLLM on NVIDIA Hopper/Ada GPUs (H100/H200/L40/4090).

Quantized by Hert4.

What this is

Weights and activations are quantized from BF16 to FP8 (E4M3) using LLM Compressor with the FP8_BLOCK scheme:

Weights FP8 E4M3, block-wise scaling, block_structure = [128, 128]
Activations FP8 E4M3, dynamic, per-group group_size = 128
Kept in original precision lm_head, embed_tokens
Format compressed-tensors (float-quantized)
Size ~3.2 GB (≈50% of BF16)

This is the same recipe used by DeepSeek-V3-style FP8 checkpoints and pairs with DeepGEMM block-FP8 kernels. The checkpoint is self-contained — the original BF16 model is not required at runtime.

Note: derived from an abliterated (uncensored) VibeThinker-3B finetune — safety behavior differs from the original base. Use responsibly; you are responsible for outputs.

Usage (vLLM)

vllm serve <your-namespace>/VibeThinker-3B-FP8-block \
  --tensor-parallel-size 1 \
  --max-model-len 32768

vLLM auto-detects the FP8 quantization from config.json — do not pass --quantization.

Creation

from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="<source-bf16-model>",
    save_directory="VibeThinker-3B-FP8-block",
    scheme="FP8_BLOCK",
    ignore=["lm_head", "re:.*embed_tokens.*"],
    device="cpu",
)

Credits

Downloads last month
12
Safetensors
Model size
3B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for beyoru/VibeThinker-3B-FP8-block

Base model

Qwen/Qwen2.5-3B
Quantized
(45)
this model