Support this work → · X · GitHub · REAP paper · Cerebras REAP

Qwen3.5-264B-FP8

FP8 quantization of 0xSero/Qwen3.5-264B.

At a glance

Base model 0xSero/Qwen3.5-264B
Format FP8
Total params 264B
Active / token —
Experts / layer 336
Layers 60
Hidden size 4096
Context 262,144
On-disk size 272 GB

Which variant should I pick?

Variant Format Link
Qwen3.5-264B BF16 link
Qwen3.5-264B-FP8 (this) FP8 link
Qwen3.5-264B-W4A16 W4A16 link
Qwen3.5-28B BF16 link
Qwen3.5-35B-EXL3-4bpw EXL3-4bpw link
Qwen3.5-76B BF16 link
Qwen3.5-76B-GGUF GGUF link
Qwen3.5-88B BF16 link
Qwen3.5-99B BF16 link
Qwen3.5-99B-GGUF GGUF link
  • Repository: 0xSero/Qwen3.5-264B-FP8
  • Base model: 0xSero/Qwen3.5-264B (BF16)
  • Original model: Qwen/Qwen3.5-397B-A17B
  • Artifact kind: pruned + quantized
  • Quantization: FP8 W8A8 (float8_e4m3fn), per-tensor dynamic
  • Compression ratio: ~48% (from BF16 REAP checkpoint)

Details

  • Maintainer: 0xSero
  • Organization: Sybil Solutions
  • Project: REAP PR17

Model Description

FP8 quantized version of Qwen3.5-264B-REAP, a REAP-pruned Qwen3.5-397B-A17B with 176 of 512 experts removed per MoE layer, retaining 336 experts per layer, for an estimated 264B total parameters.

Quantization Details

  • Format: float8_e4m3fn (FP8 E4M3, 4-bit exponent, 3-bit mantissa)
  • Scheme: Per-tensor symmetric dynamic quantization
  • Targets: All Linear layer weights (q/k/v/o projections, gate/up/down projections, MoE expert and shared expert projections)
  • Ignored: lm_head, layer norms, embeddings, biases
  • Serialization: compressed-tensors format (native vLLM/SGLang support)
  • Size: ~253GB on disk (down from ~491GB BF16)

Hardware Compatibility

  • NVIDIA Ada Lovelace (SM89): via Marlin FP8 kernel
  • NVIDIA Hopper (SM90): native FP8 tensor core support
  • NVIDIA Blackwell (SM100/SM120): native FP8 tensor core support

Vision Encoder

Vision encoder weights from Qwen/Qwen3.5-397B-A17B are included (333 tensors, ~910MB BF16). The VL encoder is not quantized -- it remains in original BF16 precision.

Usage

vLLM

vllm serve 0xSero/Qwen3.5-264B-FP8 \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --kv-cache-dtype auto \
  --trust-remote-code

Note: Use --kv-cache-dtype auto (not fp8_e4m3) on Blackwell (SM120) GPUs to avoid garbled output.

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/Qwen3.5-264B-FP8",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/Qwen3.5-264B-FP8", trust_remote_code=True)

Provenance

  • Source checkpoint: 0xSero/Qwen3.5-264B (BF16, REAP-pruned)
  • Quantization method: Per-tensor dynamic FP8 cast with compressed-tensors config
  • Quantization compute: CPU-only (no calibration data required for weight-only FP8)

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Downloads last month
18
Safetensors
Model size
264B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for 0xSero/Qwen3.5-264B-FP8

Quantized
(1)
this model

Datasets used to train 0xSero/Qwen3.5-264B-FP8

Collection including 0xSero/Qwen3.5-264B-FP8

Paper for 0xSero/Qwen3.5-264B-FP8