Qwen3.5-122B-A10B-NotaCompression-INT4

Nota AI compressed Qwen3.5-122B-A10B — a Mixture-of-Experts (MoE) LLM shrunk with MoE-aware INT4 quantization and global expert pruning, retaining near-original quality while running comfortably on a single H100.

250.17 GB → 69.49 GB (−72.22%)  ·  3.6× smaller
98.79% performance retained (avg. of 5 reasoning benchmarks)


📌 Highlights

  • MoE-specialized quantization — INT4 weight quantization tuned for the MoE structure, minimizing accuracy loss on MoE layers. (Method (1) ↗, Method (2) ↗)
  • Global expert-sensitivity pruning (15%) — instead of conventional uniform pruning that removes the same number of experts from every block, Nota measures a model-wide expert sensitivity score and prunes experts according to their true global importance. The most expendable experts are removed wherever they are, so blocks end up keeping different numbers of experts — far more favorable to quality preservation than uniform cuts.
  • Runs on a single H100 — most INT4-only quantized MoE models on the Hub still cannot fit on one H100, but this compressed model serves on a single H100 (80 GB) — and scales to higher throughput / longer context on 2 GPUs.
  • Quality retained98.79% of the BF16 baseline retained on average (5 reasoning benchmarks), within ~1–2 points across knowledge, math, reasoning, coding, and agentic tasks.

🧠 About Qwen3.5

Qwen3.5-122B-A10B is a large Mixture-of-Experts language model: it has ~122B total parameters but activates only ~10B per token by routing each token to a small subset of experts. This gives the capacity of a very large model at the inference cost of a much smaller one, with strong performance across reasoning, math, coding, and tool use.

This repository provides a compressed variant produced by Nota AI's compression pipeline.


🗜️ What Nota Compression Does

Stage Technique Effect
Quantization MoE-aware INT4 Weights packed to 4-bit; expert layers quantized with MoE-specific calibration
Pruning Global expert-sensitivity pruning, 15% removed Experts removed by model-wide importance score, not a fixed per-block quota

Unlike uniform pruning that removes a fixed number of experts from every block, Nota's method scores each expert by its global sensitivity across the whole model and removes only the most expendable ones. As a result different blocks retain a different number of experts — a non-uniform layout that preserves quality far better. The custom model file shipped here (see Patch vLLM) is required to support this non-uniform expert layout.


🚀 Usage

Environment

Install into a uv environment.

uv venv
uv pip install vllm==0.22.0

Required: vLLM 0.22.0

Patch vLLM (required)

This model uses a different number of experts per block. To support that layout, replace vLLM's model definition with the file provided in this repo:

cp patch/qwen3_5.py /path/to/vllm/model_executor/models/qwen3_5.py

🖥️ Serving with vLLM

Standard (H100 × 2)

vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

With tool calling

vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Single GPU (H100 × 1)

The following settings run comfortably on a single H100:

vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --max-num-seqs 96 \
  --gpu-memory-utilization 0.93

💡 On a single 80 GB GPU, KV-cache is the main constraint. If you hit max_num_seqs exceeds available Mamba cache blocks, lower --max-num-seqs or reduce --max-model-len to free cache.


📊 Benchmark Performance

Model MMLU-Pro
(Knowledge)
AIME 24&25
(Math)
GPQA Diamond
(STEM/Reasoning)
HumanEval
(Coding)
BFCL-V3
(Agent)
Average
Qwen3.5-122B-A10B (BF16) 86.42 93.33 85.35 94.51 95.00 90.92
Intel INT4 85.97 91.67 82.32 93.90 93.33 89.44 (−1.63%)
Qwen Official INT4 85.92 93.33 84.34 89.63 93.42 89.33 (−1.75%)
▶ Nota INT4 (this model) 84.19 93.33 83.84 93.25 94.51 89.82 (−1.21%)

Benchmarks: MMLU-Pro, AIME 2024 & 2025, GPQA Diamond, HumanEval, BFCL-V3. Percentages in parentheses are the average reduction relative to the original Qwen3.5-122B-A10B (BF16). This model shows the smallest average drop (−1.21%) among the compressed variants while being the smallest in size.

💾 Memory Footprint

Model Weight Size (GB) Reduction vs. BF16
Qwen3.5-122B-A10B (BF16) 250.17
Intel INT4 76.71 (−69.34%)
Qwen Official INT4 78.84 (−68.49%)
▶ Nota INT4 (this model) 69.49 (−72.22%)

Weight Size is the on-disk size of the model tensors. Reduction is relative to the original Qwen3.5-122B-A10B (BF16, 250.17 GB).

Despite removing 15% of experts and quantizing to INT4, the model keeps the smallest average quality drop (−1.21%) among compressed variants while achieving the largest memory reduction (−72.22%, 3.6× smaller) — running on less than a third of the original footprint.


📝 Citation

If you use this model or write a paper based on it, please cite the underlying Nota quantization techniques:

@article{park2026vsa,
  title   = {Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models},
  author  = {Park, Hancheol and Lee, Geonho and Piao, Tairen and Kim, Tae-Ho},
  journal = {arXiv preprint arXiv:2606.05688},
  year    = {2026},
  url     = {https://arxiv.org/abs/2606.05688}
}

@inproceedings{park2026dreammoe,
  title     = {DREAM-MoE: Downstream Routing Error-Aware Margin-Preserving Quantization for Mixture-of-Experts Large Language Models},
  author    = {Park, Hancheol and Lee, Geonho and Kim, Tae-Ho},
  booktitle = {ICML 2026 Workshop on Adaptive Foundation Models (AdaptFM)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=Wyhqwjl51A}
}

This model is a compressed derivative of Qwen3.5-122B-A10B produced by Nota AI. Please also credit the original Qwen authors when using this model.


Made with ❤️ by Nota AI

Downloads last month
15
Safetensors
Model size
22B params
Tensor type
I32
·
BF16
·
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4

Quantized
(126)
this model

Paper for nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4