MoQ: Mixture of Quants

image

🚀 MoQ: Mixture of Quants

MoQ (Mixture of Quants) is a smart way to shrink AI models without losing their "brainpower." Unlike older methods that quantize every part of the model at the same precision, MoQ identifies the most important parts and keeps them at high precision, while heavily compressing the rest to save space.

**Stop settling for uniform bitrates. Standard quantization is a relic of the past, treating vital cognitive weights the same as redundant noise.**


The result? A model that punches significantly above its weight class.

Comparison

Here is the comparison between MoQ and Jackrong's quants for his model. MoQ performs better by a margin large enough that you can save about a GB at the same performance. All three metrics show the improvement:

image

image

image

Background evaluations:

Benjamin Marie evaluated MoQ GGUFs ("Mixture of Quants") against Unsloth Dynamic (UD) quants on the original Qwen 3.5 9B, focusing on low-bit versions below 4 bits on average, the range where GGUF models typically struggle most. Results: at similar bits per weight (BPW), MoQ outperforms Unsloth Dynamic quants by ~10% on benchmarks, while also being roughly 2× more token-efficient on average.

"MoQ models are much better than UD quants on benchmarks, and they are also more token-efficient."

image

image


Files

| Folder Link | BPW | Total Size | Description |
|---|---|---|---|
| 📂 MoQ-Quants | 3.3 | 3.83 GB | |
| 📂 MoQ-Quants | 3.7 | 4.28 GB | |
| 📂 MoQ-Quants | 3.9 | 4.47 GB | |
| 📂 MoQ-Quants | 4.2 | 4.89 GB | |
| 📂 MoQ-Quants | 4.4 | 5.09 GB | |
| 📂 MoQ-Quants | 4.7 | 5.36 GB | |
| 📂 MoQ-Quants | 4.9 | 5.62 GB | |
| 📂 MoQ-Quants | 5.0 | 5.74 GB | |
| 📂 MoQ-Quants | 5.2 | 6.00 GB | |
| 📂 MoQ-Quants | 5.4 | 6.17 GB | |
| 📂 MoQ-Quants | 6.6 | 7.62 GB | |
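The BPW and Total Size columns are linked by simple arithmetic: average bits per weight is roughly the file size in bits divided by the number of parameters. The sketch below checks that relationship for a few rows. It assumes decimal gigabytes and a parameter count of about 9.25 billion. That count is back-solved from the table for illustration and is not an official number. GGUF metadata overhead is ignored.

```python
# Rough bits-per-weight estimate from file size (a sketch, not the author's tooling).
# ASSUMPTIONS: sizes are decimal gigabytes (1 GB = 1e9 bytes); the parameter
# count is a hypothetical value back-solved from the table; metadata is ignored.

ASSUMED_PARAMS = 9.25e9  # hypothetical, for illustration only


def estimate_bpw(size_gb: float, n_params: float = ASSUMED_PARAMS) -> float:
    """Return the average bits per weight for a file of size_gb gigabytes."""
    return size_gb * 1e9 * 8 / n_params


def size_saved_gb(bpw_high: float, bpw_low: float,
                  n_params: float = ASSUMED_PARAMS) -> float:
    """Return the size difference in GB between two average bit widths."""
    return (bpw_high - bpw_low) * n_params / 8 / 1e9


if __name__ == "__main__":
    # (listed BPW, listed size in GB) pairs copied from the table above
    for listed_bpw, size_gb in [(3.3, 3.83), (4.2, 4.89), (6.6, 7.62)]:
        est = estimate_bpw(size_gb)
        print(f"listed {listed_bpw} BPW, {size_gb} GB -> estimated {est:.2f} BPW")
```

Under these assumptions, stepping down by a little under one bit per weight corresponds to roughly one gigabyte on disk.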

This is the MTP repo. MTP speculative decoding is for faster generation.

🧠 The MoQ Edge

MoQ optimizes the architecture for the Pareto frontier of memory and performance.

  • Dynamic Bitrate Allocation: No more "one-size-fits-all." MoQ assigns precision where it actually matters.
  • Cognitive Preservation: Massive VRAM savings with near-zero degradation in logic and coherence.
  • Next-Gen Efficiency: Fits "Large" model intelligence into "Small" model hardware.
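To make the idea of dynamic bitrate allocation concrete, here is a minimal sketch of one way to spend a fixed average-BPW budget on the most sensitive tensors first. It is not the actual MoQ algorithm, which this page does not describe. The group names, parameter counts, importance scores, and bit levels are all hypothetical.

```python
# Illustrative mixed-precision allocation (NOT the actual MoQ algorithm).
# Group names, sizes, importance scores, and bit levels are hypothetical.
from dataclasses import dataclass


@dataclass
class TensorGroup:
    name: str
    n_params: int
    importance: float  # higher = assumed more sensitive to quantization


def average_bpw(groups, bits):
    """Parameter-weighted average bit width across all groups."""
    total = sum(g.n_params for g in groups)
    return sum(g.n_params * bits[g.name] for g in groups) / total


def allocate_bits(groups, budget_bpw, levels=(2, 3, 4, 6, 8)):
    """Greedy allocation: start every group at the lowest level, then keep
    upgrading the most important group that still fits within budget_bpw."""
    bits = {g.name: levels[0] for g in groups}
    by_importance = sorted(groups, key=lambda g: g.importance, reverse=True)
    upgraded = True
    while upgraded:
        upgraded = False
        for group in by_importance:
            idx = levels.index(bits[group.name])
            if idx + 1 == len(levels):
                continue  # already at the highest level
            trial = dict(bits)
            trial[group.name] = levels[idx + 1]
            if average_bpw(groups, trial) <= budget_bpw:
                bits = trial
                upgraded = True
                break  # restart so the most important group is tried first
    return bits


if __name__ == "__main__":
    groups = [
        TensorGroup("attention", 2_000_000_000, 0.9),
        TensorGroup("embeddings", 1_000_000_000, 0.6),
        TensorGroup("feed_forward", 6_000_000_000, 0.2),
    ]
    bits = allocate_bits(groups, budget_bpw=4.2)
    print(bits, f"average = {average_bpw(groups, bits):.2f} BPW")
```

A uniform scheme would give every group the same bit width. The greedy version keeps the small, high-importance groups at high precision and pays for that by compressing the largest, least sensitive group hardest.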

X: https://x.com/WaleedAhmad1a10

If MoQ does not perform well, email me: waleedahmad.1a10@gmail.com

🛠 Usage & Deployment

./llama-cli -m Qwen3.5-9B-MoQ-4.85.gguf -p "The future of efficient AI is..."
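If you prefer to launch the same command from a script, the sketch below builds the argument list in Python and runs it only when both the binary and the model file are present. It assumes `llama-cli` sits in the current directory and the GGUF file has been downloaded next to it. The `-n` value (number of tokens to generate) is an arbitrary choice and not a setting recommended by the author.

```python
# Sketch: build and optionally run the llama-cli command from Python.
# ASSUMPTIONS: llama-cli is in the current directory, the GGUF file sits
# next to it, and the -n value (tokens to generate) is arbitrary.
import os
import subprocess


def build_command(model_path, prompt, n_predict=128, binary="./llama-cli"):
    """Return the argv list for a single llama-cli generation run."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]


if __name__ == "__main__":
    cmd = build_command("Qwen3.5-9B-MoQ-4.85.gguf",
                        "The future of efficient AI is...")
    if os.path.exists(cmd[0]) and os.path.exists(cmd[2]):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    else:
        print("Would run:", " ".join(cmd))
```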