
Libraries

How to use nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below include an image, so the image-text-to-text task is used here.
pipe = pipeline("image-text-to-text", model="nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4")
model = AutoModelForMultimodalLM.from_pretrained("nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4

SGLang

How to use nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Qwen3.5-122B-A10B-NotaCompression-INT4

Nota AI's compressed build of Qwen3.5-122B-A10B, a Mixture-of-Experts (MoE) LLM shrunk with MoE-aware INT4 quantization and global expert pruning. It retains near-original quality while running comfortably on a single H100.

250.17 GB → 69.49 GB (−72.22%) · 3.6× smaller
98.79% performance retained (avg. of 5 reasoning benchmarks)

📌 Highlights

MoE-specialized quantization: INT4 weight quantization tuned for the MoE structure to minimize accuracy loss on MoE layers (the two methods are the papers listed under Citation).
Global expert-sensitivity pruning (15%): conventional uniform pruning removes the same number of experts from every block. Nota instead measures a model-wide expert sensitivity score and prunes experts by global importance, removing the most expendable experts wherever they sit. Blocks therefore keep different numbers of experts, which preserves quality better than uniform cuts.
Runs on a single H100: most INT4-only quantized MoE models on the Hub still cannot fit on one H100 (the two INT4 variants compared under Memory Footprint carry 76.71 GB and 78.84 GB of weights), but this compressed model (69.49 GB) serves on a single 80 GB H100 and scales to higher throughput or longer context on 2 GPUs.
Quality retained: 98.79% of the BF16 baseline average across the 5 benchmarks, with no single benchmark (knowledge, math, reasoning, coding, agentic) dropping by more than about 2.2 points.

🧠 About Qwen3.5

Qwen3.5-122B-A10B is a large Mixture-of-Experts language model: it has ~122B total parameters but activates only ~10B per token by routing each token to a small subset of experts. This gives the capacity of a very large model at the inference cost of a much smaller one, with strong performance across reasoning, math, coding, and tool use.
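
To make the routing idea concrete, here is a minimal sketch of generic top-k expert routing for a single token. It is not the Qwen3.5 implementation: the expert count, the value of `top_k`, the renormalization step, and the toy scalar "experts" are all assumptions chosen for illustration.

```python
import math

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and return (expert_index, weight) pairs."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected weights sum to 1 (an assumed convention).
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_forward(x, experts, router_logits, top_k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, top_k))

# Toy usage: 8 scalar "experts" (hypothetical), 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
print(route_token(logits, top_k=2))
print(moe_forward(3.0, experts, logits, top_k=2))
```

Only the selected experts are evaluated, which is why the per-token cost tracks the active parameter count rather than the total.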

This repository provides a compressed variant produced by Nota AI's compression pipeline.

🗜️ What Nota Compression Does

| Stage | Technique | Effect |
|---|---|---|
| Quantization | MoE-aware INT4 | Weights packed to 4-bit; expert layers quantized with MoE-specific calibration |
| Pruning | Global expert-sensitivity pruning, 15% removed | Experts removed by model-wide importance score, not a fixed per-block quota |

Unlike uniform pruning, which removes a fixed number of experts from every block, Nota's method scores each expert by its global sensitivity across the whole model and removes only the most expendable ones. As a result, different blocks retain different numbers of experts, a non-uniform layout that preserves quality better than uniform cuts. The custom model file shipped here (see Patch vLLM) is required to support this non-uniform expert layout.
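
The difference between the two pruning strategies can be sketched as follows. This is an illustration only: the card does not describe how the sensitivity score is computed, so the scores below are made up, the toy uses a 25% budget on 12 experts instead of the model's 15%, and the `min_keep` safeguard is an assumption rather than part of Nota's published method.

```python
def uniform_prune(scores, fraction):
    """Remove the same number of lowest-scoring experts from every block."""
    kept = {}
    for block, block_scores in scores.items():
        n_remove = int(len(block_scores) * fraction)
        order = sorted(range(len(block_scores)), key=lambda e: block_scores[e])
        kept[block] = sorted(order[n_remove:])
    return kept

def global_prune(scores, fraction, min_keep=1):
    """Remove the lowest-scoring experts model-wide, wherever they sit."""
    ranked = sorted(
        (score, block, expert)
        for block, block_scores in scores.items()
        for expert, score in enumerate(block_scores)
    )
    budget = int(len(ranked) * fraction)
    kept = {block: set(range(len(s))) for block, s in scores.items()}
    for _, block, expert in ranked:
        if budget == 0:
            break
        # Assumed safeguard: never empty a block completely.
        if len(kept[block]) > min_keep:
            kept[block].discard(expert)
            budget -= 1
    return {block: sorted(experts) for block, experts in kept.items()}

# Hypothetical sensitivity scores: 3 blocks x 4 experts (higher = more important).
scores = {
    0: [0.90, 0.80, 0.70, 0.60],
    1: [0.05, 0.10, 0.95, 0.15],
    2: [0.50, 0.85, 0.75, 0.65],
}
print("uniform:", uniform_prune(scores, 0.25))
print("global: ", global_prune(scores, 0.25))
```

In this toy case the uniform strategy drops moderately useful experts from blocks 0 and 2, while the global strategy spends its whole budget on the three near-zero experts in block 1, leaving blocks with different expert counts.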

🚀 Usage

Environment

Install into a uv environment.

uv venv
uv pip install vllm==0.22.0

Required: vLLM 0.22.0

Patch vLLM (required)

This model uses a different number of experts per block. To support that layout, replace vLLM's model definition with the file provided in this repo:

cp patch/qwen3_5.py /path/to/vllm/model_executor/models/qwen3_5.py
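
If the install location of vLLM inside the uv environment is not obvious, a small helper like the one below can resolve it and apply the copy. This is a sketch, not part of the repository: the function names and the `.orig` backup are hypothetical conveniences, and it assumes the script runs with the environment's Python from the repository root.

```python
import importlib.util
import shutil
from pathlib import Path

def find_vllm_root():
    """Return the directory of the installed vllm package, or None if absent."""
    spec = importlib.util.find_spec("vllm")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])

def apply_patch(patch_file, vllm_root, backup=True):
    """Copy patch_file over model_executor/models/qwen3_5.py under vllm_root."""
    target = Path(vllm_root) / "model_executor" / "models" / "qwen3_5.py"
    if not target.parent.is_dir():
        raise FileNotFoundError(f"not a vLLM package directory: {vllm_root}")
    if backup and target.exists():
        # Keep the stock file next to the patched one (hypothetical convenience).
        shutil.copy2(target, target.with_name(target.name + ".orig"))
    shutil.copy2(patch_file, target)
    return target

# Example (run inside the environment where vLLM 0.22.0 is installed):
#   root = find_vllm_root()
#   if root is not None:
#       print("patched", apply_patch(Path("patch") / "qwen3_5.py", root))
```

Keeping a backup makes it easy to restore the stock model definition when serving other Qwen3.5 checkpoints from the same environment.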

🖥️ Serving with vLLM

Standard (H100 × 2)

vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

With tool calling

vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
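
Once a server is running, requests go to its OpenAI-compatible chat completions endpoint. The sketch below builds such a request with the standard library only. The `get_weather` tool, the helper names, and the `max_tokens` default are hypothetical; the port matches the `--port 8000` used above.

```python
import json
import urllib.request

MODEL = "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4"

def build_chat_payload(prompt, tools=None, max_tokens=512):
    """Build an OpenAI-style chat completion request body."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if tools:
        # Only meaningful when the server was started with tool calling enabled.
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    return payload

def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Hypothetical tool definition, for illustration only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_chat_payload("What is the weather in Paris?", tools=[WEATHER_TOOL])
# reply = chat(payload)  # requires a running server
# print(reply["choices"][0]["message"].get("tool_calls"))
```

The network call is left commented out so the snippet can be read and run without a live server.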

Single GPU (H100 × 1)

The following settings run comfortably on a single H100:

vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --max-num-seqs 96 \
  --gpu-memory-utilization 0.93

💡 On a single 80 GB GPU, the KV cache is the main constraint. If vLLM reports "max_num_seqs exceeds available Mamba cache blocks", lower --max-num-seqs or reduce --max-model-len to free cache.
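
A rough back-of-the-envelope estimate shows why cache is tight on one GPU. It rests on simplifying assumptions: the on-disk weight sizes from the Memory Footprint table are treated as the GPU memory the weights occupy, GB and GiB are not distinguished, and activation memory and runtime overhead are ignored. The function name is hypothetical.

```python
def cache_headroom_gb(weights_gb, gpu_gb=80.0, utilization=0.93):
    """Memory vLLM may use, minus the weights: what remains for cache and activations."""
    return gpu_gb * utilization - weights_gb

# Weight sizes taken from the Memory Footprint table in this card.
for name, weights_gb in [
    ("Nota INT4", 69.49),
    ("Intel INT4", 76.71),
    ("Qwen Official INT4", 78.84),
]:
    print(f"{name}: {cache_headroom_gb(weights_gb):+.2f} GB")
```

Under these assumptions only this model leaves a positive budget at `--gpu-memory-utilization 0.93`, and that budget is a few gigabytes, so the sequence count and context length have to share it.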

📊 Benchmark Performance

| Model | MMLU-Pro (Knowledge) | AIME 24&25 (Math) | GPQA Diamond (STEM/Reasoning) | HumanEval (Coding) | BFCL-V3 (Agent) | Average |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B (BF16) | 86.42 | 93.33 | 85.35 | 94.51 | 95.00 | 90.92 |
| Intel INT4 | 85.97 | 91.67 | 82.32 | 93.90 | 93.33 | 89.44 (−1.63%) |
| Qwen Official INT4 | 85.92 | 93.33 | 84.34 | 89.63 | 93.42 | 89.33 (−1.75%) |
| ▶ Nota INT4 (this model) | 84.19 | 93.33 | 83.84 | 93.25 | 94.51 | 89.82 (−1.21%) |

Benchmarks: MMLU-Pro, AIME 2024 & 2025, GPQA Diamond, HumanEval, BFCL-V3. Percentages in parentheses are the relative drop in the average score versus the original Qwen3.5-122B-A10B (BF16). This model shows the smallest average drop (−1.21%) among the compressed variants while also being the smallest in size.
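
The averages and relative drops can be reproduced from the per-benchmark scores. The sketch below assumes the Average column is an unweighted mean of the five scores and that the percentage is the drop relative to the BF16 average, which matches the reported figures.

```python
# Per-benchmark scores copied from the table above, in column order.
SCORES = {
    "Qwen3.5-122B-A10B (BF16)": [86.42, 93.33, 85.35, 94.51, 95.00],
    "Intel INT4": [85.97, 91.67, 82.32, 93.90, 93.33],
    "Qwen Official INT4": [85.92, 93.33, 84.34, 89.63, 93.42],
    "Nota INT4": [84.19, 93.33, 83.84, 93.25, 94.51],
}

def average(values):
    """Unweighted mean of the benchmark scores."""
    return sum(values) / len(values)

def relative_drop_percent(model_avg, baseline_avg):
    """Drop of model_avg relative to baseline_avg, in percent."""
    return (baseline_avg - model_avg) / baseline_avg * 100

baseline = average(SCORES["Qwen3.5-122B-A10B (BF16)"])
for name, values in SCORES.items():
    avg = average(values)
    drop = relative_drop_percent(avg, baseline)
    print(f"{name}: avg {avg:.2f}, drop {drop:.2f}%, retained {100 - drop:.2f}%")
```

The "98.79% performance retained" headline is the complement of the 1.21% relative drop computed this way.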

💾 Memory Footprint

| Model | Weight Size (GB) | Reduction vs. BF16 |
|---|---|---|
| Qwen3.5-122B-A10B (BF16) | 250.17 | — |
| Intel INT4 | 76.71 | −69.34% |
| Qwen Official INT4 | 78.84 | −68.49% |
| ▶ Nota INT4 (this model) | 69.49 | −72.22% |

Weight Size is the on-disk size of the model tensors. Reduction is relative to the original Qwen3.5-122B-A10B (BF16, 250.17 GB).

Despite removing 15% of experts and quantizing to INT4, the model keeps the smallest average quality drop (−1.21%) among compressed variants while achieving the largest memory reduction (−72.22%, 3.6× smaller) — running on less than a third of the original footprint.

📝 Citation

If you use this model or write a paper based on it, please cite the underlying Nota quantization techniques:

@article{park2026vsa,
  title   = {Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models},
  author  = {Park, Hancheol and Lee, Geonho and Piao, Tairen and Kim, Tae-Ho},
  journal = {arXiv preprint arXiv:2606.05688},
  year    = {2026},
  url     = {https://arxiv.org/abs/2606.05688}
}

@inproceedings{park2026dreammoe,
  title     = {DREAM-MoE: Downstream Routing Error-Aware Margin-Preserving Quantization for Mixture-of-Experts Large Language Models},
  author    = {Park, Hancheol and Lee, Geonho and Kim, Tae-Ho},
  booktitle = {ICML 2026 Workshop on Adaptive Foundation Models (AdaptFM)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=Wyhqwjl51A}
}

This model is a compressed derivative of Qwen3.5-122B-A10B produced by Nota AI. Please also credit the original Qwen authors when using this model.

Made with ❤️ by Nota AI
