Qwen3.5-122B-A10B-NotaCompression-INT4
Nota AI compressed Qwen3.5-122B-A10B — a Mixture-of-Experts (MoE) LLM shrunk with MoE-aware INT4 quantization and global expert pruning, retaining near-original quality while running comfortably on a single H100.
250.17 GB → 69.49 GB (−72.22%) · 3.6× smaller
98.79% performance retained (avg. of 5 reasoning benchmarks)
📌 Highlights
- MoE-specialized quantization — INT4 weight quantization tuned for the MoE structure, minimizing accuracy loss on MoE layers. (Method (1) ↗, Method (2) ↗)
- Global expert-sensitivity pruning (15%) — instead of conventional uniform pruning that removes the same number of experts from every block, Nota measures a model-wide expert sensitivity score and prunes experts according to their true global importance. The most expendable experts are removed wherever they are, so blocks end up keeping different numbers of experts — far more favorable to quality preservation than uniform cuts.
- Runs on a single H100 — most INT4-only quantized MoE models on the Hub still cannot fit on one H100, but this compressed model serves on a single H100 (80 GB) — and scales to higher throughput / longer context on 2 GPUs.
- Quality retained: 98.79% of the BF16 baseline average is retained across 5 reasoning benchmarks, with per-benchmark drops ranging from 0 points (AIME 24&25) to 2.23 points (MMLU-Pro) across knowledge, math, reasoning, coding, and agentic tasks.
🧠 About Qwen3.5
Qwen3.5-122B-A10B is a large Mixture-of-Experts language model: it has ~122B total parameters but activates only ~10B per token by routing each token to a small subset of experts. This gives the capacity of a very large model at the inference cost of a much smaller one, with strong performance across reasoning, math, coding, and tool use.
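To make the routing idea concrete, here is a minimal sketch of top-k expert routing for a single token in plain Python. The expert count, the value of k, and the router logits are made-up illustrative values, not Qwen3.5's actual configuration, and renormalizing the softmax over the selected experts is one common convention assumed here.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    # Rank experts by router logit, highest first, and keep the top k.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (assumed convention).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical token: 8 experts available, 2 activated.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
selected = route_top_k(logits, k=2)
active_fraction = len(selected) / len(logits)
```

Only the selected experts run for this token, which is why the compute cost tracks the active parameter count rather than the total.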
This repository provides a compressed variant produced by Nota AI's compression pipeline.
🗜️ What Nota Compression Does
| Stage | Technique | Effect |
|---|---|---|
| Quantization | MoE-aware INT4 | Weights packed to 4-bit; expert layers quantized with MoE-specific calibration |
| Pruning | Global expert-sensitivity pruning, 15% removed | Experts removed by model-wide importance score, not a fixed per-block quota |
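As a rough illustration of what "packed to 4-bit" means, the sketch below quantizes one small group of weights with plain symmetric round-to-nearest and packs two 4-bit values per byte. This is a generic textbook scheme, not Nota's MoE-aware method (which is described in the cited papers); the function names, the weight values, and the nibble order are all assumptions.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization of one group to [-8, 7]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def pack_int4(q):
    """Pack pairs of signed 4-bit values into bytes, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit values."""
    vals = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            vals.append(nibble - 16 if nibble >= 8 else nibble)
    return vals[:count]

# Hypothetical weight group.
group = [0.70, -0.33, 0.12, -0.70, 0.04, 0.42]
q, scale = quantize_int4(group)
packed = pack_int4(q)                      # 6 weights stored in 3 bytes
restored = [v * scale for v in unpack_int4(packed, len(q))]
```

Each weight costs half a byte plus a share of the per-group scale, compared with two bytes in BF16.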
Unlike uniform pruning that removes a fixed number of experts from every block, Nota's method scores each expert by its global sensitivity across the whole model and removes only the most expendable ones. As a result different blocks retain a different number of experts — a non-uniform layout that preserves quality far better. The custom model file shipped here (see Patch vLLM) is required to support this non-uniform expert layout.
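The difference between the two strategies can be sketched with a toy example. The sensitivity scores below are made-up numbers, and the selection rule (drop the lowest-scoring experts model-wide) is an assumption for illustration; this card does not specify how Nota computes or applies its sensitivity score.

```python
def prune_uniform(scores, per_block):
    """Remove the `per_block` least sensitive experts from every block."""
    kept = []
    for block in scores:
        order = sorted(range(len(block)), key=lambda e: block[e])
        kept.append(sorted(order[per_block:]))
    return kept

def prune_global(scores, total_to_remove):
    """Remove the `total_to_remove` least sensitive experts model-wide."""
    ranked = sorted(
        (s, b, e) for b, block in enumerate(scores) for e, s in enumerate(block)
    )
    removed = {(b, e) for _, b, e in ranked[:total_to_remove]}
    return [
        [e for e in range(len(block)) if (b, e) not in removed]
        for b, block in enumerate(scores)
    ]

# Hypothetical scores: 3 blocks x 4 experts (higher = more important).
# Removing 3 of 12 experts here is for readability, not the 15% used above.
scores = [
    [0.90, 0.80, 0.70, 0.85],   # every expert matters
    [0.10, 0.05, 0.60, 0.75],   # two expendable experts
    [0.50, 0.15, 0.65, 0.95],   # one expendable expert
]
uniform = prune_uniform(scores, per_block=1)
global_kept = prune_global(scores, total_to_remove=3)
experts_per_block = [len(kept) for kept in global_kept]
```

With the same budget of three removals, the uniform rule is forced to drop an expert scoring 0.70 from the first block, while the global rule leaves that block intact and takes two experts from the second block instead. The resulting per-block counts differ, which is the non-uniform layout the patched model file has to handle.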
🚀 Usage
Environment
Install into a uv environment.
uv venv
uv pip install vllm==0.22.0
Required: vLLM 0.22.0
Patch vLLM (required)
This model uses a different number of experts per block. To support that layout, replace vLLM's model definition with the file provided in this repo:
cp patch/qwen3_5.py /path/to/vllm/model_executor/models/qwen3_5.py
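If you are unsure where `/path/to/vllm` is inside your environment, a small helper like the one below can find it and apply the copy. This is a sketch under assumptions: the function names are hypothetical, it assumes vLLM is installed as a regular package in the active environment, and keeping a `.orig` backup is an addition not mentioned in the instructions above.

```python
import importlib.util
import shutil
from pathlib import Path

def vllm_models_dir():
    """Return the models directory of the installed vLLM package, or None."""
    spec = importlib.util.find_spec("vllm")
    if spec is None or not spec.submodule_search_locations:
        return None  # vLLM is not installed in this environment
    package_root = Path(list(spec.submodule_search_locations)[0])
    return package_root / "model_executor" / "models"

def apply_patch(patch_file, models_dir, backup=True):
    """Copy patch_file over models_dir/qwen3_5.py, optionally backing up the original."""
    target = Path(models_dir) / "qwen3_5.py"
    if backup and target.exists():
        shutil.copy2(target, target.with_name(target.name + ".orig"))
    shutil.copy2(patch_file, target)
    return target

# Usage, run from the root of a local copy of this repo:
#   models_dir = vllm_models_dir()
#   if models_dir is not None:
#       apply_patch("patch/qwen3_5.py", models_dir)
```

Run it with the Python interpreter of the same uv environment that vLLM was installed into, so that the lookup resolves to the right copy.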
🖥️ Serving with vLLM
Standard (H100 × 2)
vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3
With tool calling
vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
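Once the server is up, it can be queried through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the Python standard library; the `get_weather` tool is a hypothetical example in the OpenAI function-calling schema, and the helper names and timeout are assumptions.

```python
import json
import urllib.request

MODEL = "nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4"

def build_chat_request(prompt, tools=None, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    if tools:
        payload["tools"] = tools
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=600):
    """Send the request and return the first choice's message as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]

# Hypothetical tool definition (not part of this repo).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request = build_chat_request("What is the weather in Seoul?", tools=[weather_tool])
# With the server running:
#   message = send(request)
#   print(message.get("tool_calls") or message.get("content"))
```

Tool calls are only parsed into structured `tool_calls` when the server was started with the tool-calling flags shown above; otherwise expect plain text in `content`.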
Single GPU (H100 × 1)
The following settings run comfortably on a single H100:
vllm serve nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4 \
--tensor-parallel-size 1 \
--max-model-len 65536 \
--max-num-seqs 96 \
--gpu-memory-utilization 0.93
💡 On a single 80 GB GPU, KV-cache is the main constraint. If you hit the error "max_num_seqs exceeds available Mamba cache blocks", lower --max-num-seqs or reduce --max-model-len to free cache.
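A back-of-the-envelope budget shows why cache is tight on one GPU. The sketch below assumes the in-memory weight footprint equals the on-disk weight size from the Memory Footprint table, treats all figures as plain GB, and ignores activations, CUDA graphs, and framework overhead, so read the result as an optimistic upper bound rather than a measurement.

```python
GPU_MEMORY_GB = 80.0            # single H100, as stated above
GPU_MEMORY_UTILIZATION = 0.93   # --gpu-memory-utilization in the single-GPU command
WEIGHT_SIZE_GB = 69.49          # weight size from the Memory Footprint table

def cache_budget_gb(gpu_gb, utilization, weights_gb):
    """Memory vLLM is allowed to use, minus weights: what remains for cache."""
    return gpu_gb * utilization - weights_gb

budget = cache_budget_gb(GPU_MEMORY_GB, GPU_MEMORY_UTILIZATION, WEIGHT_SIZE_GB)
print(f"~{budget:.2f} GB left for cache and everything else")
```

Under these assumptions only about 4.9 GB remains after the weights are loaded, which is why lowering the sequence count or the context length is the first thing to try when the cache check fails.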
📊 Benchmark Performance
| Model | MMLU-Pro (Knowledge) | AIME 24&25 (Math) | GPQA Diamond (STEM/Reasoning) | HumanEval (Coding) | BFCL-V3 (Agent) | Average |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B (BF16) | 86.42 | 93.33 | 85.35 | 94.51 | 95.00 | 90.92 |
| Intel INT4 | 85.97 | 91.67 | 82.32 | 93.90 | 93.33 | 89.44 (−1.63%) |
| Qwen Official INT4 | 85.92 | 93.33 | 84.34 | 89.63 | 93.42 | 89.33 (−1.75%) |
| ▶ Nota INT4 (this model) | 84.19 | 93.33 | 83.84 | 93.25 | 94.51 | 89.82 (−1.21%) |
Benchmarks: MMLU-Pro, AIME 2024 & 2025, GPQA Diamond, HumanEval, BFCL-V3. Percentages in parentheses are the average reduction relative to the original Qwen3.5-122B-A10B (BF16). This model shows the smallest average drop (−1.21%) among the compressed variants while being the smallest in size.
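The Average column and the percentages in parentheses can be reproduced from the per-benchmark scores in the table. The short script below does that arithmetic; the variable names are arbitrary, and the scores are copied from the table as-is.

```python
# Scores in table order: MMLU-Pro, AIME 24&25, GPQA Diamond, HumanEval, BFCL-V3.
scores = {
    "Qwen3.5-122B-A10B (BF16)": [86.42, 93.33, 85.35, 94.51, 95.00],
    "Intel INT4": [85.97, 91.67, 82.32, 93.90, 93.33],
    "Qwen Official INT4": [85.92, 93.33, 84.34, 89.63, 93.42],
    "Nota INT4 (this model)": [84.19, 93.33, 83.84, 93.25, 94.51],
}

def average(values):
    """Unweighted mean over the five benchmarks."""
    return sum(values) / len(values)

baseline = average(scores["Qwen3.5-122B-A10B (BF16)"])

summary = {}
for name, values in scores.items():
    avg = average(values)
    change_pct = (avg - baseline) / baseline * 100   # negative = drop vs. BF16
    summary[name] = (round(avg, 2), round(change_pct, 2))

# Share of the BF16 average that this model keeps.
retained_pct = round(average(scores["Nota INT4 (this model)"]) / baseline * 100, 2)
```

The percentage is the relative change of the average score, so 98.79% retained and a 1.21% drop are two views of the same number.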
💾 Memory Footprint
| Model | Weight Size (GB) | Reduction vs. BF16 |
|---|---|---|
| Qwen3.5-122B-A10B (BF16) | 250.17 | — |
| Intel INT4 | 76.71 | (−69.34%) |
| Qwen Official INT4 | 78.84 | (−68.49%) |
| ▶ Nota INT4 (this model) | 69.49 | (−72.22%) |
Weight Size is the on-disk size of the model tensors. Reduction is relative to the original Qwen3.5-122B-A10B (BF16, 250.17 GB).
Despite removing 15% of experts and quantizing to INT4, the model keeps the smallest average quality drop (−1.21%) among compressed variants while achieving the largest memory reduction (−72.22%, 3.6× smaller) — running on less than a third of the original footprint.
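The reduction percentages and the 3.6× figure follow directly from the weight sizes in the table. A minimal check, with sizes copied from the table and arbitrary variable names:

```python
BASELINE_GB = 250.17  # Qwen3.5-122B-A10B (BF16)

sizes_gb = {
    "Intel INT4": 76.71,
    "Qwen Official INT4": 78.84,
    "Nota INT4 (this model)": 69.49,
}

footprint = {}
for name, size in sizes_gb.items():
    reduction_pct = (1 - size / BASELINE_GB) * 100   # share of the baseline removed
    ratio = BASELINE_GB / size                       # "N times smaller"
    footprint[name] = (round(reduction_pct, 2), round(ratio, 1))

# Fraction of the original footprint that this model still occupies.
remaining_fraction = sizes_gb["Nota INT4 (this model)"] / BASELINE_GB
```

A remaining fraction of roughly 0.28 is what "less than a third of the original footprint" refers to.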
📝 Citation
If you use this model or write a paper based on it, please cite the underlying Nota quantization techniques:
@article{park2026vsa,
title = {Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models},
author = {Park, Hancheol and Lee, Geonho and Piao, Tairen and Kim, Tae-Ho},
journal = {arXiv preprint arXiv:2606.05688},
year = {2026},
url = {https://arxiv.org/abs/2606.05688}
}
@inproceedings{park2026dreammoe,
title = {DREAM-MoE: Downstream Routing Error-Aware Margin-Preserving Quantization for Mixture-of-Experts Large Language Models},
author = {Park, Hancheol and Lee, Geonho and Kim, Tae-Ho},
booktitle = {ICML 2026 Workshop on Adaptive Foundation Models (AdaptFM)},
year = {2026},
url = {https://openreview.net/forum?id=Wyhqwjl51A}
}
This model is a compressed derivative of Qwen3.5-122B-A10B produced by Nota AI. Please also credit the original Qwen authors when using this model.
Made with ❤️ by Nota AI
Model tree for nota-ai/Qwen3.5-122B-A10B-NotaCompression-INT4
Base model
Qwen/Qwen3.5-122B-A10B