Qwen3.5-35B-A3B: VKAE Accelerated
Ready-to-run, VKAE-accelerated serving of Qwen3.5-35B-A3B (35B-parameter Mixture-of-Experts, ~3B active). Ships as a self-contained container, with model weights and an optimized serving runtime in a single image, so anyone can reproduce the numbers on their own GPU.
VKAE (VIDRAFT Kernel Acceleration Engine) is VIDRAFT's proprietary inference-serving optimization. The acceleration recipe is withheld; only the reproducible results are published here.
Measured performance
NVIDIA B200, single GPU, FP8, same-harness before/after.
| Metric | Baseline | VKAE | Gain |
|---|---|---|---|
| Single-stream throughput | 25.7 tok/s | 601 tok/s | 23.4× |
| Peak aggregate (high concurrency) | n/a | ~10,516 tok/s | n/a |
| Output quality | reference | preserved | no degradation |
With realistic, varied content, single-stream throughput is around 455 tok/s. Accuracy is preserved end to end.
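The Gain column is the ratio of VKAE throughput to baseline throughput. The sketch below reproduces that arithmetic from the published figures. The `speedup` helper is illustrative and not part of this repo. The second call assumes the 25.7 tok/s baseline also applies to varied content, which this card does not state.

```shell
# Hypothetical helper: accelerated / baseline throughput, rounded to one decimal.
speedup() {
  awk -v fast="$1" -v base="$2" 'BEGIN { printf "%.1f\n", fast / base }'
}

speedup 601 25.7   # headline single-stream figure; prints 23.4
speedup 455 25.7   # varied-content figure, ASSUMING the same baseline; prints 17.7
```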
Quick start
docker pull vidraft/qwen35-vkae:601
docker run --gpus all -p 8000:8000 vidraft/qwen35-vkae:601
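The server may need some time to load the weights before it answers requests. The sketch below is a minimal readiness check to run from a second terminal. It assumes the server exposes the standard OpenAI `GET /v1/models` route on port 8000. The `wait_for_api` helper and its defaults are illustrative and not part of this repo.

```shell
# Hypothetical helper: poll a URL until it returns HTTP success or we give up.
# Arguments: URL, number of attempts, seconds to sleep between attempts.
wait_for_api() {
  url="${1:-http://localhost:8000/v1/models}"   # assumed OpenAI-style route
  tries="${2:-60}"
  pause="${3:-5}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: quiet, -f: treat HTTP errors (4xx/5xx) as failure
    if curl -sf "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  return 1
}
```

Calling `wait_for_api && echo ready` blocks for up to five minutes with the defaults shown and prints `ready` once the endpoint responds.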
The container serves an OpenAI-compatible API on port 8000; point any OpenAI client at http://localhost:8000/v1. A Blackwell (B200) or Hopper (H100/H200) class GPU is recommended.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen35-vkae","messages":[{"role":"user","content":"Hello!"}]}'
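To check single-stream throughput on your own GPU, you can time one request and divide the generated tokens by the elapsed seconds. The sketch below is a rough approximation and not the harness used for the published numbers. It assumes the response carries an OpenAI-style `usage.completion_tokens` field, that the server honors `max_tokens`, and that `date +%s.%N` is available (GNU coreutils). The helper names, `BASE_URL`, and the prompt are illustrative.

```shell
# Assumed endpoint; override with BASE_URL=... if the port is mapped elsewhere.
BASE_URL="${BASE_URL:-http://localhost:8000/v1}"

# Extract the completion_tokens integer from a chat-completions JSON response.
completion_tokens() {
  printf '%s' "$1" \
    | grep -o '"completion_tokens"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}

# tokens / seconds, rounded to one decimal place.
tok_per_sec() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

# Time one request end to end (includes prefill and HTTP overhead,
# so it understates pure decode speed) and print tokens per second.
measure() {
  start=$(date +%s.%N)
  resp=$(curl -s "$BASE_URL/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen35-vkae","messages":[{"role":"user","content":"Explain FP8 in one paragraph."}],"max_tokens":256}')
  end=$(date +%s.%N)
  elapsed=$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')
  tok_per_sec "$(completion_tokens "$resp")" "$elapsed"
}
```

Run `measure` a few times once the container is up and compare the results with the single-stream figures above. Short generations are dominated by fixed overhead, so longer outputs give a steadier reading.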
📦 Ready-to-use files in this repo:
Dockerfile, docker-compose.yml, run_docker.sh: pull-and-run, no build required.
Links
- Live acceleration leaderboard: VIDraft/vkae
- Docker image: hub.docker.com/r/vidraft/qwen35-vkae
- Collection: FINAL-Bench · VKAE Accelerated
Base model & license
Base weights: Qwen/Qwen3.5-35B-A3B (Apache-2.0), unmodified. This card documents VIDRAFT's accelerated serving of the model; the base model itself is unchanged. The VKAE acceleration method is proprietary and is not distributed in source form.