ZwZ-8B-VL-MLX-8bit
This is an 8-bit quantized MLX conversion of inclusionAI/ZwZ-8B, optimized for Apple Silicon inference using the MLX framework.
ZwZ-8B is a fine-grained multimodal perception model built on Qwen3-VL-8B, trained using Region-to-Image Distillation (R2I) combined with reinforcement learning. It achieves state-of-the-art fine-grained visual understanding in a single forward pass — no inference-time zooming or tool calling required.
The 8-bit variant offers higher fidelity than the 4-bit version at the cost of roughly 1.5x the memory, and may be preferable for tasks that need maximum accuracy on fine visual details.
Conversion Details
| Setting | Value |
|---|---|
| Source model | inclusionAI/ZwZ-8B |
| Conversion tool | mlx_vlm.convert (via mlx-vlm) |
| Quantization bits | 8-bit |
| Group size | 64 |
| Quantization method | Affine post-training quantization (PTQ) |
| Quant predicate | None (uniform quantization across all text/LLM layers) |
| DWQ / AWQ | Not used |
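As a rough worked example of what these settings mean for storage, the sketch below estimates the effective bits per weight for group-wise affine quantization. It assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias, and that the 4-bit variant also uses a group size of 64; neither assumption is stated in this card.

```python
def bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective bits per weight for group-wise affine quantization.

    Assumption: every group stores one scale and one bias, each `meta_bits` wide.
    """
    return bits + 2 * meta_bits / group_size


eight_bit = bits_per_weight(8, 64)
four_bit = bits_per_weight(4, 64)  # assumes the 4-bit variant also uses group size 64
ratio = eight_bit / four_bit

print(f"8-bit: {eight_bit} bits/weight")
print(f"4-bit: {four_bit} bits/weight")
print(f"ratio for quantized layers only: {ratio:.2f}x")
```

Under these assumptions the quantized layers alone differ by a little under 2x. The roughly 1.5x overall figure quoted above would be consistent with the unquantized modules and runtime buffers taking the same space in both variants, though that breakdown is an inference rather than a measurement.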
Quantized Layers
Only the language model / text decoder layers are quantized. The following module paths are excluded from quantization and remain at their original precision:
vision_model, vision_tower, vl_connector, sam_model, audio_model, audio_tower, code_predictor
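To make the two ideas in this section concrete, here is a minimal pure-Python sketch of (a) skipping modules by path and (b) affine quantization of a single group of 64 weights. It is an illustration only, not the code used by mlx-vlm or MLX: the helper names `should_quantize`, `quantize_group`, and `dequantize_group` are hypothetical, as are the example module paths.

```python
import math

# Module names listed above as excluded from quantization.
EXCLUDED = (
    "vision_model", "vision_tower", "vl_connector", "sam_model",
    "audio_model", "audio_tower", "code_predictor",
)


def should_quantize(module_path: str) -> bool:
    """Return False if any dotted path component is an excluded module name."""
    return not any(part in EXCLUDED for part in module_path.split("."))


def quantize_group(values, bits=8):
    """Affine-quantize one group: code = round((value - bias) / scale)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: value = code * scale + bias."""
    return [c * scale + bias for c in codes]


# One group of 64 synthetic weights (group size taken from the table above).
weights = [0.1 * math.sin(i) for i in range(64)]
codes, scale, bias = quantize_group(weights, bits=8)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(f"max round-trip error {max_err:.2e} (half a step is {scale / 2:.2e})")
print(should_quantize("language_model.layers.0.mlp.up_proj"))  # hypothetical path
print(should_quantize("vision_tower.blocks.0.attn.qkv"))       # hypothetical path
```

The round-trip error per weight is bounded by half a quantization step, which is why a higher bit width (a smaller step) preserves more detail.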
Performance
Benchmarked on Apple M2 Max, 96 GB unified memory.
Text Generation (mlx_vlm.generate)
| Metric | Value |
|---|---|
| Prompt tok/s | 48.9 |
| Generation tok/s | 37.0 |
| Peak memory | 11.65 GB |
Vision Inference by Resolution (vllm-mlx-bench)
| Resolution | Tok/s | Memory (GB) |
|---|---|---|
| 224×224 | 30.7 | 11.17 |
| 448×448 | 26.2 | 11.94 |
| 768×768 | 19.2 | 12.67 |
| 1024×1024 | 14.5 | 14.64 |
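For a sense of what these rates mean in practice, the sketch below converts the table rows into an estimated time for a 512-token response. It assumes the Tok/s column is generation throughput and ignores prompt and image processing time, so treat the results as rough lower bounds.

```python
# (resolution, tokens per second, memory in GB) copied from the table above
VISION_BENCH = [
    ("224×224", 30.7, 11.17),
    ("448×448", 26.2, 11.94),
    ("768×768", 19.2, 12.67),
    ("1024×1024", 14.5, 14.64),
]


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` tokens at a steady rate."""
    return tokens / tok_per_s


for resolution, tok_per_s, memory_gb in VISION_BENCH:
    estimate = seconds_for(512, tok_per_s)
    print(f"{resolution:>9}: {estimate:5.1f} s for 512 tokens, {memory_gb:.2f} GB")

# Relative slowdown between the smallest and largest resolution in the table.
slowdown = VISION_BENCH[0][1] / VISION_BENCH[-1][1]
print(f"224×224 is about {slowdown:.1f}x faster than 1024×1024")
```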
Validation
| Test | Status |
|---|---|
| Text generation | ✅ |
| Image + text generation | ✅ |
| vllm-mlx serving | ✅ |
Usage
Installation
pip install -U mlx-vlm
MLX-VLM CLI
python -m mlx_vlm.generate \
  --model swaylenhayes/ZwZ-8B-VL-MLX-8bit \
  --max-tokens 512 \
  --temperature 0.0 \
  --prompt "Describe this image in detail." \
  --image path/to/image.png
Python API
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "swaylenhayes/ZwZ-8B-VL-MLX-8bit"
model, processor = load(model_path)
config = load_config(model_path)

prompt = apply_chat_template(
    processor,
    config,
    "List every interactive UI element visible in this screenshot.",
    num_images=1,
)

output = generate(
    model,
    processor,
    prompt,
    image="path/to/screenshot.png",
    max_tokens=512,
    temperature=0.0,
)
print(output)
vLLM-MLX (OpenAI-compatible server)
vllm-mlx serve swaylenhayes/ZwZ-8B-VL-MLX-8bit --host 127.0.0.1 --port 8108

curl http://127.0.0.1:8108/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Reply with OK"}],
    "max_tokens": 16
  }'
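The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the server returns the usual OpenAI chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="default", max_tokens=16):
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt, **kwargs):
    """Send the request and return the reply text.

    Assumes an OpenAI-style response body; requires the server to be running.
    """
    request = build_chat_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server, so it is safe to inspect.
request = build_chat_request("http://127.0.0.1:8108/v1", "Reply with OK")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server from the command above running, `chat("http://127.0.0.1:8108/v1", "Reply with OK")` should return the model's reply as a string.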
About ZwZ (Zooming without Zooming)
ZwZ transforms zooming from an inference-time tool into a training-time primitive:
- Zoom in to micro-cropped regions and let strong teacher models (Qwen3-VL-235B, GLM-4.5V) generate high-quality VQA data
- Distill this region-grounded supervision back to the full image with explicit bounding-box overlays
- Reinforce via RL training to enable single-glance fine-grained perception
This makes ZwZ particularly well-suited for tasks requiring fine visual detail recognition, such as UI screenshot parsing, document analysis, and dense image understanding.
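The data-construction idea described in this section (zoom into a region, have a teacher answer on the crop, then attach that answer to the full image with the box drawn on it) can be sketched with a toy example. This is not the authors' pipeline: images are plain nested lists, `teacher_answer` is a hypothetical stand-in for a teacher model call, and the reinforcement learning stage is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop(image, box):
    """Zoom in: cut the region out of the full image."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def overlay_box(image, box, marker=255):
    """Copy the full image and draw the box outline on the copy."""
    out = [list(row) for row in image]
    for x in range(box.left, box.right):
        out[box.top][x] = marker
        out[box.bottom - 1][x] = marker
    for y in range(box.top, box.bottom):
        out[y][box.left] = marker
        out[y][box.right - 1] = marker
    return out


def teacher_answer(region, question):
    """Hypothetical placeholder for a teacher model answering on the crop."""
    return f"answer derived from a {len(region)}x{len(region[0])} crop"


full_image = [[0] * 8 for _ in range(8)]  # toy 8x8 grayscale image
box = Box(left=2, top=2, right=6, bottom=6)
question = "What does the small label say?"

# The training sample pairs the full image (with the box drawn on it)
# with the answer the teacher produced from the zoomed-in region.
sample = {
    "image": overlay_box(full_image, box),
    "question": question,
    "answer": teacher_answer(crop(full_image, box), question),
}
print(sample["answer"])
```

The point of the sketch is the pairing: the student sees the whole image plus a box, while the supervision comes from a view where the detail was easy to read.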
Links
- Original model: inclusionAI/ZwZ-8B
- Base architecture: Qwen/Qwen3-VL-8B-Instruct
- Paper: Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
- Project: github.com/inclusionAI/Zooming-without-Zooming
- Training data: inclusionAI/ZwZ-RL-VQA
- MLX framework: github.com/ml-explore/mlx
- mlx-vlm: github.com/Blaizzy/mlx-vlm
Other Quantizations
| Variant | Link |
|---|---|
| ZwZ-8B MLX 4-bit | swaylenhayes/ZwZ-8B-VL-MLX-4bit |
| ZwZ-8B MLX 8-bit | this model |
| ZwZ-4B MLX 4-bit | swaylenhayes/ZwZ-4B-VL-MLX-4bit |
| ZwZ-4B MLX 8-bit | swaylenhayes/ZwZ-4B-VL-MLX-8bit |
Notes and Limitations
- Quantization changes numerical behavior relative to full-precision weights. Performance may differ from the original model on edge cases.
- Throughput and memory depend on prompt length, image resolution, and runtime settings.
- Benchmark numbers reflect a quiet system with no other models loaded.
Citation
@article{wei2026zooming,
title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
journal={arXiv preprint arXiv:2602.11858},
year={2026}
}
License
Apache 2.0 — follows the license of the original ZwZ and Qwen3-VL models.