qwen-vlmbench: which Qwen VLM is the best zero-shot image→JSON labeler?
I needed to turn a large pile of images into clean, schema-conformant JSON — classification labels, bounding boxes, transcriptions, segmentation polygons, spatial relations, data-format parses — with no human in the loop and, ideally, no finetuning. The plan was to point a vision-language model at a million images and trust the output.
The only question was which model. So I built a benchmark, ran 12 Qwen VLMs across 15 vision tasks on a single RTX 6000 Pro (96 GB), and scored every model on how robustly and correctly it emits JSON. This post walks through the system, how to run it, and what the full grid says.
Preface
These results need more testing and more data before they can settle which model is genuinely better than another: the sample sizes per task were small.
Every image costs a full model inference per request, which makes this approach highly inefficient. A proper processor, wrapper, and substructured tiny LLM would be considerably more effective, but the data to train such a tiny LLM is sparse at best. Over time I'll be building a war chest of important elemental systems dedicated to useful prompt-to-data-type processing through semi-intelligent and intelligent means.
TL;DR
- Qwen3.5-27B is the quality ceiling (mean effective yield 0.549, labeler score 0.553), but it's slow (~19 tok/s).
- Qwen3.5-35B-A3B ties it on quality (0.540 / 0.541) while running ~40% faster (~27 tok/s) thanks to its MoE, so it's the better practical quality pick.
- Qwen3-VL-4B is the volume champion: 95% of the 27B's labeler score at ~3.8× the throughput (73 tok/s). At 4B it beats every Qwen3.5 model up through 9B.
- The Qwen3-VL line is far more parameter-efficient than Qwen3.5. Within the Qwen3.5 dense line scale helps roughly linearly (4B→9B→27B = 0.42→0.49→0.55 mean yield), but VL-4B (0.50) already matches Qwen3.5-9B.
- Constrained decoding is a real validity lever, not a magic wand: a grammar lifts the schema-valid rate by 2–27 points (most for the weakest models), but it tops out at ~85–91% on the capable models because the hardest categories produce long outputs that still truncate.
- Route per task for max quality — segmentation/grounding want a Qwen3-VL model; classification, depth, OCR, and VQA want a Qwen3.5 model.
Everything is reproducible from one CLI command, with 102 CPU-only tests covering the scoring logic.
The problem: "valid JSON" is the hard part
A VLM that describes an image is easy. A VLM that emits exactly this schema, every time, with coordinates in the right space and no markdown fences is a different problem. In practice three things go wrong, and each one silently corrupts a labeling run:
- The model wraps output in ```json fences, prepends prose, or runs past the closing brace.
- The model substitutes its own schema: ask Qwen3-VL for `{"box": [...]}` and you get its native `{"bbox_2d": [...]}`; ask for an object and you get a bare array.
- Coordinates live in a different space than your ground truth. Qwen3-VL emits boxes in `0..1000` relative units; COCO is in absolute pixels. Compared naively, every IoU is wrong.
A useful benchmark has to measure all three, not just "is the answer right." That shaped the metric.
The metric: effective yield
The headline per-category number is effective yield = task_accuracy × schema_valid_rate — the fraction of all images that get a correct AND valid label.
This avoids a real trap. If you score accuracy only over the valid outputs, you reward a model for failing: a model that emits invalid JSON on the hard images and valid JSON only on the easy ones gets graded on the easy subset and looks better than it is. Counting an invalid output as a miss fixes it — and it matches reality, where an unparseable row is a lost row. (This rule applies all the way up: when a model emits zero valid outputs for a whole category, that category's effective yield is 0, and it counts as 0 in the model's mean — not omitted.)
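To make the trap concrete, here is a minimal sketch with made-up numbers. The `Sample` class, both function names, and the ten-image category are hypothetical and only illustrate the arithmetic; correctness is treated as binary here, whereas the real metrics (IoU, F1) are graded.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    schema_valid: bool  # did the output validate against the schema?
    correct: bool       # was the label right (only meaningful when valid)?


def accuracy_over_valid(samples):
    """The trap: grade only the outputs that parsed."""
    valid = [s for s in samples if s.schema_valid]
    if not valid:
        return 0.0
    return sum(s.correct for s in valid) / len(valid)


def effective_yield(samples):
    """Fraction of ALL samples that are both valid and correct."""
    if not samples:
        return 0.0
    return sum(s.schema_valid and s.correct for s in samples) / len(samples)


# Hypothetical category: 10 images, valid JSON on 6, and 5 of those 6 correct.
samples = (
    [Sample(True, True)] * 5
    + [Sample(True, False)] * 1
    + [Sample(False, False)] * 4
)

print(round(accuracy_over_valid(samples), 3))  # 0.833
print(round(effective_yield(samples), 3))      # 0.5
```

The second number is the same thing as task_accuracy × schema_valid_rate (0.833 × 0.6), which is why a model that fails on the hard images cannot hide behind the easy ones.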
Two more signals run on every category:
- schema-valid rate — does the output validate against the per-category Pydantic model (after a never-raises recovery walk that strips fences and finds the first balanced object)?
- JSON robustness — did it parse without structural repair? A clean object wrapped in a fence is fine (fence-stripping is deterministic); an object buried in prose is not. These are kept separate so a benign fence isn't punished like a real failure.
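The recovery walk and the robustness flag can be sketched roughly as follows. This is not the package's code: `strip_fence`, `first_balanced_object`, and `recover` are names I made up for illustration, and the sketch only handles a top-level object.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the listing clean


def strip_fence(text):
    """Remove one leading fence line and its closing fence, if present."""
    stripped = text.strip()
    if not stripped.startswith(FENCE):
        return stripped, False
    body = stripped.splitlines()[1:]          # drop the opening fence line
    if body and body[-1].strip() == FENCE:
        body = body[:-1]                      # drop the closing fence line
    return "\n".join(body).strip(), True


def first_balanced_object(text):
    """Return the first balanced {...} substring, or None. String-aware."""
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # the object never closed: truncated output


def recover(raw):
    """Never raises. Returns (parsed_dict_or_None, needed_structural_repair)."""
    text, _had_fence = strip_fence(raw)       # a fence alone is not a repair
    try:
        obj = json.loads(text)
        if isinstance(obj, dict):
            return obj, False
    except (json.JSONDecodeError, RecursionError):
        pass
    candidate = first_balanced_object(text)
    if candidate is None:
        return None, True
    try:
        obj = json.loads(candidate)
    except (json.JSONDecodeError, RecursionError):
        return None, True
    return (obj, True) if isinstance(obj, dict) else (None, True)
```

A fenced but otherwise clean object comes back with the repair flag off, while an object dug out of surrounding prose comes back with it on, which is the distinction the two signals are meant to preserve.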
A composite labeler_score then multiplies accuracy by validity and robustness, so an accurate model
that emits fragile JSON ranks below a slightly-less-accurate model that emits clean JSON — the correct
ordering when you're trusting output blind. The leaderboard ranks by labeler_score; the per-category grid
shows raw effective yield. On this run the two agree on the podium.
The system
The benchmark is a small Python package (qwen_test_runner.vision) with a registry-driven core.
One registry entry per category produces three artifacts. Each category declares a tiny field registry, and the same code generation turns it into a Pydantic model (validation), a JSON Schema (tool-use), and a GBNF grammar (constrained decoding) — all at once. Adding a task is one dict entry; nothing hardcodes a field name.
```python
# a category is just data:
_BBOX = VisionTaskSpec(
    category="bbox_grounding",
    fields={
        "detections": _list_of("detections",
            _f("label", optional=False),
            _f("box", value_kind="bbox", optional=False),
            _f("score", value_kind="number", optional=False, number_range=(0.0, 1.0))),
        "count": _f("count", value_kind="integer", optional=False),
    },
    system_prompt="... output ONLY a raw JSON object ... {coord_hint} ...",
    metric="detection",
    coord_space=CoordSpace.NORM_0_1000,
    gt_dataset="coco_detection",
)
```
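To show the "one registry, several artifacts" idea in isolation, here is a deliberately simplified stdlib sketch. It is not the package's generator: `to_json_schema`, `to_gbnf`, and the flat `{name: kind}` registry are hypothetical, nested lists and bbox kinds are left out, and the Pydantic artifact is skipped to keep the example dependency-free.

```python
KIND_TO_SCHEMA = {"string": "string", "integer": "integer", "number": "number"}

# Terminal rules shared by every generated grammar (simplified on purpose:
# no string escapes, no exponents).
GBNF_TERMINALS = r'''
string  ::= "\"" [^"\\]* "\""
integer ::= "-"? [0-9]+
number  ::= "-"? [0-9]+ ("." [0-9]+)?
ws      ::= [ \t\n]*
'''.strip()


def to_json_schema(fields):
    """Build a JSON Schema object from a flat {field_name: kind} registry."""
    return {
        "type": "object",
        "properties": {
            name: {"type": KIND_TO_SCHEMA[kind]} for name, kind in fields.items()
        },
        "required": list(fields),
        "additionalProperties": False,
    }


def to_gbnf(fields):
    """Build a GBNF grammar that admits exactly one object shape."""
    # Keys are emitted in registry order, so no other key order can be decoded.
    members = [f'"\\"{name}\\"" ws ":" ws {kind}' for name, kind in fields.items()]
    body = ' ws "," ws '.join(members)
    root = f'root ::= "{{" ws {body} ws "}}"'
    return root + "\n" + GBNF_TERMINALS


fields = {"label": "string", "count": "integer"}
schema = to_json_schema(fields)
grammar = to_gbnf(fields)
print(grammar.splitlines()[0])
```

Because both artifacts are derived from the same dict, a field added to the registry shows up in validation and in the grammar together, with no second place to forget it.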
Constrained decoding via xgrammar. Each category's GBNF grammar is compiled once and applied as a logits processor, so a bare array, a code fence, or a wrong-shape object becomes structurally impossible. This is the difference between "the model usually emits valid JSON" and "the output matches the grammar by construction" — with one honest limit, covered in the results: the grammar constrains shape, but a model can still hit the token budget mid-object on the long-output categories (polygons, 3D boxes), so constrained validity lands high but not at 100%.
A coordinate-normalization layer. Predictions and ground truth are converted to a single canonical form (pixel-absolute xyxy) before any IoU, with the model's space declared per category. This is the piece that makes detection/segmentation numbers trustworthy across models that use different conventions.
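A minimal sketch of that normalization, assuming a 640×480 image and a box pair I invented for the example. The function names are hypothetical; the only outside fact relied on is that COCO annotations store boxes as `[x, y, width, height]` in pixels.

```python
def norm1000_to_pixels(box, width, height):
    """Convert [x1, y1, x2, y2] in 0..1000 relative units to absolute pixels."""
    x1, y1, x2, y2 = box
    return [x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height]


def coco_xywh_to_xyxy(box):
    """COCO stores [x, y, width, height] in pixels; convert to corner form."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def iou_xyxy(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


width, height = 640, 480
prediction = [250, 250, 750, 750]              # model output, 0..1000 units
truth = coco_xywh_to_xyxy([160, 120, 320, 240])  # ground truth, pixels

naive = iou_xyxy(prediction, truth)
fixed = iou_xyxy(norm1000_to_pixels(prediction, width, height), truth)
print(round(naive, 3), fixed)  # 0.084 1.0
```

The prediction is a perfect box, yet the naive comparison scores it at roughly 0.08, well under the IoU@0.5 threshold, so it would have been counted as a miss.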
Tolerant label matching. A synonym map plus plural/word-containment so a model that says "television" isn't penalized against COCO's "tv", or "spaghetti" against food-101's "spaghetti bolognese". The richer label is arguably better; it just isn't a string match.
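A rough sketch of what such a matcher can look like. The synonym entries, the crude plural rule, and the `labels_match` name are all illustrative assumptions, not the harness's actual tables.

```python
SYNONYMS = {"television": "tv", "sofa": "couch"}  # illustrative entries only


def _canon(label):
    """Lowercase, collapse whitespace, and apply the synonym map."""
    text = " ".join(label.lower().strip().split())
    return SYNONYMS.get(text, text)


def _singular(word):
    """Deliberately crude plural handling: "dogs" -> "dog", "glass" stays."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def labels_match(pred, truth):
    p = [_singular(w) for w in _canon(pred).split()]
    t = [_singular(w) for w in _canon(truth).split()]
    if p == t:
        return True
    # Word containment: every word of the shorter label appears in the longer.
    short, long_ = (p, t) if len(p) <= len(t) else (t, p)
    return bool(short) and all(w in long_ for w in short)


print(labels_match("television", "tv"))                  # True
print(labels_match("spaghetti", "spaghetti bolognese"))  # True
print(labels_match("cat", "dog"))                        # False
```

Word containment is the permissive part: under this sketch "dog" would also match "hot dog", so a real matcher likely needs a small deny-list or a per-dataset label vocabulary to keep that in check.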
Ground truth: real where it's clean, synthetic where it isn't. Detection (COCO), VQA (VQAv2), classification (food-101), and OCR (TextVQA) come from public parquet datasets. The geometric tasks — spatial relations, depth ordering, subject fixation, segmentation, outline, 3D, camera — use self-contained synthetic scenes (colored shapes, rendered data-format screenshots) with exact ground truth and zero download. It isn't natural-image difficulty, but it isolates each capability cleanly and runs anywhere.
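The appeal of synthetic scenes is that the ground truth falls out of the scene description itself. As a sketch under my own assumptions (the relation names, the `min_gap` tolerance, and the `spatial_triples` helper are hypothetical), exact spatial-relation triples can be derived from shape centers like this:

```python
def spatial_triples(centers, min_gap=1.0):
    """Derive (subject, relation, object) triples from shape centers.

    centers: {name: (cx, cy)} in pixels, with y growing downward.
    min_gap: centers closer than this on an axis yield no relation on it.
    """
    triples = []
    for a in sorted(centers):
        for b in sorted(centers):
            if a == b:
                continue
            (ax, ay), (bx, by) = centers[a], centers[b]
            if bx - ax >= min_gap:
                triples.append((a, "left_of", b))
            elif ax - bx >= min_gap:
                triples.append((a, "right_of", b))
            if by - ay >= min_gap:
                triples.append((a, "above", b))
            elif ay - by >= min_gap:
                triples.append((a, "below", b))
    return triples


scene = {"red_circle": (100, 200), "blue_square": (300, 200)}
print(spatial_triples(scene))
# [('blue_square', 'right_of', 'red_circle'), ('red_circle', 'left_of', 'blue_square')]
```

Since the same dict that drives the renderer also drives the labels, there is no annotation noise to separate from model error.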
Running it
```bash
pip install -e ".[vlmbench]"

# offline sanity check (no GPU, no download):
qwen-vlmbench --runner stub --dataset smoke \
  --categories image_classification bbox_grounding ocr_text

# the full array (one command, durable + resumable):
qwen-vlmbench --runner vlm --dataset full --n 12 \
  --modes json_mode constrained --clear-cache-after-model \
  --models qwen3.5-2b qwen3.5-4b qwen3.5-9b qwen3.5-27b qwen3.5-35b-a3b \
           qwen3vl-2b qwen3vl-4b qwen3vl-8b qwen3vl-32b qwen3vl-30b-a3b \
  --categories <any of the 15>
```
What the run writes
The orchestrator iterates model-outer and streams results to disk, so a multi-hour run survives a disconnect and resumes exactly where it stopped:
- `results.jsonl`: one row per scored sample: `{model, reasoning, category, mode, image_id, primary_score, schema_valid, json_robust, raw_text, tokens_per_sec}`. Resume keys on `(model, reasoning, category, mode, image_id)`, so re-running the command skips completed work.
- `metrics.jsonl`: per `(model, category, mode)` aggregates: `primary_score_mean`, `schema_valid_rate`, `json_robustness`, `tokens_per_sec`. This file is what the tables below are built from.
- `leaderboard.md` / `summary.json` / `summary.csv`: the cross-model ranking, including the native-vs-constrained validity gap and the ship/finetune bucket per model.
- `run.log`: a human-readable trace; each model's weights are deleted from the HF cache after it finishes (`--clear-cache-after-model`) so a 12-model sweep peaks at one model on disk, not all twelve.
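The resume behaviour boils down to an append-only JSONL file plus a set of completed keys. The sketch below is my own reconstruction of that pattern, not the orchestrator's code; `load_done_keys`, `append_row`, and `run` are hypothetical names, and the key fields are assumed to hold scalar values.

```python
import json
import os

KEY_FIELDS = ("model", "reasoning", "category", "mode", "image_id")


def load_done_keys(path):
    """Collect the resume key of every row already on disk."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # a half-written last line after a crash is skipped
            done.add(tuple(row.get(k) for k in KEY_FIELDS))
    return done


def append_row(path, row):
    """Append one scored sample and sync, so a crash loses at most one row."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def run(path, work_items, score_fn):
    """Score every work item whose key is not already in the results file."""
    done = load_done_keys(path)
    for item in work_items:
        key = tuple(item.get(k) for k in KEY_FIELDS)
        if key in done:
            continue
        append_row(path, {**item, **score_fn(item)})
        done.add(key)
```

Calling `run` twice with the same work list scores each item once; the second call reads the keys back and skips everything that is already on disk.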
The 15 tasks
| task | what it probes | ground truth | metric |
|---|---|---|---|
| image_classification | object/scene recognition | food-101 | top-1 (tolerant) |
| bbox_grounding | localization + grounded counting | COCO | mAP / IoU@0.5 F1 |
| ocr_text | reading + transcription | TextVQA | answer containment |
| data_type_differentiation | recognize a rendered format | synthetic | exact format match |
| data_type_utilization | re-serialize to JSON | synthetic | key/value F1 |
| structural_spatial_awareness | left/right/above/below | synthetic | triple F1 |
| depth_analysis | relative depth ordering | synthetic | pairwise order acc |
| subject_fixation | primary salient subject | synthetic | IoU + label |
| segmentation | instance masks (polygons) | synthetic | mIoU |
| outline_association | object outline polygon | synthetic | polygon IoU |
| geometric_3d_object_id | 3D boxes | synthetic | 3D center match |
| camera_rotational_offset | camera roll | synthetic | acc@30° |
| semantic_association | entity relations | synthetic | triple F1 |
| vit_accuracy_to_prompt | grounded VQA | VQAv2 | answer match |
| style_structural_awareness | style + structure | synthetic | style accuracy |
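Two of the tasks are scored with a triple F1. As a sketch of what that means, assuming triples are compared by exact match as a set (the harness may well match more tolerantly), with a hypothetical `triple_f1` helper:

```python
def triple_f1(predicted, truth):
    """F1 over (subject, relation, object) triples, matched exactly as sets."""
    pred = {tuple(t) for t in predicted}
    gold = {tuple(t) for t in truth}
    if not pred and not gold:
        return 1.0  # nothing to find and nothing claimed
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [("cup", "left_of", "plate"), ("plate", "right_of", "cup"), ("cup", "above", "fork")]
pred = [("cup", "left_of", "plate"), ("plate", "right_of", "cup")]
print(round(triple_f1(pred, gold), 3))  # 0.8
```

Here precision is 1.0 and recall is 2/3, so a model that stays silent on a relation is penalized less than one that invents a wrong triple in its place.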
Results: the full 12-model ranking
Ranked by labeler_score (accuracy × validity × robustness, native json-mode). Mean yield is mean
effective yield over all 15 categories with a fully-failed category counted as 0. native→constr valid
shows the schema-valid rate in json-mode and the lift constrained decoding adds. N=12 per category.
| rank | model | params (active) | labeler | mean yield | native-valid | constr-valid | tok/s | bucket |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3.5-27B | 27B | 0.553 | 0.549 | 82.8% | 87.2% | ~19 | finetune-candidate |
| 2 | Qwen3.5-35B-A3B | 35B (3B) MoE | 0.541 | 0.540 | 78.9% | 85.0% | ~27 | finetune-candidate |
| 3 | Qwen3-VL-4B | 4B | 0.524 | 0.496 | 84.4% | 86.6% | ~73 | finetune-candidate |
| 4 | Qwen3-VL-32B | 32B | 0.521 | 0.502 | 80.6% | 87.3% | ~18 | finetune-candidate |
| 5 | Qwen3-VL-8B | 8B | 0.502 | 0.507 | 82.8% | 89.5% | ~57 | finetune-candidate |
| 6 | Qwen3-VL-30B-A3B | 30B (3B) MoE | 0.500 | 0.459 | 78.3% | 86.6% | ~36 | finetune-candidate |
| 7 | Qwen3.5-4B | 4B | 0.487 | 0.421 | 76.1% | 85.5% | ~49 | finetune-candidate |
| 8 | Qwen3.5-9B | 9B | 0.461 | 0.486 | 79.4% | 87.2% | ~48 | finetune-candidate |
| 9 | Qwen3-VL-2B | 2B | 0.391 | 0.348 | 82.8% | 91.1% | ~89 | insufficient |
| 10 | Qwen3.5-2B | 2B | 0.300 | 0.251 | 71.1% | 86.7% | ~66 | insufficient |
| 11 | Qwen3.5-0.8B | 0.8B | 0.139 | 0.126 | 58.3% | 76.1% | ~68 | insufficient |
| 12 | Qwen3.5-0.8B (finetuned captioner) | 0.8B | 0.123 | 0.082 | 46.7% | 73.9% | ~70 | insufficient |
Two things to read off this table before the per-category detail:
- The native→constr columns are the constrained-decoding story, quantified. The grammar lifts the schema-valid rate everywhere, but the lift is small for the strong models (VL-4B +2.2, 27B +4.4) and large for the weak ones (Qwen3.5-2B +15.6, the captioner +27.2). And it never reaches 100% — even the best constrained validity is ~91% — because the long-output categories truncate. Constraining fixes shape, not length.
- Every model is bucketed `finetune-candidate` or `insufficient`, because none clears the harness's 90% native-robustness bar across all 15 categories at once. That's the 2–3 genuinely hard categories (3D, segmentation) dragging the aggregate down, not a claim that the models can't emit JSON. On the other 12 categories validity is high.
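The lift figures quoted above are plain differences between the two validity columns. A short check, with the percentages copied from the ranking table (the `lift` helper is just illustrative):

```python
# (native-valid %, constrained-valid %) copied from the ranking table.
VALIDITY = {
    "Qwen3.5-27B": (82.8, 87.2),
    "Qwen3-VL-4B": (84.4, 86.6),
    "Qwen3-VL-2B": (82.8, 91.1),
    "Qwen3.5-2B": (71.1, 86.7),
    "Qwen3.5-0.8B (finetuned captioner)": (46.7, 73.9),
}


def lift(native, constrained):
    """Percentage points of schema validity added by constrained decoding."""
    return round(constrained - native, 1)


lifts = {name: lift(*pair) for name, pair in VALIDITY.items()}
for name, points in sorted(lifts.items(), key=lambda kv: kv[1]):
    print(f"{name}: +{points} points")
```

Sorted this way, the weakest native emitters sit at the bottom of the printout with the largest gains, which is the pattern described in the first bullet.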
Per-category effective yield
The 10 native-capable candidates, every category. (The two 0.8B baselines rank last above and are omitted
here — their rows are sparse, with several categories producing zero valid output.) n/a = the model
emitted zero schema-valid outputs for that category; in the mean it counts as 0. Bold = the best model in
that row.
| category | Q3.5-2B | Q3.5-4B | Q3.5-9B | Q3.5-27B | Q3.5-35B-A3B | VL-2B | VL-4B | VL-8B | VL-32B | VL-30B-A3B |
|---|---|---|---|---|---|---|---|---|---|---|
| classification | 0.00 | 0.33 | 0.42 | 0.42 | **0.50** | 0.08 | 0.00 | 0.17 | 0.33 | 0.42 |
| bbox_grounding | 0.16 | 0.08 | 0.25 | 0.18 | 0.20 | 0.27 | **0.30** | 0.23 | 0.18 | 0.19 |
| ocr_text | 0.18 | 0.42 | **0.50** | 0.33 | 0.42 | 0.42 | 0.42 | 0.34 | 0.33 | 0.42 |
| data_type_diff | 0.17 | 0.67 | 0.83 | **1.00** | 0.92 | 0.67 | 0.67 | 0.75 | **1.00** | 0.58 |
| data_type_util | 0.17 | 0.08 | 0.08 | 0.17 | 0.17 | 0.00 | **0.67** | 0.08 | 0.17 | 0.17 |
| spatial relations | 0.76 | 0.00 | 0.83 | 0.71 | 0.81 | 0.65 | 0.80 | 0.82 | 0.65 | **1.00** |
| depth ordering | 0.17 | 0.81 | **1.00** | **1.00** | **1.00** | 0.14 | 0.50 | 0.81 | **1.00** | **1.00** |
| subject_fixation | n/a | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** |
| segmentation | 0.00 | 0.00 | 0.00 | 0.39 | 0.16 | 0.01 | 0.42 | **0.51** | 0.22 | 0.00 |
| outline | 0.00 | **0.33** | 0.00 | 0.17 | 0.00 | 0.00 | 0.00 | **0.33** | n/a | 0.12 |
| 3D object id | n/a | n/a | 0.00 | 0.00 | n/a | 0.00 | n/a | **0.03** | 0.00 | n/a |
| camera roll | 0.53 | 0.56 | 0.78 | **0.81** | **0.81** | 0.75 | 0.75 | 0.69 | 0.58 | 0.67 |
| VQA | 0.75 | 0.92 | 0.50 | 0.92 | **1.00** | 0.25 | 0.58 | 0.67 | 0.92 | 0.50 |
| semantic relations | 0.46 | 0.78 | 0.77 | 0.81 | 0.78 | 0.56 | 0.75 | 0.60 | **0.82** | 0.64 |
| style | 0.42 | 0.33 | 0.33 | 0.33 | 0.33 | 0.42 | **0.58** | **0.58** | 0.33 | 0.17 |
| mean (15, n/a=0) | 0.251 | 0.421 | 0.486 | **0.549** | 0.540 | 0.348 | 0.496 | 0.507 | 0.502 | 0.459 |
What the grid says
Scale helps — but architecture helps more. Inside the Qwen3.5 dense line, quality climbs roughly linearly with size: 4B → 9B → 27B is 0.421 → 0.486 → 0.549 mean yield. So if you've committed to Qwen3.5, bigger genuinely is better. But the Qwen3-VL line gets there far cheaper: VL-4B (0.496) already matches Qwen3.5-9B (0.486) and beats Qwen3.5-4B (0.421) by 0.075 — at the same or smaller parameter count. The grounding-heavy training of the VL line is worth more than a couple of billion extra parameters.
The quality ceiling is ~0.55, and two models share it. Qwen3.5-27B (0.549) is the top scorer, but Qwen3.5-35B-A3B ties it within noise (0.540) and decodes ~40% faster (~27 vs ~19 tok/s) because only ~3B of its parameters are active per token. For a quality-first pipeline that still needs throughput, the MoE is the pick; the dense 27B is the pick only if you want the single highest number.
Qwen3-VL-4B is the volume winner. It reaches 95% of the 27B's labeler score (0.524 vs 0.553) and ~97%
of the 35B-A3B's (0.524 vs 0.541), at **73 tok/s — 3.8× the 27B's throughput and 2.7× the 35B-A3B's.** It's
also the most balanced small model: it's the only sub-8B model that posts a real number on both
data-format→JSON (0.67) and segmentation (0.42), while staying strong on the language tasks. For labeling at
scale, this is the one.
The two 0.8B models are genuinely insufficient (0.126 and 0.082). The finetuned captioner is worse on this generic benchmark than the base 0.8B — because it was tuned to its own schema, so it reliably emits the wrong shape here, and its constrained validity (73.9%) is the lowest in the fleet. That's the expected cost of a narrow finetune evaluated off-distribution. It says "a schema-specific finetune doesn't generalize to other schemas" — not that finetuning is useless; a finetune matched to your schema is a different experiment this run doesn't cover.
Capability is task-specific, so route per task. No single model wins everywhere:
| task | best model | yield |
|---|---|---|
| classification | Qwen3.5-35B-A3B | 0.50 |
| OCR | Qwen3.5-9B | 0.50 |
| depth ordering | five-way tie (9B / 27B / 35B-A3B / VL-32B / VL-30B-A3B) | 1.00 |
| data-format recognition | Qwen3.5-27B / VL-32B | 1.00 |
| data-format → JSON | Qwen3-VL-4B | 0.67 |
| segmentation | Qwen3-VL-8B | 0.51 |
| outline | Qwen3-VL-8B / Q3.5-4B | 0.33 |
| spatial relations | Qwen3-VL-30B-A3B | 1.00 |
| VQA | Qwen3.5-35B-A3B | 1.00 |
| semantic relations | Qwen3-VL-32B | 0.82 |
| style | Qwen3-VL-4B / 8B | 0.58 |
The pixel-grounding tasks (segmentation, outline) belong to the Qwen3-VL family; the language/structured tasks (classification, depth, OCR, VQA) lean Qwen3.5. A two-model router — a Qwen3-VL model for grounding, a Qwen3.5 model for the rest — maximizes dataset quality. If you want one model, Qwen3-VL-4B is the best all-round value and Qwen3.5-27B / 35B-A3B the quality ceiling.
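A two-model router can be as small as a set lookup. In this sketch the model ids are the ones used on the command line in this post, but which model fills each slot, the contents of `GROUNDING_TASKS`, and the `pick_model` / `plan` helpers are my own illustrative choices.

```python
from collections import defaultdict

# Categories routed to the Qwen3-VL line; everything else goes to Qwen3.5.
GROUNDING_TASKS = {"segmentation", "outline_association", "bbox_grounding"}


def pick_model(category, grounding_model="qwen3vl-4b",
               default_model="qwen3.5-35b-a3b"):
    """Return the model id that should label this category."""
    return grounding_model if category in GROUNDING_TASKS else default_model


def plan(categories, **kwargs):
    """Group categories by model so each model is loaded only once."""
    by_model = defaultdict(list)
    for category in categories:
        by_model[pick_model(category, **kwargs)].append(category)
    return dict(by_model)


print(plan(["segmentation", "ocr_text", "bbox_grounding"]))
# {'qwen3vl-4b': ['segmentation', 'bbox_grounding'], 'qwen3.5-35b-a3b': ['ocr_text']}
```

Grouping by model first mirrors the model-outer iteration of the benchmark run: swapping weights is the expensive step, so each model should see all of its categories in one pass. Passing `grounding_model="qwen3vl-8b"` would favour segmentation quality over throughput.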
Honest caveats
- The geometric tasks use synthetic ground truth. Colored-shape scenes and rendered data-format screenshots isolate each capability and are fully reproducible, but they are not natural-image difficulty. Real-image variants (COCO masks, NYU depth, SUN-RGBD 3D) are the obvious next step.
- 3D is unsolved here — every model is ~0.00. Recovering a 3D box from a single 2D image is genuinely hard, and the synthetic proxy is crude. This is the one task that clearly needs real 3D data or a finetune.
- The camera task is easy to over-credit. Its synthetic ground truth fixes yaw and pitch at 0 and only varies roll, so a model that outputs `[0, 0, guess]` gets two of three axes for free. Read those numbers as "can it estimate roll," not full 6-DoF pose.
- No model clears the 90% native-robustness bar across all 15 tasks, so none reaches the ship bucket: every model lands in "finetune-candidate" or "insufficient." That's the strict threshold reacting to the 2–3 hard categories, not a statement that the models can't do JSON. Constrained decoding raises validity (the native→constr columns) but stops at ~85–91% because long polygon/3D outputs truncate, so the bar isn't cleared by constraining alone.
- Throughput is measured on `transformers`, not vLLM (which would be faster). N=12 per category is a characterization run, not a leaderboard-grade N.
Reproduce
```bash
pip install -e ".[vlmbench]"
qwen-vlmbench --runner vlm --dataset full --n 12 \
  --modes json_mode constrained --clear-cache-after-model \
  --models qwen3vl-4b qwen3.5-27b \
  --categories segmentation bbox_grounding semantic_association
```
The full harness — registry-driven schemas, GBNF generation, the coordinate-normalization layer, tolerant
label matching, the synthetic ground-truth generators, and the durable orchestrator — is in
qwen_test_runner/vision/, with 102 CPU-only tests so the scoring logic is verifiable without a GPU.
Takeaways
- Measure effective yield. Accuracy-over-valid-only rewards models for failing on hard inputs; counting invalid (and a fully-failed category) as a miss is the honest metric, and it changes the ranking.
- Constrain for shape, budget for length. A grammar makes wrong-shape JSON impossible and lifts validity 2–27 points, but it can't stop a long polygon from truncating — so the hard categories still need either bigger token budgets or a finetune.
- You don't need a giant model. The quality curve tops out near 0.55; Qwen3-VL-4B delivers ~95% of that at ~3.8× the throughput, and Qwen3.5-35B-A3B is the fast-MoE route to the ceiling.
- Route per task. Qwen3-VL for grounding/segmentation, Qwen3.5 for language/structured — no single model wins all 15.
If you're building structured-output datasets from images, you can skip the 70B: point Qwen3-VL-4B (or Qwen3.5-27B/35B-A3B for the quality ceiling) at your images with a grammar, route the grounding tasks to the VL line, and you have a production image→JSON labeler today.