qwen-vlmbench: which Qwen VLM is the best zero-shot image→JSON labeler?

I needed to turn a large pile of images into clean, schema-conformant JSON — classification labels, bounding boxes, transcriptions, segmentation polygons, spatial relations, data-format parses — with no human in the loop and, ideally, no finetuning. The plan was to point a vision-language model at a million images and trust the output.

The only question was which model. So I built a benchmark, ran 12 Qwen VLMs across 15 vision tasks on a single RTX 6000 Pro (96 GB), and scored every model on how robustly and correctly it emits JSON. This post walks through the system, how to run it, and what the full grid says.

Preface

The sample sizes per task were small (N=12 per category), so these results need more testing and more data before the finer distinctions between models can be treated as settled.

Each request costs a full model inference, which makes these models highly inefficient as labelers. A proper processor, wrapper, and substructured tiny LLM would be considerably more effective, but the data to train such a tiny LLM is sparse at best. Over time I'll be building a war chest of elemental systems dedicated to useful prompt-to-data-type processing, by semi-intelligent and intelligent means.

TL;DR

  • Qwen3.5-27B is the quality ceiling (mean effective yield 0.549, labeler score 0.553), but it's slow (19 tok/s). Qwen3.5-35B-A3B ties it on quality within noise (0.540 / 0.541) while running ~40% faster (27 tok/s) thanks to its MoE design, so it's the better practical quality pick.
  • Qwen3-VL-4B is the volume champion: 95% of the 27B's labeler score at ~3.8× the throughput (73 tok/s). At 4B it beats every Qwen3.5 model up through 9B.
  • The Qwen3-VL line is far more parameter-efficient than Qwen3.5. Within the Qwen3.5 dense line scale helps roughly linearly (4B→9B→27B = 0.42→0.49→0.55 mean yield), but VL-4B (0.50) already matches Qwen3.5-9B.
  • Constrained decoding is a real validity lever, not a magic wand: a grammar lifts the schema-valid rate by 2–27 points (most for the weakest models), but it tops out at ~85–91% on the capable models because the hardest categories produce long outputs that still truncate.
  • Route per task for max quality — segmentation/grounding want a Qwen3-VL model; classification, depth, OCR, and VQA want a Qwen3.5 model.

Everything is reproducible from one CLI command, with 102 CPU-only tests covering the scoring logic.


The problem: "valid JSON" is the hard part

A VLM that describes an image is easy. A VLM that emits exactly this schema, every time, with coordinates in the right space and no markdown fences is a different problem. In practice three things go wrong, and each one silently corrupts a labeling run:

  1. The model wraps output in ```json fences, prepends prose, or runs past the closing brace.
  2. The model substitutes its own schema — ask Qwen3-VL for {"box": [...]} and you get its native {"bbox_2d": [...]}; ask for an object and you get a bare array.
  3. Coordinates live in a different space than your ground truth. Qwen3-VL emits boxes in 0..1000 relative units; COCO is in absolute pixels. Compared naively, every IoU is wrong.

A useful benchmark has to measure all three, not just "is the answer right." That shaped the metric.


The metric: effective yield

The headline per-category number is effective yield = task_accuracy × schema_valid_rate — the fraction of all images that get a correct AND valid label.

This avoids a real trap. If you score accuracy only over the valid outputs, you reward a model for failing: a model that emits invalid JSON on the hard images and valid JSON only on the easy ones gets graded on the easy subset and looks better than it is. Counting an invalid output as a miss fixes it — and it matches reality, where an unparseable row is a lost row. (This rule applies all the way up: when a model emits zero valid outputs for a whole category, that category's effective yield is 0, and it counts as 0 in the model's mean — not omitted.)
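To make the trap concrete, here is a minimal sketch with made-up numbers (not taken from the run). The names `Sample`, `accuracy_over_valid`, and `effective_yield` are illustrative, not the harness's API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    schema_valid: bool  # did the output validate against the schema?
    correct: bool       # was the label right (only meaningful when valid)?


def accuracy_over_valid(samples):
    """The flattering metric: grade only the outputs that parsed."""
    valid = [s for s in samples if s.schema_valid]
    return sum(s.correct for s in valid) / len(valid) if valid else 0.0


def effective_yield(samples):
    """Fraction of ALL images that got a valid AND correct label."""
    if not samples:
        return 0.0
    return sum(s.schema_valid and s.correct for s in samples) / len(samples)


# Hypothetical category of 12 images: valid JSON on 6 easy images
# (5 of them correct), invalid JSON on the 6 hard ones.
samples = (
    [Sample(True, True)] * 5
    + [Sample(True, False)]
    + [Sample(False, False)] * 6
)

print(round(accuracy_over_valid(samples), 3))  # 0.833
print(round(effective_yield(samples), 3))      # 0.417
```

The same model looks like an 83% labeler when graded on the valid subset and a 42% labeler when every image counts. The second number equals the first times the schema-valid rate (0.833 × 0.5), which is the product form given above.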

Two more signals run on every category:

  • schema-valid rate — does the output validate against the per-category Pydantic model (after a never-raises recovery walk that strips fences and finds the first balanced object)?
  • JSON robustness — did it parse without structural repair? A clean object wrapped in a fence is fine (fence-stripping is deterministic); an object buried in prose is not. These are kept separate so a benign fence isn't punished like a real failure.
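A recovery walk of this kind can be sketched in a few lines of standard-library Python. This is an assumption about the general shape, not the harness's code: `strip_fence`, `first_balanced_object`, and `recover` are hypothetical names, and the real walk may handle more cases.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to avoid nesting fences


def strip_fence(text):
    """Remove one surrounding markdown fence (and language tag), if present."""
    t = text.strip()
    if t.startswith(FENCE) and t.endswith(FENCE) and len(t) >= 2 * len(FENCE):
        inner = t[len(FENCE):-len(FENCE)]
        first_line, sep, rest = inner.partition("\n")
        if sep and first_line.strip().isalpha():  # e.g. the "json" tag
            inner = rest
        return inner.strip()
    return t


def first_balanced_object(text):
    """Return the first balanced {...} span, ignoring braces inside strings."""
    start = text.find("{")
    if start < 0:
        return None
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # ran out of text mid-object, e.g. a truncated output


def recover(raw):
    """Never raises. Returns (object_or_None, robust).

    robust is True only when the payload parsed after nothing worse than
    deterministic fence-stripping; digging an object out of prose is not robust.
    """
    try:
        text = strip_fence(raw if isinstance(raw, str) else "")
        try:
            obj = json.loads(text)
            return (obj, True) if isinstance(obj, dict) else (None, False)
        except ValueError:
            pass
        span = first_balanced_object(text)
        if span is None:
            return None, False
        obj = json.loads(span)
        return (obj, False) if isinstance(obj, dict) else (None, False)
    except Exception:
        return None, False


print(recover(FENCE + 'json\n{"label": "tv"}\n' + FENCE))  # ({'label': 'tv'}, True)
print(recover('Sure! {"label": "tv"} Hope that helps.'))    # ({'label': 'tv'}, False)
print(recover('{"label": "tv"'))                            # (None, False)
```

The recovered object would still need to pass schema validation afterwards; this sketch only covers the parse step and the robustness flag.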

A composite labeler_score then multiplies accuracy by validity and robustness, so an accurate model that emits fragile JSON ranks below a slightly-less-accurate model that emits clean JSON — the correct ordering when you're trusting output blind. The leaderboard ranks by labeler_score; the per-category grid shows raw effective yield. On this run the two agree on the podium.


The system

The benchmark is a small Python package (qwen_test_runner.vision) with a registry-driven core.

One registry entry per category produces three artifacts. Each category declares a tiny field registry, and the same code generation turns it into a Pydantic model (validation), a JSON Schema (tool-use), and a GBNF grammar (constrained decoding) — all at once. Adding a task is one dict entry; nothing hardcodes a field name.

# a category is just data:
_BBOX = VisionTaskSpec(
    category="bbox_grounding",
    fields={
        "detections": _list_of("detections",
            _f("label", optional=False),
            _f("box", value_kind="bbox", optional=False),
            _f("score", value_kind="number", optional=False, number_range=(0.0, 1.0))),
        "count": _f("count", value_kind="integer", optional=False),
    },
    system_prompt="... output ONLY a raw JSON object ... {coord_hint} ...",
    metric="detection",
    coord_space=CoordSpace.NORM_0_1000,
    gt_dataset="coco_detection",
)
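The "one registry, several artifacts" idea can be illustrated without the package. The sketch below is a deliberately simplified stand-in: it uses a flat, hypothetical `FIELDS` dict instead of the real `_f` / `_list_of` helpers, skips the Pydantic model to stay dependency-free, and emits only a JSON Schema dict and a small GBNF-style grammar. The function names are assumptions.

```python
import json

# Hypothetical flat registry: field name -> value kind.
FIELDS = {"label": "string", "count": "integer", "score": "number"}
KINDS = {"string", "integer", "number"}  # JSON Schema types and grammar rule names


def to_json_schema(fields):
    """JSON Schema with every field required and no extra keys allowed."""
    assert set(fields.values()) <= KINDS
    return {
        "type": "object",
        "properties": {name: {"type": kind} for name, kind in fields.items()},
        "required": list(fields),
        "additionalProperties": False,
    }


def to_gbnf(fields):
    """GBNF-style grammar that fixes key order and value kinds."""
    assert set(fields.values()) <= KINDS
    pairs = []
    for name, kind in fields.items():
        # Double dumps yields a grammar literal that matches the quoted key.
        key_literal = json.dumps(json.dumps(name))
        pairs.append(f'{key_literal} ws ":" ws {kind}')
    body = ' ws "," ws '.join(pairs)
    return "\n".join([
        f'root ::= "{{" ws {body} ws "}}"',
        r'string ::= "\"" [^"\\]* "\""',  # simplified: no escape sequences
        'integer ::= "-"? [0-9]+',
        'number ::= integer ("." [0-9]+)?',
        r'ws ::= [ \t\n]*',
    ])


print(json.dumps(to_json_schema(FIELDS), indent=2))
print(to_gbnf(FIELDS))
```

Because both artifacts are derived from the same dict, renaming a field changes the validator and the grammar together, which is the property the registry design is after.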

Constrained decoding via xgrammar. Each category's GBNF grammar is compiled once and applied as a logits processor, so a bare array, a code fence, or a wrong-shape object becomes structurally impossible. This is the difference between "the model usually emits valid JSON" and "the output matches the grammar by construction" — with one honest limit, covered in the results: the grammar constrains shape, but a model can still hit the token budget mid-object on the long-output categories (polygons, 3D boxes), so constrained validity lands high but not at 100%.
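The shape-versus-length limit can be seen without a model. In this sketch a character budget stands in for the token budget (an assumption made for simplicity), and the polygon values are made up.

```python
import json

# A long-output label: a polygon with many vertices (values are made up).
polygon = [[i * 7 % 1000, i * 13 % 1000] for i in range(200)]
full_output = json.dumps({"label": "shape", "polygon": polygon})


def parses(text):
    """True if the text is complete, valid JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# Stand-in for the generation budget: characters instead of tokens.
budget = 512
truncated = full_output[:budget]

print(len(full_output) > budget)                # True
print(parses(full_output), parses(truncated))   # True False
```

Every character of `truncated` is something a grammar would have allowed, yet the result is not a valid object because generation stopped before the closing brace. A grammar rules out wrong shapes; it cannot make an output shorter.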

A coordinate-normalization layer. Predictions and ground truth are converted to a single canonical form (pixel-absolute xyxy) before any IoU, with the model's space declared per category. This is the piece that makes detection/segmentation numbers trustworthy across models that use different conventions.
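A minimal sketch of that conversion, assuming a hypothetical 640×480 image and hypothetical helper names. The COCO `[x, y, width, height]` box layout is standard; everything else here is illustrative rather than the harness's code.

```python
def coco_xywh_to_xyxy(box):
    """COCO stores boxes as [x, y, width, height] in pixels."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def norm1000_to_pixels(box, width, height):
    """Convert an xyxy box in 0..1000 relative units to absolute pixels."""
    x1, y1, x2, y2 = box
    return (x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height)


def iou(a, b):
    """IoU of two xyxy boxes that share a coordinate space."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Ground truth in COCO pixels; the prediction is the same box in 0..1000 units.
gt = coco_xywh_to_xyxy((160.0, 120.0, 320.0, 240.0))
pred_norm = (250, 250, 750, 750)

naive = iou(pred_norm, gt)                                  # spaces mixed up
fixed = iou(norm1000_to_pixels(pred_norm, 640, 480), gt)    # both in pixel xyxy

print(round(naive, 3), fixed)  # 0.084 1.0
```

A perfect prediction scores an IoU of about 0.08 when the spaces are mixed, which would fail any IoU@0.5 threshold. Declaring the model's space per category and converting first is what keeps that error out of the detection numbers.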

Tolerant label matching. A synonym map plus plural/word-containment so a model that says "television" isn't penalized against COCO's "tv", or "spaghetti" against food-101's "spaghetti bolognese". The richer label is arguably better; it just isn't a string match.
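One plausible reading of "synonym map plus plural/word-containment" is sketched below. The `SYNONYMS` entries, the plural rule, and the function names are assumptions for illustration; the harness's actual rules may differ.

```python
# Hypothetical synonym map; only "television" -> "tv" comes from the text above.
SYNONYMS = {"television": "tv", "sofa": "couch"}


def normalize(label):
    """Lowercase, split on spaces/underscores, fold plurals, apply synonyms."""
    words = label.lower().replace("_", " ").split()
    out = []
    for w in words:
        # Crude plural folding: "dogs" -> "dog", but keep "glass" and "bus".
        if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us")):
            w = w[:-1]
        out.append(SYNONYMS.get(w, w))
    return out


def labels_match(pred, truth):
    """True on exact match after normalization, or on word containment."""
    p, t = normalize(pred), normalize(truth)
    if not p or not t:
        return False
    if p == t:
        return True
    # Containment: every word of the shorter label appears in the longer one.
    short, long_ = (p, t) if len(p) <= len(t) else (t, p)
    return all(w in long_ for w in short)


print(labels_match("television", "tv"))                    # True
print(labels_match("spaghetti", "spaghetti_bolognese"))    # True
print(labels_match("cat", "dog"))                          # False
```

Containment is permissive by design: it would also accept "dog" for "hot dog". Whether that trade-off is acceptable depends on the label set, so treat the rule as a knob rather than a fixed choice.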

Ground truth: real where it's clean, synthetic where it isn't. Detection (COCO), VQA (VQAv2), classification (food-101), and OCR (TextVQA) come from public parquet datasets. The geometric tasks — spatial relations, depth ordering, subject fixation, segmentation, outline, 3D, camera — use self-contained synthetic scenes (colored shapes, rendered data-format screenshots) with exact ground truth and zero download. It isn't natural-image difficulty, but it isolates each capability cleanly and runs anywhere.

Running it

pip install -e ".[vlmbench]"

# offline sanity check (no GPU, no download):
qwen-vlmbench --runner stub --dataset smoke \
  --categories image_classification bbox_grounding ocr_text

# the full array (one command — durable + resumable):
qwen-vlmbench --runner vlm --dataset full --n 12 \
  --modes json_mode constrained --clear-cache-after-model \
  --models qwen3.5-2b qwen3.5-4b qwen3.5-9b qwen3.5-27b qwen3.5-35b-a3b \
           qwen3vl-2b qwen3vl-4b qwen3vl-8b qwen3vl-32b qwen3vl-30b-a3b \
  --categories <any of the 15>

What the run writes

The orchestrator iterates model-outer and streams results to disk, so a multi-hour run survives a disconnect and resumes exactly where it stopped:

  • results.jsonl — one row per scored sample: {model, reasoning, category, mode, image_id, primary_score, schema_valid, json_robust, raw_text, tokens_per_sec}. Resume keys on (model, reasoning, category, mode, image_id), so re-running the command skips completed work.
  • metrics.jsonl — per (model, category, mode) aggregates: primary_score_mean, schema_valid_rate, json_robustness, tokens_per_sec. This file is what the tables below are built from.
  • leaderboard.md / summary.json / summary.csv — the cross-model ranking, including the native-vs-constrained validity gap and the ship/finetune bucket per model.
  • run.log — a human-readable trace; each model's weights are deleted from the HF cache after it finishes (--clear-cache-after-model) so a 12-model sweep peaks at one model on disk, not all twelve.
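The resume behavior can be sketched as an append-only JSONL file plus a set of completed keys. This is a hedged illustration of the pattern, not the orchestrator itself: `load_done_keys`, `ends_mid_line`, and `run` are hypothetical names, the scoring function is a placeholder, and the `reasoning` value shown is a guess at the field's type.

```python
import json
import os
import tempfile

KEY_FIELDS = ("model", "reasoning", "category", "mode", "image_id")


def load_done_keys(path):
    """Collect resume keys from an existing results.jsonl (missing file = none)."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                row = json.loads(line)
            except ValueError:
                continue  # blank or half-written line from a crash; redo it
            if isinstance(row, dict):
                done.add(tuple(row.get(k) for k in KEY_FIELDS))
    return done


def ends_mid_line(path):
    """True if the file is non-empty and lacks a trailing newline."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as fh:
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) != b"\n"


def run(path, work, score_fn):
    """Append one row per new sample; skip anything already on disk."""
    done = load_done_keys(path)
    repair = ends_mid_line(path)
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        if repair:
            fh.write("\n")  # terminate a partial row before appending
        for item in work:
            key = tuple(item[k] for k in KEY_FIELDS)
            if key in done:
                continue
            fh.write(json.dumps(dict(item, primary_score=score_fn(item))) + "\n")
            fh.flush()  # stream to disk so a disconnect loses at most one row
            done.add(key)
            written += 1
    return written


work = [
    {"model": "qwen3vl-4b", "reasoning": False, "category": "ocr_text",
     "mode": "json_mode", "image_id": i}
    for i in range(3)
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.jsonl")
    first = run(path, work[:2], lambda item: 1.0)  # "interrupted" after 2 samples
    second = run(path, work, lambda item: 1.0)     # same command, run again
print(first, second)  # 2 1
```

The second call writes only the one sample that was missing, which is the "skips completed work" behavior described above.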

The 15 tasks

| task | what it probes | ground truth | metric |
|---|---|---|---|
| image_classification | object/scene recognition | food-101 | top-1 (tolerant) |
| bbox_grounding | localization + grounded counting | COCO | mAP / IoU@0.5 F1 |
| ocr_text | reading + transcription | TextVQA | answer containment |
| data_type_differentiation | recognize a rendered format | synthetic | exact format match |
| data_type_utilization | re-serialize to JSON | synthetic | key/value F1 |
| structural_spatial_awareness | left/right/above/below | synthetic | triple F1 |
| depth_analysis | relative depth ordering | synthetic | pairwise order acc |
| subject_fixation | primary salient subject | synthetic | IoU + label |
| segmentation | instance masks (polygons) | synthetic | mIoU |
| outline_association | object outline polygon | synthetic | polygon IoU |
| geometric_3d_object_id | 3D boxes | synthetic | 3D center match |
| camera_rotational_offset | camera roll | synthetic | acc@30° |
| semantic_association | entity relations | synthetic | triple F1 |
| vit_accuracy_to_prompt | grounded VQA | VQAv2 | answer match |
| style_structural_awareness | style + structure | synthetic | style accuracy |
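As an example of one of these metrics, a set-based triple F1 (used for the spatial and semantic relation tasks) might look like the sketch below. It assumes exact matching on (subject, relation, object) tuples; the harness may normalize labels or relations first, and the triples here are invented.

```python
def triple_f1(predicted, truth):
    """Set-based F1 over (subject, relation, object) triples."""
    pred, gold = set(predicted), set(truth)
    if not pred and not gold:
        return 1.0  # nothing to find, nothing claimed
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up synthetic scene with three ground-truth relations.
gold = [
    ("red circle", "left_of", "blue square"),
    ("blue square", "above", "green triangle"),
    ("red circle", "above", "green triangle"),
]
pred = [
    ("red circle", "left_of", "blue square"),
    ("blue square", "above", "green triangle"),
    ("green triangle", "left_of", "red circle"),  # wrong
]

print(round(triple_f1(pred, gold), 3))  # 0.667
```

Two of three predicted triples are right and two of three true triples are found, so precision and recall are both 2/3 and F1 is 2/3.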

Results: the full 12-model ranking

Ranked by labeler_score (accuracy × validity × robustness, native json-mode). Mean yield is mean effective yield over all 15 categories with a fully-failed category counted as 0. native→constr valid shows the schema-valid rate in json-mode and the lift constrained decoding adds. N=12 per category.

| rank | model | params (active) | labeler | mean yield | native-valid | constr-valid | tok/s | bucket |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3.5-27B | 27B | 0.553 | 0.549 | 82.8% | 87.2% | ~19 | finetune-candidate |
| 2 | Qwen3.5-35B-A3B | 35B (3B) MoE | 0.541 | 0.540 | 78.9% | 85.0% | ~27 | finetune-candidate |
| 3 | Qwen3-VL-4B | 4B | 0.524 | 0.496 | 84.4% | 86.6% | ~73 | finetune-candidate |
| 4 | Qwen3-VL-32B | 32B | 0.521 | 0.502 | 80.6% | 87.3% | ~18 | finetune-candidate |
| 5 | Qwen3-VL-8B | 8B | 0.502 | 0.507 | 82.8% | 89.5% | ~57 | finetune-candidate |
| 6 | Qwen3-VL-30B-A3B | 30B (3B) MoE | 0.500 | 0.459 | 78.3% | 86.6% | ~36 | finetune-candidate |
| 7 | Qwen3.5-4B | 4B | 0.487 | 0.421 | 76.1% | 85.5% | ~49 | finetune-candidate |
| 8 | Qwen3.5-9B | 9B | 0.461 | 0.486 | 79.4% | 87.2% | ~48 | finetune-candidate |
| 9 | Qwen3-VL-2B | 2B | 0.391 | 0.348 | 82.8% | 91.1% | ~89 | insufficient |
| 10 | Qwen3.5-2B | 2B | 0.300 | 0.251 | 71.1% | 86.7% | ~66 | insufficient |
| 11 | Qwen3.5-0.8B | 0.8B | 0.139 | 0.126 | 58.3% | 76.1% | ~68 | insufficient |
| 12 | Qwen3.5-0.8B (finetuned captioner) | 0.8B | 0.123 | 0.082 | 46.7% | 73.9% | ~70 | insufficient |

Two things to read off this table before the per-category detail:

  • The native→constr columns are the constrained-decoding story, quantified. The grammar lifts the schema-valid rate everywhere, but the lift is small for the strong models (VL-4B +2.2, 27B +4.4) and large for the weak ones (Qwen3.5-2B +15.6, the captioner +27.2). And it never reaches 100% — even the best constrained validity is ~91% — because the long-output categories truncate. Constraining fixes shape, not length.
  • Every model is bucketed finetune-candidate or insufficient, because none clears the harness's 90% native-robustness bar across all 15 categories at once. That's the 2–3 genuinely hard categories (3D, segmentation) dragging the aggregate down — not a claim that the models can't emit JSON. On the other 12 categories validity is high.
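The lift figures quoted above are simple differences of the two validity columns. The snippet below recomputes them from the ranking table; the numbers are copied from that table, nothing is newly measured.

```python
# (native-valid %, constrained-valid %) copied from the ranking table.
VALIDITY = {
    "Qwen3.5-27B": (82.8, 87.2),
    "Qwen3.5-35B-A3B": (78.9, 85.0),
    "Qwen3-VL-4B": (84.4, 86.6),
    "Qwen3-VL-32B": (80.6, 87.3),
    "Qwen3-VL-8B": (82.8, 89.5),
    "Qwen3-VL-30B-A3B": (78.3, 86.6),
    "Qwen3.5-4B": (76.1, 85.5),
    "Qwen3.5-9B": (79.4, 87.2),
    "Qwen3-VL-2B": (82.8, 91.1),
    "Qwen3.5-2B": (71.1, 86.7),
    "Qwen3.5-0.8B": (58.3, 76.1),
    "Qwen3.5-0.8B (finetuned captioner)": (46.7, 73.9),
}

# Lift in percentage points, rounded to the table's precision.
lift = {model: round(constr - native, 1)
        for model, (native, constr) in VALIDITY.items()}

for model, points in sorted(lift.items(), key=lambda kv: kv[1]):
    print(f"{model:38s} +{points:.1f}")

print(min(lift.values()), max(lift.values()))        # 2.2 27.2
print(max(constr for _, constr in VALIDITY.values()))  # 91.1
```

The smallest lift belongs to the model with the highest native validity and the largest to the model with the lowest, and the best constrained validity in the fleet is 91.1%.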

Per-category effective yield

The 10 native-capable candidates, every category. (The two 0.8B baselines rank last above and are omitted here — their rows are sparse, with several categories producing zero valid output.) n/a = the model emitted zero schema-valid outputs for that category; in the mean it counts as 0. Bold = the best model in that row.

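| category | Q3.5-2B | Q3.5-4B | Q3.5-9B | Q3.5-27B | Q3.5-35B-A3B | VL-2B | VL-4B | VL-8B | VL-32B | VL-30B-A3B |
|---|---|---|---|---|---|---|---|---|---|---|
| classification | 0.00 | 0.33 | 0.42 | 0.42 | **0.50** | 0.08 | 0.00 | 0.17 | 0.33 | 0.42 |
| bbox_grounding | 0.16 | 0.08 | 0.25 | 0.18 | 0.20 | 0.27 | **0.30** | 0.23 | 0.18 | 0.19 |
| ocr_text | 0.18 | 0.42 | **0.50** | 0.33 | 0.42 | 0.42 | 0.42 | 0.34 | 0.33 | 0.42 |
| data_type_diff | 0.17 | 0.67 | 0.83 | **1.00** | 0.92 | 0.67 | 0.67 | 0.75 | **1.00** | 0.58 |
| data_type_util | 0.17 | 0.08 | 0.08 | 0.17 | 0.17 | 0.00 | **0.67** | 0.08 | 0.17 | 0.17 |
| spatial relations | 0.76 | 0.00 | 0.83 | 0.71 | 0.81 | 0.65 | 0.80 | 0.82 | 0.65 | **1.00** |
| depth ordering | 0.17 | 0.81 | **1.00** | **1.00** | **1.00** | 0.14 | 0.50 | 0.81 | **1.00** | **1.00** |
| subject_fixation | n/a | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** |
| segmentation | 0.00 | 0.00 | 0.00 | 0.39 | 0.16 | 0.01 | 0.42 | **0.51** | 0.22 | 0.00 |
| outline | 0.00 | **0.33** | 0.00 | 0.17 | 0.00 | 0.00 | 0.00 | **0.33** | n/a | 0.12 |
| 3D object id | n/a | n/a | 0.00 | 0.00 | n/a | 0.00 | n/a | **0.03** | 0.00 | n/a |
| camera roll | 0.53 | 0.56 | 0.78 | **0.81** | **0.81** | 0.75 | 0.75 | 0.69 | 0.58 | 0.67 |
| VQA | 0.75 | 0.92 | 0.50 | 0.92 | **1.00** | 0.25 | 0.58 | 0.67 | 0.92 | 0.50 |
| semantic relations | 0.46 | 0.78 | 0.77 | 0.81 | 0.78 | 0.56 | 0.75 | 0.60 | **0.82** | 0.64 |
| style | 0.42 | 0.33 | 0.33 | 0.33 | 0.33 | 0.42 | **0.58** | **0.58** | 0.33 | 0.17 |
| mean (15, n/a=0) | 0.251 | 0.421 | 0.486 | **0.549** | 0.540 | 0.348 | 0.496 | 0.507 | 0.502 | 0.459 |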
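The "n/a counts as 0" rule changes the mean noticeably for models with failed categories. The sketch below recomputes the Q3.5-2B mean from its column in the grid, once with n/a as 0 (the rule used here) and once with n/a omitted; `mean_yield` is an illustrative helper, not part of the harness.

```python
# Q3.5-2B column from the grid; None marks an n/a category.
Q35_2B = {
    "classification": 0.00, "bbox_grounding": 0.16, "ocr_text": 0.18,
    "data_type_diff": 0.17, "data_type_util": 0.17, "spatial relations": 0.76,
    "depth ordering": 0.17, "subject_fixation": None, "segmentation": 0.00,
    "outline": 0.00, "3D object id": None, "camera roll": 0.53,
    "VQA": 0.75, "semantic relations": 0.46, "style": 0.42,
}


def mean_yield(column, na_as_zero=True):
    """Mean effective yield; n/a either counts as 0 or is dropped."""
    if na_as_zero:
        values = [0.0 if v is None else v for v in column.values()]
    else:
        values = [v for v in column.values() if v is not None]
    return sum(values) / len(values) if values else 0.0


print(round(mean_yield(Q35_2B), 3))                    # 0.251
print(round(mean_yield(Q35_2B, na_as_zero=False), 3))  # 0.29
```

Dropping the two failed categories would lift this model from 0.251 to 0.29, rewarding it for producing nothing usable on subject fixation and 3D.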

What the grid says

Scale helps — but architecture helps more. Inside the Qwen3.5 dense line, quality climbs roughly linearly with size: 4B → 9B → 27B is 0.421 → 0.486 → 0.549 mean yield. So if you've committed to Qwen3.5, bigger genuinely is better. But the Qwen3-VL line gets there far cheaper: VL-4B (0.496) already matches Qwen3.5-9B (0.486) and beats Qwen3.5-4B (0.421) by 0.075 — at the same or smaller parameter count. The grounding-heavy training of the VL line is worth more than a couple of billion extra parameters.

The quality ceiling is ~0.55, and two models share it. Qwen3.5-27B (0.549) is the top scorer, but Qwen3.5-35B-A3B ties it within noise (0.540) and decodes ~40% faster (~27 vs ~19 tok/s) because only ~3B of its parameters are active per token. For a quality-first pipeline that still needs throughput, the MoE is the pick; the dense 27B is the pick only if you want the single highest number.

Qwen3-VL-4B is the volume winner. It reaches 95% of the 27B's labeler score (0.524 vs 0.553) and ~97% of the 35B-A3B's (0.524 vs 0.541), at **73 tok/s — 3.8× the 27B's throughput and 2.7× the 35B-A3B's.** It's also the most balanced small model: it's the only sub-8B model that posts a real number on both data-format→JSON (0.67) and segmentation (0.42), while staying strong on the language tasks. For labeling at scale, this is the one.

The two 0.8B models are genuinely insufficient (0.126 and 0.082). The finetuned captioner is worse on this generic benchmark than the base 0.8B — because it was tuned to its own schema, so it reliably emits the wrong shape here, and its constrained validity (73.9%) is the lowest in the fleet. That's the expected cost of a narrow finetune evaluated off-distribution. It says "a schema-specific finetune doesn't generalize to other schemas" — not that finetuning is useless; a finetune matched to your schema is a different experiment this run doesn't cover.

Capability is task-specific, so route per task. No single model wins everywhere:

| task | best model | yield |
|---|---|---|
| classification | Qwen3.5-35B-A3B | 0.50 |
| OCR | Qwen3.5-9B | 0.50 |
| depth ordering | five-way tie (9B / 27B / 35B-A3B / VL-32B / VL-30B-A3B) | 1.00 |
| data-format recognition | Qwen3.5-27B / VL-32B | 1.00 |
| data-format → JSON | Qwen3-VL-4B | 0.67 |
| segmentation | Qwen3-VL-8B | 0.51 |
| outline | Qwen3-VL-8B / Q3.5-4B | 0.33 |
| spatial relations | Qwen3-VL-30B-A3B | 1.00 |
| VQA | Qwen3.5-35B-A3B | 1.00 |
| semantic relations | Qwen3-VL-32B | 0.82 |
| style | Qwen3-VL-4B / 8B | 0.58 |

The pixel-grounding tasks (segmentation, outline) belong to the Qwen3-VL family; the language/structured tasks (classification, depth, OCR, VQA) lean Qwen3.5. A two-model router — a Qwen3-VL model for grounding, a Qwen3.5 model for the rest — maximizes dataset quality. If you want one model, Qwen3-VL-4B is the best all-round value and Qwen3.5-27B / 35B-A3B the quality ceiling.
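A two-model router of this kind is little more than a lookup. In the sketch below the category ids come from the task table and the model ids from the CLI examples, but the specific pairing (`qwen3vl-8b` for grounding, `qwen3.5-35b-a3b` for the rest) is just one example choice, and `route` / `plan` are hypothetical names.

```python
from collections import defaultdict

# Tasks treated as pixel-grounding; everything else goes to the default model.
GROUNDING_TASKS = {"bbox_grounding", "segmentation", "outline_association"}
GROUNDING_MODEL = "qwen3vl-8b"     # example choice: top segmentation/outline yield
DEFAULT_MODEL = "qwen3.5-35b-a3b"  # example choice: fast route to the quality ceiling


def route(category):
    """Pick the model for one task category."""
    return GROUNDING_MODEL if category in GROUNDING_TASKS else DEFAULT_MODEL


def plan(jobs):
    """Group (image_id, category) jobs by model so each model loads once."""
    batches = defaultdict(list)
    for image_id, category in jobs:
        batches[route(category)].append((image_id, category))
    return dict(batches)


jobs = [
    ("img_001", "segmentation"),
    ("img_001", "ocr_text"),
    ("img_002", "bbox_grounding"),
    ("img_002", "vit_accuracy_to_prompt"),
]
for model, batch in plan(jobs).items():
    print(model, batch)
```

Grouping by model before running mirrors the model-outer iteration described earlier: each set of weights is loaded once and works through its whole batch.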


Honest caveats

  • The geometric tasks use synthetic ground truth. Colored-shape scenes and rendered data-format screenshots isolate each capability and are fully reproducible, but they are not natural-image difficulty. Real-image variants (COCO masks, NYU depth, SUN-RGBD 3D) are the obvious next step.
  • 3D is unsolved here — every model is ~0.00. Recovering a 3D box from a single 2D image is genuinely hard, and the synthetic proxy is crude. This is the one task that clearly needs real 3D data or a finetune.
  • The camera task is easy to over-credit. Its synthetic ground truth fixes yaw and pitch at 0 and only varies roll, so a model that outputs [0, 0, guess] gets two of three axes for free. Read those numbers as "can it estimate roll," not as full three-axis rotation estimation, and certainly not as 6-DoF pose.
  • No model clears the 90% native-robustness bar across all 15 tasks, so none lands in the ship bucket: every model is either "finetune-candidate" or "insufficient." That's the strict threshold reacting to the 2–3 hard categories, not a statement that the models can't do JSON. Constrained decoding raises validity (the native→constr columns) but stops at ~85–91% because long polygon/3D outputs truncate, so the bar isn't cleared by constraining alone.
  • Throughput is measured on transformers, not vLLM (which would be faster). N=12 per category is a characterization run, not a leaderboard-grade N.

Reproduce

pip install -e ".[vlmbench]"
qwen-vlmbench --runner vlm --dataset full --n 12 \
  --modes json_mode constrained --clear-cache-after-model \
  --models qwen3vl-4b qwen3.5-27b \
  --categories segmentation bbox_grounding semantic_association

The full harness — registry-driven schemas, GBNF generation, the coordinate-normalization layer, tolerant label matching, the synthetic ground-truth generators, and the durable orchestrator — is in qwen_test_runner/vision/, with 102 CPU-only tests so the scoring logic is verifiable without a GPU.


Takeaways

  1. Measure effective yield. Accuracy-over-valid-only rewards models for failing on hard inputs; counting invalid (and a fully-failed category) as a miss is the honest metric, and it changes the ranking.
  2. Constrain for shape, budget for length. A grammar makes wrong-shape JSON impossible and lifts validity 2–27 points, but it can't stop a long polygon from truncating — so the hard categories still need either bigger token budgets or a finetune.
  3. You don't need a giant model. Top quality sits near 0.55 on both labeler score and mean yield; Qwen3-VL-4B delivers ~95% of the top labeler score at ~3.8× the throughput, and Qwen3.5-35B-A3B is the fast-MoE route to the ceiling.
  4. Route per task. Qwen3-VL for grounding/segmentation, Qwen3.5 for language/structured — no single model wins all 15.

If you're building structured-output datasets from images, you can skip the 70B: point Qwen3-VL-4B (or Qwen3.5-27B/35B-A3B for the quality ceiling) at your images with a grammar, route the grounding tasks to the VL line, and you have a production image→JSON labeler today.
