structured-extractor-qwen3vl-4b-exp93

A fine-tuned Qwen3-VL-4B-Instruct that extracts structured rows (name, value, date, unit) from images of financial-statement tables (Russian + English). This is the exp93 checkpoint — the project's reproducible gold recipe across multiple seeds.

Benchmarks

Evaluated on a held-out test split of real financial-statement table crops (no synthetic data in training), with the quality preset (num_beams=4, repetition_penalty=1.1, min_new_tokens=200):

| Metric | STRICT | LENIENT* |
|---|---|---|
| tuple_f1 | 0.5316 | 0.5686 |

* "LENIENT" normalizes unit synonyms (million → millions, млн руб. → млн руб) and accepts a year match when reference and prediction differ only by date precision. See score_lenient.py in this repo.
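To make the STRICT/LENIENT distinction concrete, here is a minimal sketch of a tuple-level F1 with an optional lenient normalization. It is not the code in score_lenient.py: the alias table, the year regex, and the function names `normalize` and `tuple_f1` are assumptions for illustration, and the real scorer may match or aggregate differently.

```python
import re
from collections import Counter

# Hypothetical alias table; the real list lives in score_lenient.py.
UNIT_ALIASES = {"million": "millions", "млн руб.": "млн руб"}
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def normalize(row, lenient):
    """Turn a parameter dict into a comparable (name, value, date, unit) tuple."""
    name = row["parameter_name"].strip()
    value = row["parameter_value"].strip()
    date = row["parameter_date"].strip()
    unit = row["parameter_unit"].strip()
    if lenient:
        unit = UNIT_ALIASES.get(unit.lower(), unit.lower())
        year = YEAR_RE.search(date)
        if year:
            date = year.group(0)  # "31.12.2024" and "2024" both become "2024"
    return (name, value, date, unit)

def tuple_f1(pred_rows, ref_rows, lenient=False):
    """F1 over multisets of tuples; a tuple counts only if all four fields match."""
    pred = Counter(normalize(r, lenient) for r in pred_rows)
    ref = Counter(normalize(r, lenient) for r in ref_rows)
    matched = sum((pred & ref).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a prediction that differs from the reference only by `million` vs `millions` or by `2024` vs `31.12.2024` scores 0 in strict mode and 1 in lenient mode.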

For comparison, on the same eval:

Variant STRICT LENIENT
exp93 b4+rp1.1 (this repo) 0.5316 0.5686
exp93 greedy ~0.50 ~0.55
exp85 greedy (DoRA+MLP without wd=0.05) 0.5154

A best-of-3-seeds checkpoint (exp106) reaches LENIENT 0.5923, but it's a lucky training trajectory — not reproducible from this recipe alone, so this repo ships the recipe-reproducible exp93 instead.

⚠️ Earlier versions of this project reported a tuple_f1 (abbreviated t_f1 below) of ~0.82. Those numbers were inflated by a target-leakage bug in the eval pipeline (the answer was in the model's input). The numbers above are genuine held-out results, measured with a leak-free eval (PageDataset(..., eval_mode=True)).

Recipe

  • Base: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
  • Adapter: DoRA-style LoRA, r=16, alpha=32, dropout=0.05
  • Targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP)
  • Training data: real financial tables only (~1.5k train, no synthetic augmentation)
  • Input: pre-cropped table image + markdown OCR of that table + date-column hint
  • Schedule: 2 epochs, AdamW, lr=1e-4, weight_decay=0.05, warmup_ratio=0.05
  • Saved as a full merged model (8.3 GB safetensors), not a PEFT adapter.
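As a rough illustration, the recipe above could be written down as plain keyword dictionaries. The key names follow the `peft.LoraConfig` and `transformers.TrainingArguments` parameters of the same names, but the actual training script is not part of this card, so the `optim` value and the way the dictionaries are consumed are assumptions.

```python
# Adapter settings from the recipe, as keyword arguments for peft.LoraConfig.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "use_dora": True,  # DoRA variant of LoRA
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
}

# Schedule settings, named after transformers.TrainingArguments fields.
training_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 1e-4,
    "weight_decay": 0.05,
    "warmup_ratio": 0.05,
    "optim": "adamw_torch",  # assumption: the card only says "AdamW"
}

# With peft installed, the adapter config could be built as:
#   from peft import LoraConfig
#   config = LoraConfig(**lora_kwargs)
```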

Quick start

pip install -r requirements.txt

from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp93"
)

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)

for row in result["parameters"]:
    print(row)
# {"parameter_name": "Interest income", "parameter_value": "533",
#  "parameter_date": "2024", "parameter_unit": "millions"}
# ...

Required inputs

The model was trained with three input streams. All three matter:

| Input | Status | Notes |
|---|---|---|
| Table image | Required | Pre-cropped to a single table region; long-side resized to 1344px (handled internally) |
| Markdown OCR of that table | Strongly recommended | The per-sample disambiguator on multi-table pages. Without it the model picks an arbitrary table and tuple-F1 drops to near zero. |
| date_columns hint | Optional | List of date-column headers; helps when markdown is noisy |

The table image must be cropped to the target table, not a full page. The training data uses single-table crops; full-page images at inference are untested and likely degrade quality.
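If you start from full pages, you need a crop step before calling the extractor. The sketch below only computes the crop box; it assumes you already have a table bounding box from some layout detector, and the helper name `padded_crop_box` and the 8-pixel padding are hypothetical choices that have not been evaluated against this model.

```python
def padded_crop_box(bbox, page_size, pad=8):
    """Expand a table bounding box by `pad` pixels and clamp it to the page.

    bbox is (left, top, right, bottom) in pixels; page_size is (width, height).
    """
    left, top, right, bottom = bbox
    width, height = page_size
    box = (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )
    if box[0] >= box[2] or box[1] >= box[3]:
        raise ValueError(f"empty crop box: {box}")
    return box

# With Pillow installed, the crop itself would be:
#   from PIL import Image
#   page = Image.open("page.png")
#   page.crop(padded_crop_box(table_bbox, page.size)).save("table_crop.png")
```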

Best-quality pipeline

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)

preset="quality" is num_beams=4, length_penalty=1.0, min_new_tokens=200, repetition_penalty=1.1, max_new_tokens=4096. This is the configuration that yields STRICT 0.5316 / LENIENT 0.5686.

Fastest pipeline (greedy)

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="fast",
)

Greedy decoding (num_beams=1). About 3-4× faster than quality with a ~0.03-0.05 t_f1 drop. Use this when latency or throughput matters more than the last point of F1.
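For readers who call `generate()` directly, the two presets can be pictured as dictionaries of generation keyword arguments. This is a sketch, not the wrapper's source: the `quality` values are the ones stated in this card, while the `fast` entry is only described as greedy, so everything in it beyond `num_beams=1` is an assumption, as are the names `PRESETS` and `generation_kwargs`.

```python
PRESETS = {
    "quality": {
        "num_beams": 4,
        "length_penalty": 1.0,
        "min_new_tokens": 200,
        "repetition_penalty": 1.1,
        "max_new_tokens": 4096,
    },
    "fast": {
        "num_beams": 1,
        "do_sample": False,      # assumption: plain greedy decoding
        "max_new_tokens": 4096,  # assumption: same cap as "quality"
    },
}

def generation_kwargs(preset, **overrides):
    """Return a copy of the preset's generate() kwargs with optional overrides."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    return {**PRESETS[preset], **overrides}
```

Returning a fresh dictionary means a caller can lower `max_new_tokens` for short tables without mutating the shared preset.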

Batch inference

from pathlib import Path
from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp93"
)

paths = sorted(Path("tables/").glob("*.png"))
# Each table image is expected to have a sibling .md file with its OCR markdown.
markdowns = [p.with_suffix(".md").read_text(encoding="utf-8") for p in paths]
results = extractor.extract_batch(
    paths,
    markdown_batch=markdowns,
    preset="fast",
    batch_size=1,           # batch_size > 1 is unsupported in this wrapper; keep at 1
)

See examples/batch.py for a CLI version. batch_size>1 is unsupported in this wrapper: beam-search batching requires the training-time collator (left-padding plus concatenation of the vision tensors), which is out of scope for an inference module.

Lenient scoring helper

score_lenient.py re-scores a JSONL of (image, parameters) predictions against a reference annotations JSONL using unit aliases and date-year normalization. The +0.037 LENIENT lift in the benchmark table comes from this scorer; the model output itself is identical.

python score_lenient.py preds.jsonl annotations_test.jsonl
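To produce a predictions file for the scorer, something like the following could work. The card describes the file only as a JSONL of (image, parameters) records, so the key names `"image"` and `"parameters"` and the helper `write_predictions` are assumptions; check score_lenient.py for the schema it actually reads.

```python
import json

def write_predictions(path, image_names, results):
    """Write one JSON object per line: {"image": ..., "parameters": [...]}."""
    if len(image_names) != len(results):
        raise ValueError("image_names and results must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for name, result in zip(image_names, results):
            record = {"image": str(name), "parameters": result["parameters"]}
            # ensure_ascii=False keeps Russian labels readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With the batch example's variables, a call might look like `write_predictions("preds.jsonl", [p.name for p in paths], results)`.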

Output format

The model emits one parameter per line in pipe-separated sep_labels format:

<|sep_meta|>
name: Interest income|value: 533|date: 2024|unit: millions
name: Foreign-currency transaction loss|value: 89|date: 2023|unit: millions

parser.py (in this repo) converts that to {"parameters": [{...}, ...]}. The parser strips stray <|...|> control-token artifacts before splitting — the model occasionally emits one mid-row, and without this strip a leading < contaminates the previous field. This fix is worth +0.024-0.042 t_f1 on its own.
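The strip-then-split order described above can be sketched as follows. This is an illustrative reimplementation, not the contents of parser.py: the regex, the `FIELD_KEYS` mapping, and the choice to drop rows with missing fields are assumptions.

```python
import re

CONTROL_TOKEN_RE = re.compile(r"<\|[^|>]*\|>")
# Short field labels in the raw output -> keys used in the parsed dict.
FIELD_KEYS = {
    "name": "parameter_name",
    "value": "parameter_value",
    "date": "parameter_date",
    "unit": "parameter_unit",
}

def parse_output(text):
    """Parse sep_labels text into {"parameters": [...]}, skipping malformed rows."""
    cleaned = CONTROL_TOKEN_RE.sub("", text)  # strip before splitting on "|"
    rows = []
    for line in cleaned.splitlines():
        row = {}
        for field in line.split("|"):
            label, sep, value = field.partition(":")
            key = FIELD_KEYS.get(label.strip())
            if sep and key:
                row[key] = value.strip()
        if len(row) == len(FIELD_KEYS):
            rows.append(row)
    return {"parameters": rows}
```

Removing the control tokens first matters because they contain `|` themselves; splitting first would cut a stray `<|sep_rows|>` into pieces and leave a `<` attached to the preceding value.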

Loading details

StructuredExtractor.from_pretrained does three things you'd otherwise need to wire up yourself:

  1. Loads the processor (image processor + chat template) — first tries the uploaded checkpoint, falls back to Qwen/Qwen3-VL-4B-Instruct if the preprocessor configs aren't present.
  2. Swaps in the fine-tuned tokenizer (which has the 4 added special tokens: <|sep_meta|>, <|sep_columns|>, <|sep_rows|>, <|sep_end|>).
  3. Force-injects <|sep_meta|>\n as the assistant-turn prefix before generate(). This token is masked out of training labels — the model never learned to emit it, so we have to prime it.

Hardware

| Preset | Min VRAM (single image) |
|---|---|
| fast (greedy) | ~12 GB |
| quality (beam=4) | ~24 GB |

bf16 on CUDA capability ≥ 8.0, float16 elsewhere. CPU works but is unusably slow for a 4B VLM with beam search.
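The dtype rule above amounts to a single comparison on the CUDA compute capability. A minimal sketch, assuming the capability is passed in as a `(major, minor)` tuple; the helper name `pick_dtype_name` is hypothetical and the wrapper's own selection code may differ.

```python
def pick_dtype_name(cuda_capability):
    """Return "bfloat16" for compute capability >= 8.0, else "float16".

    cuda_capability is a (major, minor) tuple, or None when no GPU is present.
    """
    if cuda_capability is not None and tuple(cuda_capability) >= (8, 0):
        return "bfloat16"
    return "float16"

# With PyTorch installed, the capability would come from:
#   import torch
#   cap = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
#   dtype = getattr(torch, pick_dtype_name(cap))
```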

Limitations

  • Trained on financial-statement tables (RU/EN). Behavior on other domains is unmeasured.
  • Bimodal errors: ~42% of test samples solve well (t_f1 ≥ 0.7), ~34% fail completely (t_f1 < 0.1). Worst failures cluster in specific source documents (multi-table pages where the markdown disambiguator alone isn't enough).
  • Single-seed numbers vary a lot on this small dataset (~0.20 range across 4 seeds). exp93 is reproducible, but don't expect another retrain to land at exactly the same F1.
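To check whether your own data shows the same bimodal pattern, you can bucket per-sample scores with the thresholds quoted above. A small sketch, assuming you already have one t_f1 value per test sample; the function name `bucket_shares` is hypothetical.

```python
def bucket_shares(per_sample_f1, good=0.7, bad=0.1):
    """Return the fraction of samples with F1 >= good, F1 < bad, and in between."""
    n = len(per_sample_f1)
    if n == 0:
        raise ValueError("no scores given")
    solved = sum(score >= good for score in per_sample_f1) / n
    failed = sum(score < bad for score in per_sample_f1) / n
    return {"solved": solved, "failed": failed, "middle": 1.0 - solved - failed}
```

Grouping the failed samples by source document is a natural next step, since the worst failures reportedly cluster there.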

License

Apache-2.0, matching the base model.

Citation / acknowledgements

  • Base model: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
  • Training framework: structured-extractor-train (DoRA r=16 + MLP, 2 epochs, real-only)