structured-extractor-qwen3vl-4b-exp232
A fine-tuned Qwen3-VL-4B-Instruct that extracts structured rows
(name, value, date, unit) from images of financial-statement tables
(Russian + English).
This is the exp232 checkpoint, the project's current gold. Shipped as a PEFT (DoRA) adapter (~194 MB) instead of a merged model; see the Why an adapter? section below for the +0.13 t_f1 this design choice unlocks.
Benchmarks
Evaluated on a held-out test split of real financial-statement table crops,
with the quality preset (num_beams=4, repetition_penalty=1.1, length_penalty=1.0, min_new_tokens=200, max_new_tokens=4096):
| Metric | Value |
|---|---|
| tuple_f1 (STRICT) | 0.6637 |
| parameter_f1 | 0.750 |
| value_accuracy | 0.820 |
| date_accuracy | 0.788 |
| unit_accuracy | 0.740 |
| count_accuracy | 0.78 |
| exact_match | 0.40 |
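The card does not spell out how tuple_f1 (STRICT) is computed. The following is a minimal sketch under the assumption that a predicted row counts as correct only when all four fields match a reference row exactly, with duplicates matched one-to-one. The function name `strict_tuple_f1` and the handling of empty inputs are illustrative choices, not the project's actual scorer.

```python
from collections import Counter

# Field names as they appear in the wrapper's output rows.
FIELDS = ("parameter_name", "parameter_value", "parameter_date", "parameter_unit")


def strict_tuple_f1(predicted, reference):
    """F1 over exact (name, value, date, unit) matches, counting duplicates."""
    pred = Counter(tuple(row.get(f, "") for f in FIELDS) for row in predicted)
    ref = Counter(tuple(row.get(f, "") for f in FIELDS) for row in reference)
    # Multiset intersection: each reference row can be matched at most once.
    true_positives = sum((pred & ref).values())
    if true_positives == 0:
        return 0.0  # assumption: no matches (including empty inputs) scores 0
    precision = true_positives / sum(pred.values())
    recall = true_positives / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference_rows = [
    {"parameter_name": "Interest income", "parameter_value": "533",
     "parameter_date": "2024", "parameter_unit": "millions"},
    {"parameter_name": "Interest income", "parameter_value": "498",
     "parameter_date": "2023", "parameter_unit": "millions"},
]
# Hypothetical prediction: the second row has the wrong unit.
predicted_rows = [dict(reference_rows[0]), dict(reference_rows[1], parameter_unit="thousands")]
score = strict_tuple_f1(predicted_rows, reference_rows)
```

Under this definition a single wrong field costs the whole row, which is one way to read why the per-field accuracies above sit well above the tuple-level score.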
3-seed stability (this recipe re-run with seeds 42 / 2024 / 314):
| Decoder | Mean | Std |
|---|---|---|
| greedy | 0.624 | ±0.007 |
| b4 + rp1.1 (quality) | 0.660 | ±0.003 |
Seed variance is ~11× tighter than the previous merged-save recipe: saving the adapter unmerged also stabilizes seed-to-seed jitter.
Why an adapter?
The earlier recipe (exp93, also DoRA + MLP, same data, same hyperparameters)
was saved as a merged model and scored 0.5316. exp232 is the same recipe
except save_adapter_only=true, and scores 0.6637: a +0.132 lift
from a single config flag.
Root cause: DoRA's merge_and_unload followed by save-to-bf16 silently
degrades the directional component of the DoRA decomposition. Loading the
unmerged adapter and merging in memory at inference time recovers the full
precision. Confirmed across multiple seeds. Discussed in the
structured-extractor-train project notes (2026-05-26).
This is also why this repo is library_name: peft; the file layout is the
standard PEFT one (adapter_config.json + adapter_model.safetensors)
plus an extra_trained_weights.pt for non-LoRA trained pieces (new-token
embed/lm_head rows + frozen vision merger snapshot).
⚠️ Earlier versions of this project reported t_f1 ~0.82. Those numbers were inflated by a target-leakage bug in the eval pipeline (the answer was in the model's input). The numbers above are real zero-shot, measured with a leak-free eval (PageDataset(..., eval_mode=True)).
Recipe
- Base: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
- Adapter: DoRA-style LoRA, r=16, alpha=32, dropout=0.05
- Targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP)
- Training data: real financial tables only (~1.5k train, no synthetic augmentation)
- Input: pre-cropped table image + markdown OCR of that table + date-column hint
- Schedule: 2 epochs, AdamW, lr=1e-4, weight_decay=0.05, warmup_ratio=0.05, seed=42
- Save mode: save_adapter_only=true; adapter weights persisted unmerged in fp32-equivalent precision.
Quick start
```bash
pip install -r requirements.txt  # torch, transformers, accelerate, peft, huggingface_hub, pillow
```

```python
from pathlib import Path

from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp232"
)

# Markdown OCR of the same table; the file name here is only an example.
table_md_text = Path("table_crop.md").read_text(encoding="utf-8")

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)
for row in result["parameters"]:
    print(row)
# {"parameter_name": "Interest income", "parameter_value": "533",
#  "parameter_date": "2024", "parameter_unit": "millions"}
```
The first call downloads the base model (Qwen/Qwen3-VL-4B-Instruct, ~8 GB) and the
adapter (this repo, ~194 MB). The loader then:
- Loads the base in bf16 (or fp16 on older GPUs).
- Resizes token embeddings to fit the fine-tuned tokenizer (4 added sep tokens).
- Applies the DoRA adapter via PEFT and merges it.
- Restores new-token embed/lm_head rows + visual-merger snapshot from
extra_trained_weights.pt.
All four steps are handled inside StructuredExtractor.from_pretrained.
Required inputs
| Input | Status | Notes |
|---|---|---|
| Table image | Required | Pre-cropped to a single table region; long-side resized to 1344px (handled internally) |
| Markdown OCR of that table | Required for benchmark quality | The per-sample disambiguator. Without it the model picks an arbitrary table and tuple-F1 collapses to near zero. The VLM essentially copies cell text from the markdown; the image alone is insufficient. |
| date_columns hint | Optional | List of date-column headers; helps when markdown is noisy |
The table image must be cropped to the target table, not a full page. Training used single-table crops; full-page inputs at inference time are untested.
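The wrapper resizes the long side to 1344px internally, so callers do not need to do this themselves. For readers who want to see what that implies for a given crop, here is a small sketch of the size arithmetic, assuming the aspect ratio is preserved and the image is scaled in both directions to hit exactly 1344px. Whether the real loader upscales small crops is not stated in this card, and `long_side_size` is a hypothetical helper name.

```python
def long_side_size(width: int, height: int, long_side: int = 1344) -> tuple[int, int]:
    """Return (new_width, new_height) with the longer edge equal to long_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = long_side / max(width, height)
    # Keep at least one pixel on the short side for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = long_side_size(2688, 1344)
portrait = long_side_size(1000, 2000)
```

The resulting pair can be passed to an image library's resize call, for example Pillow's `Image.resize`, if the same geometry is needed outside the wrapper.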
Best-quality pipeline
```python
result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)
```
preset="quality" = num_beams=4, length_penalty=1.0, min_new_tokens=200, repetition_penalty=1.1, max_new_tokens=4096, do_sample=False. This is the
configuration that yields STRICT 0.6637.
Fastest pipeline (greedy)
```python
result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    preset="fast",
)
```
Greedy decoding (num_beams=1). About 3-4× faster than quality with a
~0.04 t_f1 drop (STRICT 0.6203). Use this when latency or throughput matters.
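The two presets can be thought of as named bundles of generation arguments. The sketch below writes them out as plain dictionaries. The quality values are copied from this card; for fast, only num_beams=1 is documented, so the remaining fast settings are deliberately left out rather than guessed. The names `PRESETS` and `generation_kwargs` are illustrative and may not match what inference.py uses.

```python
PRESETS = {
    # Values taken from the card's description of preset="quality".
    "quality": {
        "num_beams": 4,
        "length_penalty": 1.0,
        "min_new_tokens": 200,
        "repetition_penalty": 1.1,
        "max_new_tokens": 4096,
        "do_sample": False,
    },
    # Only num_beams=1 (greedy) is stated; do_sample=False follows from greedy decoding.
    "fast": {
        "num_beams": 1,
        "do_sample": False,
    },
}


def generation_kwargs(preset: str) -> dict:
    """Return a copy of the keyword arguments for a named preset."""
    try:
        return dict(PRESETS[preset])
    except KeyError:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")


quality_kwargs = generation_kwargs("quality")
fast_kwargs = generation_kwargs("fast")
```

All keys shown are standard Hugging Face `generate()` arguments, so a dictionary like this could be unpacked into a `model.generate(**inputs, **kwargs)` call when working with the model outside the wrapper.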
Batch inference
```python
from pathlib import Path

from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp232"
)
paths = sorted(Path("tables/").glob("*.png"))
# Each image is expected to have a sibling .md file holding its markdown OCR.
markdowns = [p.with_suffix(".md").read_text(encoding="utf-8") for p in paths]
results = extractor.extract_batch(
    paths,
    markdown_batch=markdowns,
    preset="fast",
    batch_size=1,  # batch_size > 1 is unsupported in this wrapper; keep at 1
)
```
See examples/batch.py for a CLI version. batch_size>1 is unsupported in
this wrapper because beam-search batching requires the training-time
collator (left-padding + cat of vision tensors), out of scope for the
inference module.
Lenient scoring helper
score_lenient.py re-scores a JSONL of (image, parameters) predictions
against a reference annotations JSONL using unit aliases (million → millions,
млн руб. → млн руб) and date-year normalization. It is a pure metric helper:
the model output itself is identical; the lift comes from accepting
orthographic equivalents.
```bash
python score_lenient.py preds.jsonl annotations_test.jsonl
```
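To make the idea of lenient matching concrete, here is a minimal sketch of the two normalizations described above. It is not the code in score_lenient.py: the alias table contains only the two examples named in this card, the direction of each alias (singular to plural, trailing dot dropped) is an assumption, and the year rule simply takes the first four-digit year starting with 19 or 20.

```python
import re

# Only the two aliases mentioned in the card; the real helper likely has more.
UNIT_ALIASES = {
    "million": "millions",
    "млн руб.": "млн руб",
}

# A four-digit year not embedded in a longer run of digits.
_YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def normalize_unit(unit: str) -> str:
    """Lowercase, collapse whitespace, then map known aliases to one spelling."""
    cleaned = " ".join(unit.lower().split())
    return UNIT_ALIASES.get(cleaned, cleaned)


def normalize_date(date: str) -> str:
    """Reduce a date string to its year when one is present."""
    match = _YEAR.search(date)
    return match.group(0) if match else date.strip()


same_unit = normalize_unit("Million") == normalize_unit("millions")
same_date = normalize_date("31.12.2024") == normalize_date("2024")
```

Applying both functions to predicted and reference rows before comparison lets spelling variants count as matches while leaving names and values strict.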
Output format
The model emits one parameter per line in pipe-separated sep_labels format:
```text
<|sep_meta|>
name: Interest income|value: 533|date: 2024|unit: millions
name: Foreign-currency transaction loss|value: 89|date: 2023|unit: millions
```
parser.py converts that to {"parameters": [{...}, ...]} and strips
stray <|...|> control-token artifacts before splitting. The model
occasionally emits one mid-row; without this strip a leading < contaminates
the previous field. The fix is worth +0.024-0.042 t_f1 on its own.
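The parsing step described above can be sketched in a few lines. This is an illustration of the technique rather than the contents of parser.py: the regex, the `parse_sep_labels` name, and the mapping from short keys to `parameter_*` keys are assumptions based on the output examples in this card.

```python
import re

# Matches control-token artifacts such as <|sep_meta|> anywhere in the text.
CONTROL_TOKEN = re.compile(r"<\|[^|>]*\|>")

KEY_MAP = {
    "name": "parameter_name",
    "value": "parameter_value",
    "date": "parameter_date",
    "unit": "parameter_unit",
}


def parse_sep_labels(text: str) -> dict:
    """Convert pipe-separated 'key: value' lines into structured rows."""
    # Strip control tokens first so a mid-row token cannot leak into a field.
    cleaned = CONTROL_TOKEN.sub("", text)
    rows = []
    for line in cleaned.splitlines():
        row = {}
        for field in line.split("|"):
            # Split on the first colon only, so values may contain colons.
            key, sep, value = field.partition(":")
            key = key.strip()
            if sep and key in KEY_MAP:
                row[KEY_MAP[key]] = value.strip()
        if row:
            rows.append(row)
    return {"parameters": rows}


raw = (
    "<|sep_meta|>\n"
    "name: Interest income|value: 533|date: 2024|unit: millions\n"
    "name: Foreign-currency transaction loss|value: 89<|sep_meta|>|date: 2023|unit: millions\n"
)
parsed = parse_sep_labels(raw)
```

The second input line carries a stray token in the middle of the row to show the case the strip is meant to handle. A limitation of this sketch is that a cell value containing a literal `|` would be split into two fields.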
File layout
```text
.
├── adapter_config.json          # PEFT/DoRA config
├── adapter_model.safetensors    # adapter weights (~190 MB)
├── extra_trained_weights.pt     # new-token embed/lm_head rows + visual_merger
├── chat_template.jinja          # qwen3-vl chat template
├── tokenizer.json
├── tokenizer_config.json
├── inference.py                 # StructuredExtractor wrapper
├── parser.py                    # sep_labels → structured rows (with regex strip)
├── score_lenient.py             # lenient F1 helper
├── README.md                    # this file
├── LICENSE                      # Apache-2.0
├── requirements.txt
└── examples/
    ├── single_quality.py
    ├── single_fast.py
    └── batch.py
```
Hardware
| Preset | Min VRAM (single image) |
|---|---|
| fast (greedy) | ~12 GB |
| quality (beam=4) | ~24 GB |
bf16 on CUDA capability ≥ 8.0, fp16 elsewhere. CPU works but is unusably slow for a 4B VLM with beam search.
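The dtype rule above can be expressed as a small function. This is a sketch assuming the compute capability is available as a `(major, minor)` tuple, which is the shape returned by `torch.cuda.get_device_capability()`. The helper name `pick_dtype_name` is hypothetical, and it returns a string so the logic can be read without a GPU or PyTorch installed.

```python
def pick_dtype_name(cuda_capability: tuple[int, int]) -> str:
    """Return 'bfloat16' for compute capability >= 8.0, otherwise 'float16'."""
    # Tuple comparison is lexicographic: (8, 6) >= (8, 0) and (7, 5) < (8, 0).
    return "bfloat16" if tuple(cuda_capability) >= (8, 0) else "float16"


ampere = pick_dtype_name((8, 0))
turing = pick_dtype_name((7, 5))
```

In a PyTorch program the returned name could be resolved with `getattr(torch, name)` before loading the model.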
Limitations
- Trained on financial-statement tables (RU/EN). Behavior on other domains is unmeasured.
- Bimodal errors: ~42% of test samples solve well (t_f1 ≥ 0.7), ~34% fail completely (t_f1 < 0.1). Average F1 obscures this. Worst failures cluster in specific source documents with dense multi-table pages where even the markdown disambiguator isn't enough.
- Markdown OCR is a hard requirement. The model cannot reliably OCR table cells from the image alone โ it leans heavily on the markdown for cell text. Production pipelines need an upstream OCR step.
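Because the error distribution is bimodal, it can be more informative to report bucket shares alongside the mean when evaluating on new data. The sketch below uses the same thresholds as the bimodal-errors bullet (solved at t_f1 ≥ 0.7, failed at t_f1 < 0.1). The function name and the sample scores are made up for illustration and are not taken from the project's evaluation.

```python
def bucket_shares(scores, solved_at=0.7, failed_below=0.1):
    """Return the fraction of samples that are solved, failed, or in between."""
    total = len(scores)
    if total == 0:
        raise ValueError("scores must not be empty")
    solved = sum(1 for s in scores if s >= solved_at)
    failed = sum(1 for s in scores if s < failed_below)
    return {
        "solved": solved / total,
        "failed": failed / total,
        "middle": (total - solved - failed) / total,
    }


# Hypothetical per-sample tuple F1 values, not real results.
example_scores = [0.9, 0.8, 0.0, 0.05, 0.5]
shares = bucket_shares(example_scores)
mean_score = sum(example_scores) / len(example_scores)
```

In this toy example the mean is 0.45 even though only one of five samples actually scores near that value, which is the effect the limitation describes.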
License
Apache-2.0, matching the base model.
Citation / acknowledgements
- Base model: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
- Training framework: structured-extractor-train (DoRA r=16 + MLP, 2 epochs, real-only, save_adapter_only=true)
- Earlier checkpoint: structured-extractor-qwen3vl-4b-exp93 (same recipe, merged save, STRICT 0.5316)