structured-extractor-qwen3vl-4b-exp93
A fine-tuned Qwen3-VL-4B-Instruct that extracts structured rows
(name, value, date, unit) from images of financial-statement tables
(Russian + English). This is the exp93 checkpoint — the project's
reproducible gold recipe across multiple seeds.
Benchmarks
Evaluated on a held-out test split of real financial-statement table crops
(no synthetic data in training), with the quality preset
(num_beams=4, repetition_penalty=1.1, min_new_tokens=200):
| Metric | STRICT | LENIENT* |
|---|---|---|
| tuple_f1 | 0.5316 | 0.5686 |
* "LENIENT" normalizes unit synonyms (million ↔ millions, млн руб. ↔ млн руб) and accepts a year match when reference and prediction differ only by date precision. See score_lenient.py in this repo.
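For orientation, a tuple-level F1 can be computed along the lines below. This is a minimal sketch that assumes tuple_f1 is a set-overlap F1 over exact (name, value, date, unit) tuples; this card does not spell out the metric, and the function name and toy rows are illustrative rather than taken from the repo's eval code.

```python
from typing import Iterable, Tuple

Row = Tuple[str, str, str, str]  # (name, value, date, unit)


def tuple_f1(predicted: Iterable[Row], reference: Iterable[Row]) -> float:
    """Set-overlap F1 over exact tuples (assumed definition, not the repo's)."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        # Both empty counts as a perfect match; exactly one empty is a miss.
        return 1.0 if pred == ref else 0.0
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy rows for illustration only (not taken from the test split).
reference = [
    ("Interest income", "533", "2024", "millions"),
    ("Interest income", "498", "2023", "millions"),
]
predicted = [
    ("Interest income", "533", "2024", "millions"),
    ("Interest income", "498", "2023", "million"),  # unit differs: strict miss
]
score = tuple_f1(predicted, reference)
```

Under this strict definition the second toy row counts as an error because of the unit spelling alone, which is the kind of mismatch the LENIENT column forgives.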
For comparison, on the same eval:
| Variant | STRICT | LENIENT |
|---|---|---|
| exp93 b4+rp1.1 (this repo) | 0.5316 | 0.5686 |
| exp93 greedy | ~0.50 | ~0.55 |
| exp85 greedy (DoRA+MLP without wd=0.05) | 0.5154 | n/a |
A best-of-3-seeds checkpoint (exp106) reaches LENIENT 0.5923, but it's a lucky training trajectory — not reproducible from this recipe alone, so this repo ships the recipe-reproducible exp93 instead.
⚠️ Earlier versions of this project reported t_f1 ~0.82. Those numbers were inflated by a target-leakage bug in the eval pipeline (the answer was in the model's input). The numbers above are real zero-shot, measured with a leak-free eval (PageDataset(..., eval_mode=True)).
Recipe
- Base: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
- Adapter: DoRA-style LoRA, r=16, alpha=32, dropout=0.05
- Targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP)
- Training data: real financial tables only (~1.5k train, no synthetic augmentation)
- Input: pre-cropped table image + markdown OCR of that table + date-column hint
- Schedule: 2 epochs, AdamW, lr=1e-4, weight_decay=0.05, warmup_ratio=0.05
- Saved as a full merged model (8.3 GB safetensors), not a PEFT adapter.
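The recipe values above could be written down as keyword arguments for the usual Hugging Face training stack. This is only a sketch: it assumes the adapter was built with PEFT's LoraConfig (with "DoRA-style" mapped to its use_dora flag) and that the schedule was run through transformers.TrainingArguments; neither is confirmed by this card, and the actual training code lives in structured-extractor-train.

```python
# Hypothetical mapping of the recipe onto peft.LoraConfig keyword arguments.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "use_dora": True,  # assumption: "DoRA-style" corresponds to PEFT's use_dora
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

# Hypothetical mapping onto transformers.TrainingArguments keyword arguments.
training_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 1e-4,
    "weight_decay": 0.05,
    "warmup_ratio": 0.05,
    "optim": "adamw_torch",  # assumption: PyTorch's AdamW implementation
}
```

With peft and transformers installed, these would be passed as LoraConfig(**lora_kwargs) and TrainingArguments(output_dir=..., **training_kwargs).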
Quick start
pip install -r requirements.txt
from pathlib import Path

from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp93"
)

# Markdown OCR of the same table; the file name here is only an example.
table_md_text = Path("table_crop.md").read_text(encoding="utf-8")

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)
for row in result["parameters"]:
    print(row)
    # {"parameter_name": "Interest income", "parameter_value": "533",
    #  "parameter_date": "2024", "parameter_unit": "millions"}
    # ...
Required inputs
The model was trained with three input streams. All three matter:
| Input | Status | Notes |
|---|---|---|
| Table image | Required | Pre-cropped to a single table region; long-side resized to 1344px (handled internally) |
| Markdown OCR of that table | Strongly recommended | The per-sample disambiguator on multi-table pages. Without it the model picks an arbitrary table and tuple-F1 drops to near zero. |
| date_columns hint | Optional | List of date-column headers; helps when markdown is noisy |
The table image must be cropped to the target table, not a full page. The training data uses single-table crops; full-page images at inference are untested and likely degrade quality.
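To make the "long-side resized to 1344px" step concrete, the target size can be computed as follows. This is a sketch under two assumptions that this card does not confirm: the aspect ratio is preserved, and images smaller than the target are scaled up as well. The function name is hypothetical; the wrapper does its own resizing internally.

```python
from typing import Tuple


def long_side_size(width: int, height: int, target: int = 1344) -> Tuple[int, int]:
    """Return (width, height) scaled so the longer side equals `target`."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = long_side_size(2688, 1344)
portrait = long_side_size(1000, 2000)
```

There is no need to pre-resize crops before calling extract(); the sketch is only meant to show what resolution the model ends up seeing.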
Best-quality pipeline
result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)
preset="quality" is num_beams=4, length_penalty=1.0, min_new_tokens=200, repetition_penalty=1.1, max_new_tokens=4096. This is the configuration that
yields STRICT 0.5316 / LENIENT 0.5686.
Fast pipeline (greedy)
result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="fast",
)
Greedy decoding (num_beams=1). About 3-4× faster than quality with a
~0.03-0.05 t_f1 drop. Use this when latency or throughput matters more than
the last point of F1.
Batch inference
from pathlib import Path

from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp93"
)
paths = sorted(Path("tables/").glob("*.png"))
# Each image is expected to have a sibling .md file with its markdown OCR.
markdowns = [p.with_suffix(".md").read_text(encoding="utf-8") for p in paths]
results = extractor.extract_batch(
    paths,
    markdown_batch=markdowns,
    preset="fast",
    batch_size=1,  # batch_size > 1 is unsupported by this wrapper; keep at 1
)
See examples/batch.py for a CLI version. batch_size>1 is unsupported in
this wrapper because beam-search batching requires the training-time
collator (left-padding plus concatenation of vision tensors), which is out
of scope for an inference module.
Lenient scoring helper
score_lenient.py re-scores a JSONL of (image, parameters) predictions
against a reference annotations JSONL using unit aliases and date-year
normalization. The +0.037 LENIENT lift in the benchmark table comes from
this scorer; the model output itself is identical.
python score_lenient.py preds.jsonl annotations_test.jsonl
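The two normalizations can be pictured roughly as below. This is a sketch, not the contents of score_lenient.py: the alias table only holds the two pairs named in the benchmark footnote, the helper names are hypothetical, and "differ only by date precision" is interpreted as "one side is a bare year and the other is a fuller date in that year".

```python
import re
from typing import Optional

# Illustrative alias table; the real list lives in score_lenient.py.
UNIT_ALIASES = {
    "million": "millions",
    "млн руб.": "млн руб",
}

_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def normalize_unit(unit: str) -> str:
    """Lower-case, trim, and map known synonyms onto one spelling."""
    key = unit.strip().lower()
    return UNIT_ALIASES.get(key, key)


def year_of(date: str) -> Optional[str]:
    """Return the first standalone 4-digit group in the string, if any."""
    match = _YEAR.search(date)
    return match.group(1) if match else None


def dates_match_lenient(pred: str, ref: str) -> bool:
    """Exact match, or a bare year matched against a fuller date in that year."""
    pred, ref = pred.strip(), ref.strip()
    if pred == ref:
        return True
    if not any(re.fullmatch(r"\d{4}", d) for d in (pred, ref)):
        return False  # both sides carry full dates, so they must match exactly
    return year_of(pred) == year_of(ref)
```

Two different full dates in the same year still count as a mismatch here, so the sketch only forgives a loss of precision, not a wrong date.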
Output format
The model emits one parameter per line in pipe-separated sep_labels format:
<|sep_meta|>
name: Interest income|value: 533|date: 2024|unit: millions
name: Foreign-currency transaction loss|value: 89|date: 2023|unit: millions
parser.py (in this repo) converts that to {"parameters": [{...}, ...]}.
The parser strips stray <|...|> control-token artifacts before splitting —
the model occasionally emits one mid-row, and without this strip a leading
< contaminates the previous field. This fix is worth +0.024-0.042 t_f1 on
its own.
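The strip-then-split order described above can be sketched as follows. This is not the repo's parser.py, only an illustration of the idea: the function name and regex are assumptions, and the output keys follow the dictionary shown in the Quick start.

```python
import re
from typing import Dict, List

# Matches control-token artifacts such as <|sep_meta|> or <|sep_rows|>.
_CONTROL_TOKEN = re.compile(r"<\|[^|>]*\|>")

_FIELD_MAP = {
    "name": "parameter_name",
    "value": "parameter_value",
    "date": "parameter_date",
    "unit": "parameter_unit",
}


def parse_sep_labels(text: str) -> Dict[str, List[Dict[str, str]]]:
    """Convert pipe-separated rows into {"parameters": [...]}."""
    # Remove control tokens first, so their "|" characters never reach split().
    cleaned = _CONTROL_TOKEN.sub("", text)
    rows: List[Dict[str, str]] = []
    for line in cleaned.splitlines():
        row: Dict[str, str] = {}
        for field in line.strip().split("|"):
            key, sep, value = field.partition(":")
            if sep and key.strip() in _FIELD_MAP:
                row[_FIELD_MAP[key.strip()]] = value.strip()
        if row:
            rows.append(row)
    return {"parameters": rows}


raw = (
    "<|sep_meta|>\n"
    "name: Interest income<|sep_rows|>|value: 533|date: 2024|unit: millions\n"
    "name: Foreign-currency transaction loss|value: 89|date: 2023|unit: millions\n"
)
parsed = parse_sep_labels(raw)
```

If the split ran before the strip, the stray token in the first row would be cut at its own pipe characters and leave a dangling < on the name field.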
Loading details
StructuredExtractor.from_pretrained does three things you'd otherwise need
to wire up yourself:
- Loads the processor (image processor + chat template): first tries the uploaded checkpoint, falls back to Qwen/Qwen3-VL-4B-Instruct if the preprocessor configs aren't present.
- Swaps in the fine-tuned tokenizer (which has the 4 added special tokens: <|sep_meta|>, <|sep_columns|>, <|sep_rows|>, <|sep_end|>).
- Force-injects <|sep_meta|>\n as the assistant-turn prefix before generate(). This token is masked out of training labels, so the model never learned to emit it and it has to be primed.
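The prefix priming in the last step amounts to plain string handling around generate(). The sketch below assumes the chat template has already been rendered to text with add_generation_prompt=True, so the prompt ends with the assistant header; both helper names are hypothetical and not part of the wrapper's API.

```python
ASSISTANT_PREFIX = "<|sep_meta|>\n"


def prime_prompt(rendered_prompt: str) -> str:
    """Append the forced prefix so generation continues right after it."""
    return rendered_prompt + ASSISTANT_PREFIX


def restore_prefix(generated_text: str) -> str:
    """Re-attach the prefix so downstream parsing sees the full assistant turn."""
    if generated_text.startswith(ASSISTANT_PREFIX):
        return generated_text
    return ASSISTANT_PREFIX + generated_text


# Assumed tail of a rendered Qwen-style (ChatML) prompt.
prompt_tail = "<|im_start|>assistant\n"
primed = prime_prompt(prompt_tail)
```

Because the prefix is part of the input rather than the output, only the newly generated tokens need decoding, and the prefix can be re-attached afterwards if the parser expects it.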
Hardware
| Preset | Min VRAM (single image) |
|---|---|
| fast (greedy) | ~12 GB |
| quality (beam=4) | ~24 GB |
bf16 on CUDA capability ≥ 8.0, float16 elsewhere. CPU works but is unusably slow for a 4B VLM with beam search.
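The dtype rule can be expressed as a small pure function. This is a sketch with a hypothetical name; in practice the capability tuple would come from torch.cuda.get_device_capability() when torch.cuda.is_available() is true, and the returned name would be resolved with getattr(torch, name).

```python
from typing import Optional, Tuple


def pick_dtype_name(cuda_capability: Optional[Tuple[int, int]]) -> str:
    """Return "bfloat16" for CUDA capability >= 8.0, otherwise "float16"."""
    if cuda_capability is not None and cuda_capability >= (8, 0):
        return "bfloat16"
    # Older GPUs, and the no-CUDA case (passed as None), fall back to float16.
    return "float16"


ampere = pick_dtype_name((8, 0))
turing = pick_dtype_name((7, 5))
no_cuda = pick_dtype_name(None)
```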
Limitations
- Trained on financial-statement tables (RU/EN). Behavior on other domains is unmeasured.
- Bimodal errors: ~42% of test samples solve well (t_f1 ≥ 0.7), ~34% fail completely (t_f1 < 0.1). Worst failures cluster in specific source documents (multi-table pages where the markdown disambiguator alone isn't enough).
- Single-seed numbers vary a lot on this small dataset (~0.20 range across 4 seeds). exp93 is reproducible, but don't expect another retrain to land at exactly the same F1.
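To check how bimodal the errors are on your own data, per-sample scores can be bucketed with the same thresholds (t_f1 ≥ 0.7 and t_f1 < 0.1). The sketch below uses a hypothetical helper and toy scores; it is not part of the repo.

```python
from typing import Dict, Sequence


def bucket_shares(
    scores: Sequence[float], good: float = 0.7, failed: float = 0.1
) -> Dict[str, float]:
    """Share of samples that solve well, fail completely, or land in between."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    solved = sum(s >= good for s in scores) / n
    fail = sum(s < failed for s in scores) / n
    return {"solved": solved, "failed": fail, "middle": 1.0 - solved - fail}


# Toy per-sample t_f1 values for illustration only.
shares = bucket_shares([0.9, 0.8, 0.0, 0.05, 0.4])
```

A large "failed" share concentrated in a few source documents points at the multi-table disambiguation problem rather than at a general quality issue.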
License
Apache-2.0, matching the base model.
Citation / acknowledgements
- Base model: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
- Training framework: structured-extractor-train (DoRA r=16 + MLP, 2 epochs, real-only)