Libraries

How to use Glazkov/structured-extractor-qwen3vl-4b-exp93 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Glazkov/structured-extractor-qwen3vl-4b-exp93")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Glazkov/structured-extractor-qwen3vl-4b-exp93")
model = AutoModelForImageTextToText.from_pretrained("Glazkov/structured-extractor-qwen3vl-4b-exp93")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Glazkov/structured-extractor-qwen3vl-4b-exp93 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Glazkov/structured-extractor-qwen3vl-4b-exp93"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Glazkov/structured-extractor-qwen3vl-4b-exp93",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Glazkov/structured-extractor-qwen3vl-4b-exp93

SGLang

How to use Glazkov/structured-extractor-qwen3vl-4b-exp93 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Glazkov/structured-extractor-qwen3vl-4b-exp93" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Glazkov/structured-extractor-qwen3vl-4b-exp93",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Glazkov/structured-extractor-qwen3vl-4b-exp93" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Glazkov/structured-extractor-qwen3vl-4b-exp93",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Glazkov/structured-extractor-qwen3vl-4b-exp93 with Docker Model Runner:
```
docker model run hf.co/Glazkov/structured-extractor-qwen3vl-4b-exp93
```

structured-extractor-qwen3vl-4b-exp93

A fine-tuned Qwen3-VL-4B-Instruct that extracts structured rows (name, value, date, unit) from images of financial-statement tables (Russian + English). This is the exp93 checkpoint — the project's reproducible gold recipe across multiple seeds.

Benchmarks

Evaluated on a held-out test split of real financial-statement table crops (no synthetic data in training), with the quality preset (num_beams=4, repetition_penalty=1.1, min_new_tokens=200):

| Metric | STRICT | LENIENT* |
| --- | --- | --- |
| tuple_f1 | 0.5316 | 0.5686 |

* "LENIENT" normalizes unit synonyms (million ↔ millions, млн руб. ↔ млн руб) and accepts a year match when reference and prediction differ only by date precision. See score_lenient.py in this repo.
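As a rough illustration of the STRICT metric, the sketch below computes an F1 score over exact-match (name, value, date, unit) tuples. The function name `tuple_f1`, the multiset counting, and the handling of empty inputs are assumptions made for this example; the repo's own scorer may match and aggregate differently.

```python
from collections import Counter

# One extracted row: (name, value, date, unit). This layout is an assumption.
Row = tuple[str, str, str, str]


def tuple_f1(predicted: list[Row], reference: list[Row]) -> float:
    """F1 over exact-match rows, counted as multisets (hypothetical scorer)."""
    if not predicted and not reference:
        return 1.0  # assumption: two empty sets count as a perfect match
    pred_counts, ref_counts = Counter(predicted), Counter(reference)
    # Multiset intersection: each reference row can be matched at most once.
    true_positives = sum((pred_counts & ref_counts).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


reference = [
    ("Interest income", "533", "2024", "millions"),
    ("Foreign-currency transaction loss", "89", "2023", "millions"),
]
predicted = [
    ("Interest income", "533", "2024", "millions"),
    ("Foreign-currency transaction loss", "98", "2023", "millions"),  # wrong value
]
print(tuple_f1(predicted, reference))
```

A single wrong field makes the whole tuple a miss, which is why unit spelling or date precision differences can move the strict score noticeably.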

For comparison, on the same eval:

| Variant | STRICT | LENIENT |
| --- | --- | --- |
| exp93 b4+rp1.1 (this repo) | 0.5316 | 0.5686 |
| exp93 greedy | ~0.50 | ~0.55 |
| exp85 greedy (DoRA+MLP without `wd=0.05`) | 0.5154 | — |

A best-of-3-seeds checkpoint (exp106) reaches LENIENT 0.5923, but it's a lucky training trajectory — not reproducible from this recipe alone, so this repo ships the recipe-reproducible exp93 instead.

⚠️ Earlier versions of this project reported t_f1 ~0.82. Those numbers were inflated by a target-leakage bug in the eval pipeline (the answer was in the model's input). The numbers above are measured with a leak-free eval (PageDataset(..., eval_mode=True)), so the model never sees the reference answer.

Recipe

Base: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
Adapter: DoRA-style LoRA, r=16, alpha=32, dropout=0.05
Targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP)
Training data: real financial tables only (~1.5k train, no synthetic augmentation)
Input: pre-cropped table image + markdown OCR of that table + date-column hint
Schedule: 2 epochs, AdamW, lr=1e-4, weight_decay=0.05, warmup_ratio=0.05
Saved as a full merged model (8.3 GB safetensors), not a PEFT adapter.
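For readers who want to map the recipe onto common tooling, here is a minimal sketch that collects the hyperparameters listed above into plain dictionaries. The dictionary names, the use of `peft.LoraConfig` and `transformers.TrainingArguments` keyword names, and the `optim` value are assumptions; the actual training framework may wire these up differently.

```python
# Adapter hyperparameters from the recipe, keyed by peft.LoraConfig argument names.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "use_dora": True,  # DoRA variant of LoRA
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

# Schedule from the recipe, keyed by transformers.TrainingArguments argument names.
training_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 1e-4,
    "weight_decay": 0.05,
    "warmup_ratio": 0.05,
    "optim": "adamw_torch",  # assumption: the recipe only says "AdamW"
}

# With peft and transformers installed, these would typically be used as:
#   peft.LoraConfig(**lora_kwargs)
#   transformers.TrainingArguments(output_dir="...", **training_kwargs)
print(lora_kwargs["lora_alpha"] / lora_kwargs["r"])  # LoRA scaling factor alpha / r
```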

Quick start

pip install -r requirements.txt

from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp93"
)

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)

for row in result["parameters"]:
    print(row)
# {"parameter_name": "Interest income", "parameter_value": "533",
#  "parameter_date": "2024", "parameter_unit": "millions"}
# ...

Required inputs

The model was trained with three input streams. All three matter:

| Input | Status | Notes |
| --- | --- | --- |
| Table image | Required | Pre-cropped to a single table region; long-side resized to 1344px (handled internally) |
| Markdown OCR of that table | Strongly recommended | The per-sample disambiguator on multi-table pages. Without it the model picks an arbitrary table and tuple-F1 drops to near zero. |
| `date_columns` hint | Optional | List of date-column headers; helps when markdown is noisy |

The table image must be cropped to the target table, not a full page. The training data uses single-table crops; full-page images at inference are untested and likely degrade quality.
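The wrapper handles resizing internally, but the arithmetic behind "long-side resized to 1344px" can be sketched as follows. The function name `long_side_resize`, the rounding, and the choice to upscale small crops are assumptions for illustration, not the wrapper's confirmed behavior.

```python
def long_side_resize(width: int, height: int, target: int = 1344) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side equals `target`.

    Assumption: aspect ratio is preserved and smaller crops are upscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(long_side_resize(2688, 1000))  # wide table crop, downscaled
print(long_side_resize(672, 300))    # small crop, upscaled under this assumption
```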

Best-quality pipeline

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)

preset="quality" is num_beams=4, length_penalty=1.0, min_new_tokens=200, repetition_penalty=1.1, max_new_tokens=4096. This is the configuration that yields STRICT 0.5316 / LENIENT 0.5686.

Fastest pipeline (greedy)

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="fast",
)

Greedy decoding (num_beams=1). About 3-4× faster than quality with a ~0.03-0.05 t_f1 drop. Use this when latency or throughput matters more than the last point of F1.
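The two presets can be thought of as named sets of `generate()` keyword arguments. The sketch below is one way to express that; the names `GENERATION_PRESETS` and `generation_kwargs` are hypothetical, and the card only states `num_beams=1` for the fast preset, so its remaining settings are left out rather than guessed.

```python
GENERATION_PRESETS = {
    # Values as documented for preset="quality".
    "quality": {
        "num_beams": 4,
        "length_penalty": 1.0,
        "min_new_tokens": 200,
        "repetition_penalty": 1.1,
        "max_new_tokens": 4096,
    },
    # Greedy decoding: one beam and no sampling. Other fast-preset values
    # are not documented, so they are intentionally omitted here.
    "fast": {"num_beams": 1, "do_sample": False},
}


def generation_kwargs(preset: str) -> dict:
    """Return a copy of the keyword arguments for a named preset."""
    try:
        return dict(GENERATION_PRESETS[preset])
    except KeyError:
        raise ValueError(f"unknown preset: {preset!r}") from None


# Typical use with a Transformers model:
#   model.generate(**inputs, **generation_kwargs("quality"))
print(generation_kwargs("quality"))
```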

Batch inference

from pathlib import Path
from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp93"
)

paths = sorted(Path("tables/").glob("*.png"))
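# Each table image is expected to have a sibling .md file with its OCR markdown.
markdowns = [p.with_suffix(".md").read_text(encoding="utf-8") for p in paths]
results = extractor.extract_batch(
    paths,
    markdown_batch=markdowns,
    preset="fast",
    batch_size=1,           # batch_size>1 is unsupported by this wrapper; keep at 1
)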
)

See examples/batch.py for a CLI version. batch_size>1 is unsupported in this wrapper because beam-search batching requires the training-time collator (left-padding + cat of vision tensors) which is out of scope for an inference module.
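To keep batch results around for later scoring, one option is to write them as JSONL with one record per image. This is a minimal sketch: the helper name `write_predictions`, the record keys `image` and `parameters`, and storing only the file name are assumptions, so check the format that `score_lenient.py` actually expects before relying on it.

```python
import json
from pathlib import Path


def write_predictions(paths, results, out_path):
    """Write one JSON object per line: {"image": ..., "parameters": [...]}.

    Assumes `results` is a list of dicts with a "parameters" key, aligned
    one-to-one with `paths`.
    """
    if len(paths) != len(results):
        raise ValueError("paths and results must have the same length")
    with open(out_path, "w", encoding="utf-8") as f:
        for path, result in zip(paths, results):
            record = {"image": Path(path).name, "parameters": result["parameters"]}
            # ensure_ascii=False keeps Russian text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Usage would look like `write_predictions(paths, results, "preds.jsonl")` after the `extract_batch` call.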

Lenient scoring helper

score_lenient.py re-scores a JSONL of (image, parameters) predictions against a reference annotations JSONL using unit aliases and date-year normalization. The +0.037 LENIENT lift in the benchmark table comes from this scorer; the model output itself is identical.

python score_lenient.py preds.jsonl annotations_test.jsonl
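The kind of normalization the lenient scorer applies can be sketched as below. This is not the code in `score_lenient.py`: the alias table contains only the two synonym pairs mentioned in this card, and the function names and the year-matching rule are assumptions.

```python
import re

# Only the synonym pairs named in this card; the real alias table may be larger.
UNIT_ALIASES = {
    "millions": "million",
    "млн руб.": "млн руб",
}

# A standalone four-digit year such as 2024 (not part of a longer digit run).
YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def normalize_unit(unit: str) -> str:
    """Lowercase, trim, and map known synonyms to one canonical spelling."""
    cleaned = unit.strip().lower()
    return UNIT_ALIASES.get(cleaned, cleaned)


def dates_match(reference: str, predicted: str) -> bool:
    """Accept exact matches, or a single shared year when precision differs."""
    if reference.strip() == predicted.strip():
        return True
    ref_years = YEAR.findall(reference)
    pred_years = YEAR.findall(predicted)
    return len(ref_years) == 1 and ref_years == pred_years


print(normalize_unit("Millions"), dates_match("2024", "31.12.2024"))
```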

Output format

The model emits one parameter per line in pipe-separated sep_labels format:

<|sep_meta|>
name: Interest income|value: 533|date: 2024|unit: millions
name: Foreign-currency transaction loss|value: 89|date: 2023|unit: millions

parser.py (in this repo) converts that to {"parameters": [{...}, ...]}. The parser strips stray <|...|> control-token artifacts before splitting — the model occasionally emits one mid-row, and without this strip a leading < contaminates the previous field. This fix is worth +0.024-0.042 t_f1 on its own.
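The parsing step described above can be illustrated with a short sketch. It is not the repo's `parser.py`: the name `parse_sep_labels`, the regular expression, and the rule of dropping rows without a name are assumptions.

```python
import re

# Matches control tokens such as <|sep_meta|> or <|sep_rows|>.
CONTROL_TOKEN = re.compile(r"<\|[^|>]*\|>")

# Maps field labels in the raw output to the keys shown in Quick start.
FIELD_KEYS = {
    "name": "parameter_name",
    "value": "parameter_value",
    "date": "parameter_date",
    "unit": "parameter_unit",
}


def parse_sep_labels(text: str) -> dict:
    """Convert pipe-separated model output into {"parameters": [...]}."""
    # Strip control tokens first: they contain "|" and would otherwise
    # be split into bogus fields or leave a stray "<" behind.
    cleaned = CONTROL_TOKEN.sub("", text)
    rows = []
    for line in cleaned.splitlines():
        row = {}
        for field in line.strip().split("|"):
            key, sep, value = field.partition(":")
            key = key.strip().lower()
            if sep and key in FIELD_KEYS:
                row[FIELD_KEYS[key]] = value.strip()
        if "parameter_name" in row:  # assumption: skip lines without a name
            rows.append(row)
    return {"parameters": rows}


raw = (
    "<|sep_meta|>\n"
    "name: Interest income|value: 533<|sep_rows|>|date: 2024|unit: millions\n"
    "name: Foreign-currency transaction loss|value: 89|date: 2023|unit: millions\n"
)
for row in parse_sep_labels(raw)["parameters"]:
    print(row)
```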

Loading details

StructuredExtractor.from_pretrained does three things you'd otherwise need to wire up yourself:

Loads the processor (image processor + chat template) — first tries the uploaded checkpoint, falls back to Qwen/Qwen3-VL-4B-Instruct if the preprocessor configs aren't present.
Swaps in the fine-tuned tokenizer (which has the 4 added special tokens: <|sep_meta|>, <|sep_columns|>, <|sep_rows|>, <|sep_end|>).
Force-injects <|sep_meta|>\n as the assistant-turn prefix before generate(). This token is masked out of training labels — the model never learned to emit it, so we have to prime it.
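The prefix priming in the last step can be pictured at the string level as follows. The helper names are hypothetical, and the wrapper may inject the prefix as token ids rather than text; the sketch only shows where the prefix goes relative to the rendered prompt and the decoded continuation.

```python
ASSISTANT_PREFIX = "<|sep_meta|>\n"


def prime_prompt(rendered_prompt: str) -> str:
    """Append the forced prefix to a prompt that ends at the assistant turn.

    With Transformers, such a prompt would typically come from
    processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False).
    """
    return rendered_prompt + ASSISTANT_PREFIX


def restore_prefix(generated_text: str) -> str:
    """Re-attach the prefix to the decoded continuation before parsing."""
    if generated_text.startswith(ASSISTANT_PREFIX):
        return generated_text
    return ASSISTANT_PREFIX + generated_text


prompt = "<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n"  # illustrative only
print(repr(prime_prompt(prompt)))
```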

Hardware

| Preset | Min VRAM (single image) |
| --- | --- |
| fast (greedy) | ~12 GB |
| quality (beam=4) | ~24 GB |

bf16 on CUDA capability ≥ 8.0, float16 elsewhere. CPU works but is unusably slow for a 4B VLM with beam search.
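The dtype rule above amounts to a small decision on the GPU's compute capability. The sketch below keeps it free of a PyTorch dependency; the function name `pick_dtype_name` is hypothetical, and the wrapper's actual selection code may differ.

```python
from typing import Optional, Tuple


def pick_dtype_name(cuda_capability: Optional[Tuple[int, int]]) -> str:
    """Return "bfloat16" for CUDA compute capability >= 8.0, else "float16".

    Pass None when no CUDA device is available.
    """
    if cuda_capability is not None and cuda_capability >= (8, 0):
        return "bfloat16"
    return "float16"


# With PyTorch installed, the inputs and result would typically be wired up as:
#   cap = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
#   dtype = getattr(torch, pick_dtype_name(cap))
print(pick_dtype_name((8, 6)), pick_dtype_name((7, 5)))
```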

Limitations

Trained on financial-statement tables (RU/EN). Behavior on other domains is unmeasured.
Bimodal errors: ~42% of test samples solve well (t_f1 ≥ 0.7), ~34% fail completely (t_f1 < 0.1). Worst failures cluster in specific source documents (multi-table pages where the markdown disambiguator alone isn't enough).
Single-seed numbers vary a lot on this small dataset (~0.20 range across 4 seeds). exp93 is reproducible, but don't expect another retrain to land at exactly the same F1.
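To check whether your own data shows the same bimodal pattern, per-sample scores can be bucketed with the thresholds quoted above. This is a minimal sketch; the function name `bucket_shares` is hypothetical and the scores in the example are made up for illustration, not taken from the benchmark.

```python
def bucket_shares(per_sample_f1, solved_at=0.7, failed_below=0.1):
    """Return the share of samples that are solved, failed, or in between."""
    n = len(per_sample_f1)
    if n == 0:
        raise ValueError("need at least one score")
    solved = sum(score >= solved_at for score in per_sample_f1)
    failed = sum(score < failed_below for score in per_sample_f1)
    return {
        "solved": solved / n,
        "failed": failed / n,
        "middle": (n - solved - failed) / n,
    }


illustrative_scores = [0.9, 0.8, 0.0, 0.05, 0.4]  # made-up values
print(bucket_shares(illustrative_scores))
```

A large "failed" share alongside a large "solved" share suggests looking at failures per source document rather than at the average alone.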

License

Apache-2.0, matching the base model.

Citation / acknowledgements

Base model: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
Training framework: structured-extractor-train (DoRA r=16 + MLP, 2 epochs, real-only)

Model size: 4B params
Tensor type: BF16 (Safetensors)
Base model: Qwen/Qwen3-VL-4B-Instruct