A3-Doc

A3-Doc is the document and chart understanding specialist in the Schneewolf Labs A-series: a Stage-2 full fine-tune of A3 on ChartDocMix-v1. Where A3-Instruct is the generalist sibling, A3-Doc trades breadth for depth on the ChartQA / DocVQA / InfoVQA / TextVQA / OCRBench class of tasks.

What it is

Architecture: Qwen3-VL ViT (frozen, ~0.41 B) + 2-layer MLP projector (trained) + A2/Mistral decoder (full FFT)
Total params: 12.69 B (12.28 B trainable in Stage-2; ViT frozen)
Base: schneewolflabs/A3
Training corpus: schneewolflabs/ChartDocMix-v1 (241,435 rows: ~96% doc/chart/OCR VQA + 4% identity rehearsal)
Epochs: 1 (15,075 steps)
Effective batch: 16 (bs 1 × grad-accum 16)
Optimizer: paged AdamW 8-bit
Learning rate: 1e-5, cosine, warmup 3%
Max seq length: 2048
Vision token cap: max_pixels = 512×512 (262 K px); see the resolution note below
Hardware: 1× NVIDIA GB10 (DGX Spark, 128 GB unified)
Wall-clock: ~3.3 days
Final eval loss: 0.499 (down from 0.647 at the first eval)
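
The learning-rate entry above (peak 1e-5, cosine, 3% warmup, 15,075 steps) can be sketched as a function of the optimizer step. This is a minimal sketch, not the training code: it assumes linear warmup and a cosine decay to zero, neither of which the card states, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 1e-5        # from the table
TOTAL_STEPS = 15_075  # from the table
WARMUP_RATIO = 0.03   # from the table


def lr_at(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given optimizer step.

    Assumed shape (not stated in the card): linear warmup from 0 to peak_lr,
    then cosine decay from peak_lr down to 0 at total_steps.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Start of training, end of warmup, midpoint of the decay, final step.
    for step in (0, 453, 7_764, 15_075):
        print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the warmup covers the first 453 steps, so almost the whole epoch runs on the decaying part of the curve.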

The single-domain corpus is far easier to fit than the generalist mix: A3-Doc reaches an eval loss of 0.499 on it, well under the 0.752 that A3-Instruct reaches on its broad corpus.

Benchmarks

Greedy decoding, lmms-eval terse-answer prompt convention, 500-row validation slices (a fast read; see the caveats below). OCRBench is the exception: its score is reported over 1000 items. Metrics: ChartQA relaxed accuracy, DocVQA / InfoVQA ANLS, TextVQA VQA-accuracy, OCRBench contains-accuracy.

| Benchmark | A3-Doc | Metric |
| --- | --- | --- |
| ChartQA | 53.2 | relaxed acc |
| DocVQA | 48.4 | ANLS |
| InfoVQA | 34.2 | ANLS |
| TextVQA | 71.6 | VQA acc |
| OCRBench | 67.0 (670/1000) | contains |
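
For readers unfamiliar with the four metrics named above, here is a rough per-example sketch of each. It follows the commonly used definitions (ANLS with a 0.5 threshold, relaxed accuracy with a 5% numeric tolerance, VQA accuracy as min(matches / 3, 1)). The normalization details are assumptions and may differ from what lmms-eval actually does, so these functions are illustrative rather than a reproduction of the evaluation harness.

```python
def _levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    """ANLS for one example: best (1 - normalized edit distance) over the
    references, zeroed when the normalized distance reaches the threshold."""
    pred = " ".join(prediction.lower().split())
    best = 0.0
    for ref in references:
        gold = " ".join(ref.lower().split())
        longest = max(len(pred), len(gold))
        nl = _levenshtein(pred, gold) / longest if longest else 0.0
        best = max(best, 1.0 - nl if nl < threshold else 0.0)
    return best


def _to_float(text: str):
    """Parse a number, tolerating thousands separators and a trailing %."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy: numeric answers may be off by up to
    `tolerance` (relative); everything else needs a case-insensitive exact match."""
    p, t = _to_float(prediction), _to_float(target)
    if p is not None and t is not None:
        if t == 0:
            return p == 0
        return abs(p - t) / abs(t) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: full credit once three annotators agree."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(1.0, matches / 3.0)


def contains_match(prediction: str, references: list[str]) -> bool:
    """Contains-accuracy: any reference appears as a substring of the output."""
    pred = prediction.lower()
    return any(ref.lower() in pred for ref in references)
```

A benchmark score is then the mean of the per-example values over the slice, multiplied by 100 to match the table.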

For a Path-B graft trained on 241 K rows, TextVQA (71.6) and OCRBench (67.0) are respectable: scene-text reading and OCR transferred well. DocVQA (48.4) and InfoVQA (34.2) are the weak spots, and the reason is known (see the resolution finding below).

Caveats: most numbers come from a 500-row slice, not the full split (OCRBench is scored out of 1000, as shown in the table). ChartQA's test split interleaves human_test (harder) and augmented_test (easier), and the published number averages both, so a flat 500-row sample may over-represent one type. Treat these as indicative, not leaderboard-final.
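
To put a rough size on "indicative", the sampling error of a per-item accuracy can be estimated from the slice size alone. The sketch below uses the normal approximation to the binomial and assumes the rows are an independent random sample, which the card does not claim. It applies only to the binary-scored metrics (ChartQA relaxed accuracy, OCRBench contains-accuracy), not to ANLS or VQA accuracy.

```python
import math


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent yes/no items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def approx_95_interval(accuracy: float, n: int) -> tuple[float, float]:
    """Normal-approximation 95% interval around a measured accuracy."""
    half_width = 1.96 * binomial_standard_error(accuracy, n)
    return accuracy - half_width, accuracy + half_width


if __name__ == "__main__":
    # Scores and item counts taken from the benchmark table above.
    for name, acc, n in [("ChartQA", 0.532, 500), ("OCRBench", 0.670, 1000)]:
        low, high = approx_95_interval(acc, n)
        print(f"{name}: {acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

Under these assumptions the ChartQA figure carries an uncertainty of roughly 4 points either way and the OCRBench figure roughly 3, before accounting for any imbalance between human_test and augmented_test rows.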

The resolution finding (important)

A3-Doc was trained and evaluated at max_pixels = 512×512. DocVQA and InfoVQA are high-resolution document scans where text is tiny, so at 512² much of the text is illegible. This is the dominant limiter on those two tasks.

Diagnostic (eval-only, no retraining, same 200 rows at both resolutions):

| Benchmark | @512² (262 K px) | @1280² (1.64 M px) | Δ (ANLS ×100, points) |
| --- | --- | --- | --- |
| DocVQA (ANLS) | 0.525 | 0.580 | +5.5 |
| InfoVQA (ANLS) | 0.385 | 0.420 | +3.5 |

The frozen ViT + projector + decoder generalize to higher visual-token counts despite only seeing 512² in training. The eval-only gain is a floor; a retrain at higher max_pixels should beat it. If you run A3-Doc yourself, raise max_pixels (the ArtemisVLMProcessor accepts it) for document tasks. It costs more vision tokens and latency, but it gained 3.5 to 5.5 ANLS points in the diagnostic above.
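
The cost side of that trade-off can be estimated with a little arithmetic. The sketch below downsizes an image to fit a pixel budget and counts approximate vision tokens. Both helpers are hypothetical, and the 32-pixel token footprint (a 16-pixel patch with 2×2 merging) is an assumption about the ViT, not something this card states. The real processor may also round dimensions to patch multiples, which this sketch ignores.

```python
import math


def fit_to_max_pixels(width: int, height: int, max_pixels: int) -> tuple[int, int]:
    """Scale (width, height) down, preserving aspect ratio, so that the
    area does not exceed max_pixels. Images already under budget are unchanged."""
    area = width * height
    if area <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / area)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def approx_vision_tokens(width: int, height: int, token_side_px: int = 32) -> int:
    """Approximate token count if each token covers a token_side_px square.
    token_side_px=32 is an assumed value, not taken from the model config."""
    return math.ceil(width / token_side_px) * math.ceil(height / token_side_px)


if __name__ == "__main__":
    # A letter-size page scanned at 300 dpi, as an illustrative input.
    page = (2550, 3300)
    for budget in (512 * 512, 1280 * 1280):
        w, h = fit_to_max_pixels(*page, budget)
        print(f"budget={budget}: resized to {w}x{h}, ~{approx_vision_tokens(w, h)} tokens")
```

Whatever the true token footprint is, the pixel budget grows by (1280 / 512)² = 6.25×, so the vision-token count for a large scan grows by about the same factor.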

Intended use

Document & chart VQA, infographic QA, OCR-style reading, chart captioning. For broad conversation/creative use reach for A3-Instruct; for dense image captioning reach for A3.

Inference

import torch
from transformers import AutoConfig, AutoTokenizer
from artemis_vlm import ArtemisVLMForConditionalGeneration, ArtemisVLMProcessor

ckpt = "schneewolflabs/A3-Doc"
# load the weights in bfloat16 and move them to the GPU
model = ArtemisVLMForConditionalGeneration.from_pretrained(ckpt, dtype=torch.bfloat16).to("cuda")
# the config supplies the vision settings the processor needs
cfg = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
# raise max_pixels for document tasks (training default was 512*512):
proc = ArtemisVLMProcessor(tokenizer=tok, vision_config=cfg.vision_config,
                           max_pixels=1280*1280)

Also runs in llama.cpp via the Schneewolf-Labs/llama.cpp fork's Artemis VLM mmproj graft (same pattern as A3 / A3-Instruct).

Roadmap — A3-Doc-v2

The resolution finding points to the obvious next lever: retrain at 1024²–1280² max_pixels rather than 512². Same corpus, same recipe, higher vision budget. Expected to push DocVQA/InfoVQA well past the eval-only gains.

Lineage

License

apache-2.0, consistent with the rest of the A-series lineage. Constituent training sources carry their own licenses (see the ChartDocMix-v1 card).
