Unlimited-OCR β Core AI (on-device document OCR)
On-device document β structured-markdown OCR, end-to-end on Apple Core AI. A port of
baidu/Unlimited-OCR (3B-A0.5B MoE, MIT): drop a
document image, get back markdown β tables as HTML (<table><tr><td>β¦), formulas as LaTeX,
reading order, and <|det|> layout boxes. Japanese + English + multilingual.
Runs on the stock coreai.runtime with no engine patch β the decoder is driven directly
on inputs_embeds, so this is a pure-export port (not the static-input-buffer VLM path).
What's exciting (why you'd use it)
- Private OCR: invoices, receipts, contracts, papers, forms never leave the device.
- Structured, not just text: tables β HTML, equations β LaTeX, layout β boxes. RAG-ready ingestion.
- Flat latency: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask)
keeps every tensor shape constant, so the runtime compiles once and decode stays flat at
12.7 ms/token (79 tok/s on M4 Max) β no growing-cache recompilation stalls. - SOTA quality: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful to the fp32 reference (decoder 0 flips at the sampled steps; vision encoder cos 1.000000).
Bundles
| path | what | dtype | size |
|---|---|---|---|
vision/unlimited_ocr_vision.aimodel |
DeepEncoder (SAM-ViT + CLIP-ViT cascade) β 100 visual tokens | fp16 | 762 MB |
decoder/unlimited_ocr_decoder.aimodel |
DeepseekV2 R-SWA MoE decoder, functions prefill + decode sharing one weight set + KV state |
sym8 | 3.2 GB |
assets/embed_tokens.f16 |
token embedding table [129280,1280] (host row-gather) |
fp16 | 316 MB |
assets/{image_newline,view_seperator}.f16, assets/prompt_input_ids.i32, assets/recipe.json |
arrangement constants + the assembly recipe | β | tiny |
tokenizer/ |
fast tokenizer (tokenizer.json + configs) |
β | β |
Pipeline (Base mode, 640px)
image β preprocess (pad to 640Β², normalize mean=std=0.5)
β vision .aimodel β visual tokens [1,100,1280]
β arrange (10Γ10 + image_newline per row + view_seperator) β [111,1280]
β scatter into embed_tokens(prompt_ids) β prefix [1,115,1280]
β decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) β tokens
β detokenize (keep special tokens) β markdown
The exact, verified recipe is in assets/recipe.json. Reference implementations (Python end-to-end
- a macOS app, CoreAIOCR, driving the stock runtime) are in the
Core AI Model Zoo:
conversion/unlimited_ocr/andapps/CoreAIOCR/.
Notes
- Appropriate input: clean single-page documents (invoice / paper / report / table / formula),
roughly square or portrait, with text still legible when fit to 640Β². Very dense small-text scans
(newspaper) want the tiled
crop_modevision export (not included here; Base mode only). - Prompt is fixed to
document parsing(layout + structured extraction). - License: MIT (inherited from
baidu/Unlimited-OCR).
Community port β not affiliated with Apple or baidu.
Model tree for mlboydaisuke/Unlimited-OCR-CoreAI
Base model
baidu/Unlimited-OCR