Unlimited-OCR → Core AI (on-device document OCR)

On-device document → structured-markdown OCR, end-to-end on Apple Core AI. A port of baidu/Unlimited-OCR (3B-A0.5B MoE, MIT): drop a document image, get back markdown — tables as HTML (<table><tr><td>…), formulas as LaTeX, reading order, and <|det|> layout boxes. Japanese + English + multilingual.

Runs on the stock coreai.runtime with no engine patch — the decoder is driven directly on inputs_embeds, so this is a pure-export port (not the static-input-buffer VLM path).

What's exciting (why you'd use it)

Private OCR: invoices, receipts, contracts, papers, forms never leave the device.
Structured, not just text: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
Flat latency: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask) keeps every tensor shape constant, so the runtime compiles once and decode stays flat at ~~12.7 ms/token (~~79 tok/s on M4 Max) — no growing-cache recompilation stalls.
SOTA quality: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful to the fp32 reference (decoder 0 flips at the sampled steps; vision encoder cos 1.000000).

Bundles

path	what	dtype	size
`vision/unlimited_ocr_vision.aimodel`	DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens	fp16	762 MB
`decoder/unlimited_ocr_decoder.aimodel`	DeepseekV2 R-SWA MoE decoder, functions `prefill` + `decode` sharing one weight set + KV state	sym8	3.2 GB
`assets/embed_tokens.f16`	token embedding table `[129280,1280]` (host row-gather)	fp16	316 MB
`assets/{image_newline,view_seperator}.f16`, `assets/prompt_input_ids.i32`, `assets/recipe.json`	arrangement constants + the assembly recipe	—	tiny
`tokenizer/`	fast tokenizer (`tokenizer.json` + configs)	—	—

Pipeline (Base mode, 640px)

image → preprocess (pad to 640², normalize mean=std=0.5)
      → vision .aimodel                         → visual tokens [1,100,1280]
      → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
      → scatter into embed_tokens(prompt_ids)   → prefix [1,115,1280]
      → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
      → detokenize (keep special tokens)        → markdown

The exact, verified recipe is in assets/recipe.json. Reference implementations (Python end-to-end

a macOS app, CoreAIOCR, driving the stock runtime) are in the Core AI Model Zoo: conversion/unlimited_ocr/ and apps/CoreAIOCR/.

Notes

Appropriate input: clean single-page documents (invoice / paper / report / table / formula), roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans (newspaper) want the tiled crop_mode vision export (not included here; Base mode only).
Prompt is fixed to document parsing (layout + structured extraction).
License: MIT (inherited from baidu/Unlimited-OCR).

Community port — not affiliated with Apple or baidu.

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for mlboydaisuke/Unlimited-OCR-CoreAI

Base model

baidu/Unlimited-OCR

Finetuned

(3)

this model