Unlimited-OCR MLX

🚀 Unlimited-length document OCR model accelerated by the Apple MLX framework and deeply optimized for Apple Silicon.

📖 Model Overview

Unlimited-OCR MLX is a high-precision OCR solution that ports the Baidu PaddlePaddle team's Unlimited-OCR model to the Apple MLX framework.

It is based on the DeepSeek-V2 architecture and combines it with SAM-ViT-B + CLIP-L dual vision encoders. It can parse documents of any length in a single pass, performing end-to-end text recognition and structured extraction.

✨ Core Features

Feature	Description
📄 Document Parsing	Supports full-page OCR for PDFs and single/multi-page images
🌍 Multilingual Recognition	Precise recognition of Chinese, English, and text in other languages
📊 Table Extraction	Automatically recognizes and structures table content
🎯 Layout Analysis	Preserves original layout structure (paragraphs, headings, lists, etc.)
🔄 Unlimited Length	Dynamic image tiling, no document length restrictions

🏗️ Model Architecture

Input Image
    │
    ├──→ SAM-ViT-B (ViT-Base, 12 layers, 768 dims)
    │       │
    │       └──→ CLIP-L ViT (24 layers, 1024 dims)
    │                │
    │                └──→ Feature Concatenation [2048 dims]
    │                         │
    │                         └──→ Projection Layer Linear(2048→1280)
    │                                  │
    │                                  └──→ Image Feature Embedding
    │
    └──→ Text Tokens → Embedding
              │
              └──→ DeepSeek-V2 MoE Language Model (12 layers)
                        │
                        ├── Layer 0: Dense MLP (SwiGLU, 6848 dims)
                        ├── Layer 1-11: Mixture of Experts (64 Experts, Top-6 Routing)
                        └── Standard Multi-Head Attention + RoPE Positional Encoding
                              │
                              └──→ OCR Text Output
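The "64 Experts, Top-6 Routing" step in the diagram can be illustrated with a minimal sketch of top-k expert selection. This is not the model's actual router: the function name `route_top_k`, the softmax gating, and the renormalization of the kept weights are assumptions made for illustration, and the logits are made-up values.

```python
import math

def route_top_k(gate_logits, k=6):
    """Pick the k highest-scoring experts and return (index, weight) pairs."""
    # Softmax over all expert logits (shifted by the max for numerical stability).
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Assumption: renormalize the kept weights so they sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# 64 routed experts, as in the diagram; the logits are made-up values.
logits = [math.sin(i) for i in range(64)]
selected = route_top_k(logits, k=6)
print(selected)
```

Each token would then be processed only by the selected experts, with their outputs mixed using the returned weights. That is why a mixture-of-experts layer activates only a fraction of its parameters per token.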

Core Specifications

Parameter	Value
Total Parameters	3.34B
Vision Encoder	SAM-ViT-B (12 layers) + CLIP-L (24 layers)
Language Model	DeepSeek-V2 MoE (12 layers)
Number of Experts	64 routed experts + 2 shared experts
Attention Heads	10 (head_dim=128)
Hidden Dimension	1280
Vocabulary Size	129,280
Max Length	32,768 tokens
Framework	Apple MLX
Precision	FP16 (converted from the original BF16 weights)
Model Size	~6.2 GB
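As a quick sanity check, the figures in the table can be related to one another with a few lines of arithmetic. The sketch below assumes every parameter is stored as a 2-byte FP16 value and ignores file metadata.

```python
params = 3.34e9          # total parameters from the table
bytes_per_param = 2      # FP16 stores each value in 16 bits

size_bytes = params * bytes_per_param
size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gibibytes

heads, head_dim = 10, 128
attention_width = heads * head_dim   # should equal the hidden dimension

print(f"{size_gb:.2f} GB, {size_gib:.2f} GiB, attention width {attention_width}")
```

The result is roughly 6.2 GiB, in line with the listed model size, and 10 heads of 128 dimensions each give the hidden dimension of 1280.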

🔧 Quick Start

Requirements

macOS 14.0+ (Apple Silicon M1/M2/M3/M4)
Python 3.10+
MLX >= 0.20.0

Installation

pip install mlx mlx-lm safetensors transformers Pillow numpy

Model Download

# Download from Hugging Face
git lfs install
git clone https://huggingface.co/LoJexLLM/Unlimited-OCR-MLX

Python API

from unlimited_ocr_mlx import UnlimitedOCRInference

# Initialize engine
engine = UnlimitedOCRInference("./Unlimited-OCR-MLX")
engine.load()

# Single image OCR (high-precision dynamic tiling mode)
result = engine.infer_single(
    image_path="document.jpg",
    prompt="document parsing.",
    crop_mode=True,        # Enable dynamic tiling
    base_size=1024,        # Global view size
    image_size=640,        # Tile size
    max_length=32768,      # Max generation length
    temperature=0.0,       # Greedy decoding (high precision)
)

print(result)
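For a folder of scanned pages, the call above can be wrapped in a small loop. The helper below is a hypothetical sketch: `ocr_folder` and `IMAGE_SUFFIXES` are not part of the package, and it only assumes that the engine exposes `infer_single` with the `image_path` and `prompt` keywords shown above.

```python
from pathlib import Path

# Assumption: adjust this set to the image formats you actually use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def ocr_folder(engine, folder, **infer_kwargs):
    """Run engine.infer_single on every image in folder, sorted by file name."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        results[path.name] = engine.infer_single(
            image_path=str(path),
            prompt="document parsing.",
            **infer_kwargs,  # e.g. crop_mode=True, base_size=1024, image_size=640
        )
    return results
```

With the engine loaded as above, `ocr_folder(engine, "./scans", crop_mode=True, base_size=1024, image_size=640)` would return a dictionary mapping each file name to its OCR result.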

Command Line

python -m unlimited_ocr_mlx.inference \
    --model_dir ./Unlimited-OCR-MLX \
    --image document.jpg \
    --prompt "document parsing." \
    --output ./ocr_results \
    --crop_mode \
    --base_size 1024 \
    --image_size 640

⚡ Performance Comparison

Measured performance on Apple M4 Pro (compared to the original PyTorch MPS implementation):

Scenario	MLX (FP16)	PyTorch MPS (BF16)	Speedup
Vision Encoding (1024×1024)	~0.5s	~1.2s	2.4×
Text Generation (tokens/s)	~18 t/s	~8 t/s	2.3×
Single Page A4 Document	~2.0s	~4.8s	2.4×
Multi-page PDF (10 pages)	~15s	~38s	2.5×

MLX takes advantage of Apple Silicon's unified memory architecture and runs the model on the GPU, delivering a roughly 2.3× to 2.5× speedup over the PyTorch MPS backend in the measurements above.
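The Speedup column can be reproduced from the other two columns. The sketch below copies the approximate values from the table. Note that the ratio is inverted for throughput, where higher is better, compared with latency, where lower is better.

```python
# (MLX, PyTorch MPS) pairs copied from the table above.
latency_s = {
    "Vision Encoding (1024×1024)": (0.5, 1.2),
    "Single Page A4 Document": (2.0, 4.8),
    "Multi-page PDF (10 pages)": (15.0, 38.0),
}
throughput_tps = {"Text Generation": (18.0, 8.0)}

# Lower latency is better, so speedup = baseline / MLX.
speedups = {name: mps / mlx for name, (mlx, mps) in latency_s.items()}
# Higher throughput is better, so speedup = MLX / baseline.
speedups.update({name: mlx / mps for name, (mlx, mps) in throughput_tps.items()})

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

Because the inputs are approximate, the computed ratios agree with the table only to about one decimal place.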

🎯 Inference Modes

1. Gundam Mode (High Precision)

crop_mode=True, image_size=640
Dynamic tiling + global view
Suitable for high-precision document parsing
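To give a feel for what tiling a page into 640-pixel crops means, here is a simplified grid sketch. It is an illustration only and not necessarily the tiling policy the model uses. The function `tile_boxes` and the page size are hypothetical, and the real pipeline may resize the image or cap the number of tiles.

```python
import math

def tile_boxes(width, height, tile=640):
    """Return (left, top, right, bottom) boxes that cover a width x height image."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile, r * tile
            # Clamp edge tiles to the image bounds.
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes

boxes = tile_boxes(1700, 2200)   # hypothetical page size in pixels
print(len(boxes), boxes[0], boxes[-1])
```

In a tiling scheme of this kind the number of tiles grows with the page size, while the global view keeps a single downscaled copy of the whole page for context.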

2. Base Mode (Fast)

crop_mode=False, image_size=1024
Single-scale global encoding
Suitable for quick scanning of simple documents
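The two modes differ only in the keyword arguments passed to `infer_single`, so they can be kept as named presets. The sketch below is a hypothetical convenience layer: `MODE_PRESETS` and `mode_kwargs` are not part of the package, and only the settings listed above are included.

```python
# Settings taken from the mode descriptions above.
MODE_PRESETS = {
    "gundam": {"crop_mode": True, "image_size": 640},
    "base": {"crop_mode": False, "image_size": 1024},
}

def mode_kwargs(mode):
    """Return a copy of the keyword arguments for the named mode."""
    try:
        return dict(MODE_PRESETS[mode.lower()])
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_PRESETS)}"
        ) from None

print(mode_kwargs("base"))
```

A call such as `engine.infer_single(image_path="document.jpg", prompt="document parsing.", **mode_kwargs("base"))` would then select Base Mode without repeating the individual settings.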

📊 Precision Verification

The MLX version has undergone rigorous precision verification (256 random inputs, BF16→FP16 conversion):

Cosine Similarity: > 0.999 (vs PyTorch original model)
Token Match Rate: > 99.5% (same input, same output)
Visual Feature Consistency: Structural Similarity (SSIM) > 0.998
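The first two metrics can be computed along the following lines. This is a minimal sketch, not the project's validation script. In particular, counting a length mismatch as a miss in `token_match_rate` is an assumption about how the rate is defined.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def token_match_rate(reference, candidate):
    """Fraction of positions with identical token ids.

    Assumption: positions present in only one sequence count as mismatches.
    """
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / longest

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(token_match_rate([1, 2, 3, 4], [1, 2, 3, 5]))
```

In practice, the vectors would be flattened activations or logits from the PyTorch and MLX models for the same input, and the token sequences would be the outputs of greedy decoding.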

📁 Model Files

Unlimited-OCR-MLX/
├── model.safetensors          # MLX weights file (FP16, ~6.2 GB)
├── config.json                # Model configuration
├── tokenizer.json             # Tokenizer
├── tokenizer_config.json      # Tokenizer config
├── special_tokens_map.json    # Special token mapping
├── unlimited_ocr_mlx/         # MLX implementation code
│   ├── model.py               #   Complete model definition
│   ├── config.py              #   Configuration management
│   ├── convert.py             #   Weight conversion tool
│   ├── inference.py           #   Inference pipeline
│   ├── image_processing.py    #   Image preprocessing
│   ├── loader.py              #   Weight loader
│   └── test_validation.py     #   Precision validation
├── README.md                  # This document
└── LICENSE                    # MIT License
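After cloning, it can be useful to confirm that the weight and tokenizer files from the listing above are present, since an incomplete Git LFS checkout may leave them missing. The helper below is a hypothetical sketch using only the standard library. It checks the top-level files and nothing else.

```python
from pathlib import Path

# Top-level files from the listing above that inference depends on.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]

def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

print(missing_files("./Unlimited-OCR-MLX"))
```

An empty list means all five files were found. Any names returned point to files that still need to be downloaded.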

🙏 Acknowledgements

Original model: PaddlePaddle/Unlimited-OCR
Baidu PaddlePaddle Team
DeepSeek-OCR — Base architecture
Apple MLX — Inference acceleration framework

📄 Citation

@misc{unlimited-ocr-mlx,
  title={Unlimited-OCR MLX: High-Precision OCR on Apple Silicon},
  author={PaddlePaddle MLX Community},
  year={2026},
  url={https://huggingface.co/LoJexLLM/Unlimited-OCR-MLX}
}

📜 License

This project is open source under the MIT License. Original model copyright belongs to the Baidu PaddlePaddle team.
