Unlimited-OCR MLX

🚀 A document OCR model with no fixed page-length limit, ported to the Apple MLX framework and optimized for Apple Silicon.

📖 Model Overview

Unlimited-OCR MLX is a complete port of the Baidu PaddlePaddle team's Unlimited-OCR model to the Apple MLX framework, aimed at high-precision OCR.

The language model follows the DeepSeek-V2 architecture and is paired with two vision encoders, SAM-ViT-B and CLIP-L. Dynamic image tiling lets it parse a long document in a single pass, performing text recognition and structured extraction end to end. Generated output is bounded by the 32,768-token maximum length.

✨ Core Features

| Feature | Description |
| --- | --- |
| 📄 Document Parsing | Full-page OCR for PDFs and single- or multi-page images |
| 🌍 Multilingual Recognition | Recognizes Chinese, English, and text in other languages |
| 📊 Table Extraction | Automatically recognizes and structures table content |
| 🎯 Layout Analysis | Preserves the original layout structure (paragraphs, headings, lists, etc.) |
| 🔄 Unlimited Length | Dynamic image tiling, so page size and document length are not fixed in advance |

🏗️ Model Architecture

Input Image
    │
    ├──→ SAM-ViT-B (ViT-Base, 12 layers, 768 dims)
    │       │
    │       └──→ CLIP-L ViT (24 layers, 1024 dims)
    │                │
    │                └──→ Feature Concatenation [2048 dims]
    │                         │
    │                         └──→ Projection Layer Linear(2048→1280)
    │                                  │
    │                                  └──→ Image Feature Embedding
    │
    └──→ Text Tokens → Embedding
              │
              └──→ DeepSeek-V2 MoE Language Model (12 layers)
                        │
                        ├── Layer 0: Dense MLP (SwiGLU, 6848 dims)
                        ├── Layers 1-11: Mixture of Experts (64 routed experts, Top-6 routing)
                        └── Standard Multi-Head Attention + RoPE Positional Encoding
                              │
                              └──→ OCR Text Output
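
The Top-6 routing in the MoE layers can be illustrated with a small, generic sketch. It is not taken from `model.py`: the softmax-then-top-k order and the renormalization of the selected weights are assumptions, and the router logits here are random numbers rather than real activations.

```python
import math
import random

NUM_EXPERTS = 64   # routed experts per MoE layer (from the diagram above)
TOP_K = 6          # experts activated per token (from the diagram above)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are renormalized to sum to 1 over the selected experts; whether
    the real model renormalizes is an assumption of this sketch.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
routing = route_token(logits)
print(len(routing), "experts selected")
print("weights sum to", round(sum(w for _, w in routing), 6))
```

Only the selected experts run for a given token, which is why a model with 64 experts per layer does far less work per token than its total parameter count suggests.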

Core Specifications

| Parameter | Value |
| --- | --- |
| Total Parameters | 3.34B |
| Vision Encoder | SAM-ViT-B (12 layers) + CLIP-L (24 layers) |
| Language Model | DeepSeek-V2 MoE (12 layers) |
| Number of Experts | 64 routed experts + 2 shared experts |
| Attention Heads | 10 (head_dim=128) |
| Hidden Dimension | 1280 |
| Vocabulary Size | 129,280 |
| Max Length | 32,768 tokens |
| Framework | Apple MLX |
| Precision | FP16 (converted from the original BF16 weights) |
| Model Size | ~6.2 GB |
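
As a quick sanity check on these figures, the short calculation below relates the parameter count to the file size and the head count to the hidden dimension. It assumes every parameter is stored as a 2-byte FP16 value and that the "~6.2 GB" figure is expressed in binary units (GiB); neither assumption is stated in the specifications.

```python
TOTAL_PARAMS = 3.34e9      # "Total Parameters" value
BYTES_PER_PARAM = 2        # FP16 stores each value in 16 bits
NUM_HEADS = 10             # "Attention Heads" value
HEAD_DIM = 128             # head_dim from the same entry

size_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gibibytes

print(f"{size_gb:.2f} GB, {size_gib:.2f} GiB")
print("attention width:", NUM_HEADS * HEAD_DIM)
```

Under those assumptions the weights come to about 6.68 GB in decimal units, or about 6.22 GiB, and 10 heads of 128 dimensions match the hidden dimension of 1280.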

🔧 Quick Start

Requirements

  • macOS 14.0+ (Apple Silicon M1/M2/M3/M4)
  • Python 3.10+
  • MLX >= 0.20.0

Installation

pip install mlx mlx-lm safetensors transformers Pillow numpy

Model Download

# Download from Hugging Face
git lfs install
git clone https://huggingface.co/LoJexLLM/Unlimited-OCR-MLX

Python API

from unlimited_ocr_mlx import UnlimitedOCRInference

# Initialize engine
engine = UnlimitedOCRInference("./Unlimited-OCR-MLX")
engine.load()

# Single image OCR (high-precision dynamic tiling mode)
result = engine.infer_single(
    image_path="document.jpg",
    prompt="document parsing.",
    crop_mode=True,        # Enable dynamic tiling
    base_size=1024,        # Global view size
    image_size=640,        # Tile size
    max_length=32768,      # Max generation length
    temperature=0.0,       # Greedy decoding (high precision)
)

print(result)
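
For a folder of scans, the same call can be wrapped in a small helper. This is a minimal sketch built only on the `infer_single` call shown above: the `ocr_folder` helper, the `./scans` folder, and the list of accepted file suffixes are hypothetical, and it assumes `infer_single` returns the recognized text.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of supported formats


def ocr_folder(engine, folder, prompt="document parsing."):
    """Run infer_single on every image in `folder`; return {filename: text}."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        results[path.name] = engine.infer_single(
            image_path=str(path),
            prompt=prompt,
            crop_mode=True,
            base_size=1024,
            image_size=640,
            max_length=32768,
            temperature=0.0,
        )
    return results


def main():
    """Example entry point; needs the package and the downloaded model."""
    from unlimited_ocr_mlx import UnlimitedOCRInference

    engine = UnlimitedOCRInference("./Unlimited-OCR-MLX")
    engine.load()
    for name, text in ocr_folder(engine, "./scans").items():  # hypothetical folder
        print(f"--- {name} ---")
        print(text)
```

Loading the engine once and reusing it across files avoids reloading the weights for every image.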

Command Line

python -m unlimited_ocr_mlx.inference \
    --model_dir ./Unlimited-OCR-MLX \
    --image document.jpg \
    --prompt "document parsing." \
    --output ./ocr_results \
    --crop_mode \
    --base_size 1024 \
    --image_size 640

⚡ Performance Comparison

Measured on an Apple M4 Pro, compared with the original model running on the PyTorch MPS backend:

| Scenario | MLX (FP16) | PyTorch MPS (BF16) | Speedup |
| --- | --- | --- | --- |
| Vision Encoding (1024×1024) | ~0.5s | ~1.2s | 2.4× |
| Text Generation (tokens/s) | ~18 t/s | ~8 t/s | 2.3× |
| Single Page A4 Document | ~2.0s | ~4.8s | 2.4× |
| Multi-page PDF (10 pages) | ~15s | ~38s | 2.5× |

MLX takes advantage of Apple Silicon's unified memory architecture and runs on the GPU through Metal (it does not target the Neural Engine). In these measurements it is roughly 2.3× to 2.5× faster than the PyTorch MPS backend.
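
The speedup column can be reproduced from the approximate timings. The snippet below is only a worked check of that arithmetic; note that the text-generation row is a throughput, so its ratio is taken the other way round.

```python
# (MLX seconds, PyTorch MPS seconds) copied from the measurements above
timings = {
    "Vision Encoding (1024x1024)": (0.5, 1.2),
    "Single Page A4 Document": (2.0, 4.8),
    "Multi-page PDF (10 pages)": (15.0, 38.0),
}
# Throughput is in tokens per second, so higher is better and the ratio flips.
throughput = {"Text Generation": (18.0, 8.0)}

speedups = {name: mps / mlx for name, (mlx, mps) in timings.items()}
speedups.update({name: mlx / mps for name, (mlx, mps) in throughput.items()})

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

Because the inputs are rounded ("~"), the ratios agree with the published column only to about one decimal place.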

🎯 Inference Modes

1. Gundam Mode (High Precision)

  • crop_mode=True, image_size=640
  • Dynamic tiling + global view
  • Suitable for high-precision document parsing

2. Base Mode (Fast)

  • crop_mode=False, image_size=1024
  • Single-scale global encoding
  • Suitable for quick scanning of simple documents
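
To give a feel for what dynamic tiling does in Gundam mode, here is a simplified sketch that picks a tile grid matching the page's aspect ratio and lists the crop boxes. It is not the logic in `image_processing.py`: the `choose_grid` and `tile_boxes` helpers and the `min_tiles`/`max_tiles` limits are hypothetical, and the real selection rule may differ.

```python
def choose_grid(width, height, min_tiles=2, max_tiles=9):
    """Pick the (cols, rows) grid whose aspect ratio best matches the image."""
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Ties on aspect-ratio error go to the grid with more tiles (more detail).
    return min(candidates, key=lambda g: (abs(target - g[0] / g[1]), -g[0] * g[1]))


def tile_boxes(cols, rows, tile=640):
    """Crop boxes (left, top, right, bottom) on an image resized to the grid."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(1654, 2339)   # A4 page scanned at 200 dpi
boxes = tile_boxes(cols, rows)
print(f"grid {cols}x{rows}: {len(boxes)} tiles of 640 px plus one 1024 px global view")
```

For a portrait A4 scan this sketch selects a 2×3 grid, so the page is encoded as six 640-pixel tiles alongside the global view. Base mode skips the tiling and encodes only the single global image.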

📊 Precision Verification

The MLX port was checked against the original PyTorch model on 256 random inputs after the BF16→FP16 weight conversion:

  • Cosine Similarity: > 0.999 (vs. the original PyTorch model)
  • Token Match Rate: > 99.5% of generated tokens identical for the same input
  • Visual Feature Consistency: structural similarity (SSIM) > 0.998
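
The first two metrics are simple to compute. The sketch below shows one plausible definition of each on toy data; it is not the code in `test_validation.py`, and the treatment of sequences with different lengths is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_match_rate(reference, candidate):
    """Fraction of positions where both sequences hold the same token id.

    Positions missing from the shorter sequence count as mismatches.
    """
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / longest


# Toy data standing in for real model outputs.
ref_logits = [0.12, -1.50, 3.20, 0.70]
mlx_logits = [0.12, -1.49, 3.21, 0.70]
ref_tokens = [101, 7, 42, 42, 9]
mlx_tokens = [101, 7, 42, 41, 9]

print(f"cosine similarity: {cosine_similarity(ref_logits, mlx_logits):.6f}")
print(f"token match rate:  {token_match_rate(ref_tokens, mlx_tokens):.1%}")
```

With greedy decoding (`temperature=0.0`) a single early mismatch can change every later token, so a match rate above 99.5% indicates the two implementations rarely diverge.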

📁 Model Files

Unlimited-OCR-MLX/
├── model.safetensors          # MLX weights file (FP16, ~6.2 GB)
├── config.json                # Model configuration
├── tokenizer.json             # Tokenizer
├── tokenizer_config.json      # Tokenizer config
├── special_tokens_map.json    # Special token mapping
├── unlimited_ocr_mlx/         # MLX implementation code
│   ├── model.py               #   Complete model definition
│   ├── config.py              #   Configuration management
│   ├── convert.py             #   Weight conversion tool
│   ├── inference.py           #   Inference pipeline
│   ├── image_processing.py    #   Image preprocessing
│   ├── loader.py              #   Weight loader
│   └── test_validation.py     #   Precision validation
├── README.md                  # This document
└── LICENSE                    # MIT License

📄 Citation

@misc{unlimited-ocr-mlx,
  title={Unlimited-OCR MLX: High-Precision OCR on Apple Silicon},
  author={PaddlePaddle MLX Community},
  year={2026},
  url={https://huggingface.co/LoJexLLM/Unlimited-OCR-MLX}
}

📜 License

This project is open source under the MIT License. Original model copyright belongs to the Baidu PaddlePaddle team.
