Unlimited-OCR MLX
🚀 An unlimited-length document OCR model accelerated by the Apple MLX framework and optimized for Apple Silicon.
📖 Model Overview
Unlimited-OCR MLX is a high-precision OCR solution that ports the Baidu PaddlePaddle team's Unlimited-OCR model to the Apple MLX framework.
It is based on the DeepSeek-V2 architecture combined with dual vision encoders (SAM-ViT-B and CLIP-L). It parses documents of any length in a single pass, performing end-to-end text recognition and structured extraction.
✨ Core Features
| Feature | Description |
|---|---|
| 📄 Document Parsing | Supports full-page OCR for PDFs and single/multi-page images |
| 🌍 Multilingual Recognition | Precise recognition of Chinese, English, and other multilingual text |
| 📊 Table Extraction | Automatically recognizes and structures table content |
| 🎯 Layout Analysis | Preserves original layout structure (paragraphs, headings, lists, etc.) |
| 🔄 Unlimited Length | Dynamic image tiling, no document length restrictions |
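The "Unlimited Length" row refers to dynamic image tiling. As a rough illustration of the idea, the sketch below picks a tile grid whose aspect ratio is closest to the page and lists the crop boxes. The function names and the `min_tiles`/`max_tiles` bounds are assumptions made for this example; the actual logic in `image_processing.py` may differ.

```python
def choose_tile_grid(width, height, min_tiles=2, max_tiles=9):
    """Pick (cols, rows) whose aspect ratio best matches the image.

    min_tiles and max_tiles are illustrative bounds, not values from the model.
    """
    aspect = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    return min(candidates, key=lambda grid: abs(aspect - grid[0] / grid[1]))


def tile_boxes(cols, rows, tile_size=640):
    """Return (left, top, right, bottom) boxes, row by row, for an image
    that has already been resized to (cols * tile_size, rows * tile_size)."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


# An A4 page scanned at 150 DPI is roughly 1240 x 1754 pixels.
grid = choose_tile_grid(1240, 1754)
boxes = tile_boxes(*grid)
print(grid, len(boxes))
```

Each box would then be cropped and encoded alongside a downscaled global view of the whole page.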
🏗️ Model Architecture
Input Image
│
├──→ SAM-ViT-B (ViT-Base, 12 layers, 768 dims)
│ │
│ └──→ CLIP-L ViT (24 layers, 1024 dims)
│ │
│ └──→ Feature Concatenation [2048 dims]
│ │
│ └──→ Projection Layer Linear(2048→1280)
│ │
│ └──→ Image Feature Embedding
│
└──→ Text Tokens → Embedding
│
└──→ DeepSeek-V2 MoE Language Model (12 layers)
│
├── Layer 0: Dense MLP (SwiGLU, 6848 dims)
├── Layer 1-11: Mixture of Experts (64 Experts, Top-6 Routing)
└── Standard Multi-Head Attention + RoPE Positional Encoding
│
└──→ OCR Text Output
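To make the "64 Experts, Top-6 Routing" step in the diagram concrete, here is a minimal pure-Python sketch of top-k expert selection for a single token. It is not taken from `model.py`: the function name is hypothetical, and renormalising the kept weights is an assumption that some MoE implementations do not make.

```python
import math


def route_top_k(router_logits, k=6):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Assumption: renormalise the kept weights so they sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


# Toy router scores for 64 experts; higher index means higher score here.
logits = [0.01 * i for i in range(64)]
selected = route_top_k(logits, k=6)
print([index for index, _ in selected])
```

The token's output would be the weighted sum of the selected experts' outputs, plus the output of the shared experts, which are always active.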
Core Specifications
| Parameter | Value |
|---|---|
| Total Parameters | 3.34B |
| Vision Encoder | SAM-ViT-B (12 layers) + CLIP-L (24 layers) |
| Language Model | DeepSeek-V2 MoE (12 layers) |
| Number of Experts | 64 routed experts + 2 shared experts |
| Attention Heads | 10 (head_dim=128) |
| Hidden Dimension | 1280 |
| Vocabulary Size | 129,280 |
| Max Length | 32,768 tokens |
| Framework | Apple MLX |
| Precision | FP16 (converted from the original BF16 weights) |
| Model Size | ~6.2 GB |
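Two rows of the table can be cross-checked with simple arithmetic. The sketch below assumes that every parameter is stored as a 2-byte FP16 value and that the "~6.2 GB" figure is expressed in binary gigabytes (GiB); both are assumptions, not statements from the model authors.

```python
total_params = 3.34e9      # "Total Parameters" row
bytes_per_param = 2        # FP16 uses 2 bytes per value

size_bytes = total_params * bytes_per_param
size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gigabytes

# Hidden dimension should equal heads x head_dim.
num_heads, head_dim = 10, 128
hidden_dim = num_heads * head_dim

print(f"{size_gb:.2f} GB, {size_gib:.2f} GiB, hidden={hidden_dim}")
```

Under these assumptions the weights come to about 6.68 decimal GB, or about 6.22 GiB, which matches the listed size.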
🔧 Quick Start
Requirements
- macOS 14.0+ (Apple Silicon M1/M2/M3/M4)
- Python 3.10+
- MLX >= 0.20.0
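A quick way to check the platform requirements before installing is sketched below, using only the standard library. The helper names are hypothetical, and the check does not cover the MLX version, which can only be inspected after installation.

```python
import platform
import sys


def environment_problems(system, machine, python_version, macos_version):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    if system != "Darwin" or machine != "arm64":
        problems.append("Apple Silicon macOS is required")
    if macos_version < (14, 0):
        problems.append("macOS 14.0 or newer is required")
    if python_version < (3, 10):
        problems.append("Python 3.10 or newer is required")
    return problems


def current_environment():
    """Collect the values that environment_problems expects."""
    release = platform.mac_ver()[0]  # empty string on non-macOS systems
    parts = [int(p) for p in release.split(".")[:2]] if release else []
    parts += [0] * (2 - len(parts))  # pad "14" to (14, 0)
    return (platform.system(), platform.machine(),
            tuple(sys.version_info[:2]), tuple(parts))


for problem in environment_problems(*current_environment()):
    print("Requirement not met:", problem)
```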
Installation
pip install mlx mlx-lm safetensors transformers Pillow numpy
Model Download
# Download from Hugging Face
git lfs install
git clone https://huggingface.co/LoJexLLM/Unlimited-OCR-MLX
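As an alternative to `git clone`, the repository can probably also be fetched from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch that assumes `huggingface_hub` is installed (`pip install huggingface_hub`); the wrapper function name is made up for this example.

```python
def download_model(local_dir="Unlimited-OCR-MLX"):
    """Download the model repository and return the local folder path."""
    # Imported here so the helper can be defined even when
    # huggingface_hub is not installed yet.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="LoJexLLM/Unlimited-OCR-MLX",
        local_dir=local_dir,
    )
```

Calling `download_model()` starts a multi-gigabyte download, so it is left out of the sketch itself.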
Python API
from unlimited_ocr_mlx import UnlimitedOCRInference

# Initialize engine
engine = UnlimitedOCRInference("./Unlimited-OCR-MLX")
engine.load()

# Single image OCR (high-precision dynamic tiling mode)
result = engine.infer_single(
    image_path="document.jpg",
    prompt="document parsing.",
    crop_mode=True,      # Enable dynamic tiling
    base_size=1024,      # Global view size
    image_size=640,      # Tile size
    max_length=32768,    # Max generation length
    temperature=0.0,     # Greedy decoding (high precision)
)
print(result)
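For a folder of page images, the single-image call can be wrapped in a loop. The sketch below assumes that `infer_single` returns the recognized text as a string (the example above only prints it) and that any object with a compatible `infer_single` method is passed in; `ocr_folder` and `IMAGE_SUFFIXES` are hypothetical names.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def ocr_folder(engine, image_dir, output_dir, **infer_kwargs):
    """Run OCR on every image in image_dir and write one .txt file per image."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        # Assumption: infer_single returns the OCR result as a string.
        text = engine.infer_single(
            image_path=str(image),
            prompt="document parsing.",
            **infer_kwargs,
        )
        target = out / f"{image.stem}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With a loaded engine, a call such as `ocr_folder(engine, "./scans", "./ocr_results", crop_mode=True, base_size=1024, image_size=640)` would process the pages in filename order.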
Command Line
python -m unlimited_ocr_mlx.inference \
--model_dir ./Unlimited-OCR-MLX \
--image document.jpg \
--prompt "document parsing." \
--output ./ocr_results \
--crop_mode \
--base_size 1024 \
--image_size 640
⚡ Performance Comparison
Measured performance on Apple M4 Pro (compared to original PyTorch MPS):
| Scenario | MLX (FP16) | PyTorch MPS (BF16) | Speedup |
|---|---|---|---|
| Vision Encoding (1024×1024) | ~0.5s | ~1.2s | 2.4× |
| Text Generation (tokens/s) | ~18 t/s | ~8 t/s | 2.3× |
| Single Page A4 Document | ~2.0s | ~4.8s | 2.4× |
| Multi-page PDF (10 pages) | ~15s | ~38s | 2.5× |
MLX takes advantage of Apple Silicon's unified memory architecture, in which the CPU and GPU operate on the same arrays without copying. This contributes to the speedup over the PyTorch MPS backend.
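The Speedup column can be reproduced from the two timing columns. Note that the ratio is inverted for the throughput row, where higher is better. The helper names below are illustrative only.

```python
def latency_speedup(baseline_seconds, mlx_seconds):
    """Speedup for timing rows: lower is better, so divide baseline by MLX."""
    return baseline_seconds / mlx_seconds


def throughput_speedup(baseline_tps, mlx_tps):
    """Speedup for tokens-per-second rows: higher is better."""
    return mlx_tps / baseline_tps


speedups = {
    "Vision Encoding (1024x1024)": latency_speedup(1.2, 0.5),
    "Text Generation (tokens/s)": throughput_speedup(8, 18),
    "Single Page A4 Document": latency_speedup(4.8, 2.0),
    "Multi-page PDF (10 pages)": latency_speedup(38, 15),
}
for scenario, value in speedups.items():
    print(f"{scenario}: {value:.2f}x")
```

The text generation ratio is exactly 2.25, which the table reports as 2.3×.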
🎯 Inference Modes
1. Gundam Mode (High Precision)
crop_mode=True, image_size=640
- Dynamic tiling + global view
- Suitable for high-precision document parsing
2. Base Mode (Fast)
crop_mode=False, image_size=1024
- Single-scale global encoding
- Suitable for quick scanning of simple documents
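One way to keep the two modes straight in application code is a small preset table. The sketch below is an assumption about how the settings might be organized, not part of the package. The `base_size=1024` value for Gundam mode comes from the Quick Start example; Base mode lists only the two settings given above.

```python
# Hypothetical presets mirroring the two modes described above.
MODES = {
    "gundam": {"crop_mode": True, "base_size": 1024, "image_size": 640},
    "base": {"crop_mode": False, "image_size": 1024},
}


def mode_kwargs(name):
    """Return a copy of the keyword arguments for the named mode."""
    try:
        return dict(MODES[name.lower()])
    except KeyError:
        raise ValueError(f"unknown mode {name!r}; expected one of {sorted(MODES)}")


print(mode_kwargs("Gundam"))
```

The returned dictionary could be unpacked into a call such as `engine.infer_single(image_path="document.jpg", prompt="document parsing.", **mode_kwargs("gundam"))`.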
📊 Precision Verification
The MLX version was verified against the original PyTorch model after the BF16→FP16 conversion, using 256 random inputs:
- Cosine Similarity: > 0.999 (vs PyTorch original model)
- Token Match Rate: > 99.5% (same input, same output)
- Visual Feature Consistency: Structural Similarity (SSIM) > 0.998
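For readers who want to run a similar comparison, the first two metrics can be computed as sketched below. How `test_validation.py` defines them is not stated here, so these definitions (in particular, dividing matches by the longer sequence length) are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_match_rate(reference, candidate):
    """Fraction of positions where both token sequences agree.

    Positions missing from the shorter sequence count as mismatches.
    """
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0  # two empty sequences match trivially
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / longest


# Toy values standing in for PyTorch and MLX outputs.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.1]))
print(token_match_rate([5, 8, 13, 21], [5, 8, 13, 34]))
```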
📁 Model Files
Unlimited-OCR-MLX/
├── model.safetensors # MLX weights file (FP16, ~6.2 GB)
├── config.json # Model configuration
├── tokenizer.json # Tokenizer
├── tokenizer_config.json # Tokenizer config
├── special_tokens_map.json # Special token mapping
├── unlimited_ocr_mlx/ # MLX implementation code
│ ├── model.py # Complete model definition
│ ├── config.py # Configuration management
│ ├── convert.py # Weight conversion tool
│ ├── inference.py # Inference pipeline
│ ├── image_processing.py # Image preprocessing
│ ├── loader.py # Weight loader
│ └── test_validation.py # Precision validation
├── README.md # This document
└── LICENSE # MIT License
🙏 Acknowledgements
- Original model: PaddlePaddle/Unlimited-OCR
- Baidu PaddlePaddle Team
- DeepSeek-OCR — Base architecture
- Apple MLX — Inference acceleration framework
📄 Citation
@misc{unlimited-ocr-mlx,
title={Unlimited-OCR MLX: High-Precision OCR on Apple Silicon},
author={PaddlePaddle MLX Community},
year={2026},
url={https://huggingface.co/LoJexLLM/Unlimited-OCR-MLX}
}
📜 License
This project is open source under the MIT License. Original model copyright belongs to the Baidu PaddlePaddle team.