Instructions to use sabafallah/Unlimited-OCR-Universal with libraries and local apps.

Libraries

How to use sabafallah/Unlimited-OCR-Universal with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="sabafallah/Unlimited-OCR-Universal", trust_remote_code=True)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("sabafallah/Unlimited-OCR-Universal", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use sabafallah/Unlimited-OCR-Universal with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sabafallah/Unlimited-OCR-Universal"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sabafallah/Unlimited-OCR-Universal",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
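
The same request can be sent from Python. The block below is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM, and the response shape (`choices[0].text`) assumes the server follows the OpenAI completions format as the curl comment above states.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    req = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response with a "choices" list.
    return body["choices"][0]["text"]


# Example (requires the server started above to be running):
# print(complete("http://localhost:8000",
#                "sabafallah/Unlimited-OCR-Universal",
#                "Once upon a time,"))
```

Splitting request construction from sending keeps the payload easy to inspect before anything goes over the network.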

Use Docker

docker model run hf.co/sabafallah/Unlimited-OCR-Universal

SGLang

How to use sabafallah/Unlimited-OCR-Universal with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sabafallah/Unlimited-OCR-Universal" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sabafallah/Unlimited-OCR-Universal",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sabafallah/Unlimited-OCR-Universal" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sabafallah/Unlimited-OCR-Universal",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use sabafallah/Unlimited-OCR-Universal with Docker Model Runner:
```
docker model run hf.co/sabafallah/Unlimited-OCR-Universal
```

Baidu Inc.

Unlimited OCR Works

Welcome the Era of One-shot Long-horizon Parsing.

Unlimited OCR overview

Multi-device fork. This is a community fork of baidu/Unlimited-OCR (MIT) that also runs on Apple Silicon (MPS) and CPU, not just CUDA. The model weights are unchanged — only the inference code was made device-agnostic, plus a fix for a masked_scatter_ bug on the MPS backend and automatic fp32 on MPS (bf16 degrades there). See Transformers on Apple Silicon (MPS) for setup, or just run python demo/run_ocr.py image. Full credit for the model and research goes to the original authors.

Release

[2026/06/22] 🚀 We present Unlimited-OCR, aiming to push Deepseek-OCR one step further.
[2026/06/23] 🤝 Thanks to the ModelScope community for their support. Our model is now available at ModelScope.

Inference

Transformers

Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements, tested on Python 3.12.3 with CUDA 12.9:

torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2
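
To confirm that an environment matches these pins before loading the model, a check along the following lines may help. It is a sketch built on the standard library's `importlib.metadata`; the helper names (`parse_pins`, `installed_version`, `find_mismatches`) are made up for this example, and stripping a local version suffix such as `+cu129` is an assumption about how the torch wheels are labelled.

```python
from importlib import metadata

PINS = """
torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, version = line.split("==", 1)
            pins[name] = version
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Ignore a local version suffix such as "+cu129" when comparing.
        if found is None or found.split("+", 1)[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Example: print(find_mismatches(parse_pins(PINS)))
```

An empty dict means every pinned package is installed at the listed version.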

import os
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'baidu/Unlimited-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

# ── Single image supports two configs: gundam or base ──
# gundam: base_size=1024, image_size=640, crop_mode=True
# base: base_size=1024, image_size=1024, crop_mode=False
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024, image_size=640, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

# ── Multi page / PDF only uses base (image_size=1024) ──
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

# ── PDF (convert pages to images, then multi-page parsing) ──
import tempfile

import fitz  # PyMuPDF

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths

model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)
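
The two single-image configs named in the comments above (gundam and base) can be kept in one lookup table so that call sites do not repeat the numbers. This is only a convenience sketch: `MODE_CONFIGS` and `infer_kwargs` are hypothetical names, and the values are copied from the snippet above rather than read from the model code.

```python
# Values taken from the comments in the example above.
MODE_CONFIGS = {
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
}


def infer_kwargs(mode, ngram_window=128):
    """Return the keyword arguments for model.infer() for a named mode."""
    if mode not in MODE_CONFIGS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_CONFIGS)}")
    return {
        **MODE_CONFIGS[mode],
        "max_length": 32768,
        "no_repeat_ngram_size": 35,
        "ngram_window": ngram_window,
        "save_results": True,
    }


# Example, with `model` and `tokenizer` loaded as above:
# model.infer(tokenizer, prompt='<image>document parsing.',
#             image_file='your_image.jpg', output_path='your/output/dir',
#             **infer_kwargs("base"))
```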

Transformers on Apple Silicon (MPS)

The CUDA snippet above does not run as-is on a Mac — it hardcodes .cuda() and torch.autocast("cuda", ...). The model code in this repo has been patched to be device-agnostic, and this directory ships a pyproject.toml plus a ready-to-run demo (demo/run_ocr.py). The SGLang path below (--attention-backend fa3, FlashAttention-3) is CUDA-only and is not available on MPS; use this Transformers path instead.

Setup (uv-managed virtualenv — same pins as the CUDA list; the macOS arm64 torch/torchvision wheels ship with MPS):

uv venv --python 3.12
source .venv/bin/activate
uv pip install -r pyproject.toml

Run (the weights in this directory are loaded directly — no Hub re-download):

python demo/run_ocr.py image                            # bundled sample image
python demo/run_ocr.py image your_image.jpg --mode gundam   # or --mode base
python demo/run_ocr.py multi page1.png page2.png
python demo/run_ocr.py pdf your_doc.pdf --dpi 300
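
For a sense of what a DPI value means in pixels: the `pdf_to_images` helper in the CUDA example scales each page by `dpi / 72`, because PDF page sizes are expressed in points (72 per inch). The sketch below works through that arithmetic, assuming `--dpi` in the demo script has the same meaning. `rendered_size` is a hypothetical helper, and the real pixmap may differ by a pixel depending on how the renderer rounds.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are given in points


def zoom_for_dpi(dpi):
    """Scale factor passed to fitz.Matrix(zoom, zoom) for a target DPI."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page rendered at the given DPI."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


# A US Letter page is 612 x 792 points.
letter_at_300 = rendered_size(612, 792, 300)   # (2550, 3300)
letter_at_150 = rendered_size(612, 792, 150)   # (1275, 1650)
```

Halving the DPI cuts the pixel count to a quarter, which is worth keeping in mind for long documents.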

Or load it straight from the Hub with from_pretrained (downloads the weights and the patched trust_remote_code files — no clone needed):

import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")  # CPU fallback for ops without an MPS kernel

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sabafallah/Unlimited-OCR-Universal"
if torch.cuda.is_available():
    device, dtype = "cuda", torch.bfloat16     # bf16 is correct (and fast) on CUDA
elif torch.backends.mps.is_available():
    device, dtype = "mps", torch.float32       # use fp32 on MPS -- bf16 degrades the output there
else:
    device, dtype = "cpu", torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=dtype,
    attn_implementation="eager",      # only "mha_eager" is registered for this config
)
model = model.eval().to(device)

# Single image — gundam (crops; best for dense pages). For base: image_size=1024, crop_mode=False.
model.infer(
    tokenizer,
    prompt="<image>Free OCR.",        # or "<image>document parsing." for layout + boxes
    image_file="your_image.jpg",
    output_path="output",
    base_size=1024, image_size=640, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

Multi-page and PDF inputs use model.infer_multi(...) exactly as in the CUDA example above; keep the same dtype/device selection and attn_implementation="eager".

Three things differ from CUDA and are handled for you by the scripts / patched code:

Run in float32 on MPS, not bfloat16. On MPS, bf16 rounding is amplified by the MoE router (its top-6-of-64 expert selection flips under bf16), so the decode drifts into repeated garbage after a few tokens. fp32 reproduces the CUDA results exactly (~13 GB RAM). The scripts handle this automatically — they select torch.float32 on MPS/CPU and keep torch.bfloat16 on CUDA (the Hub snippet above does the same).
attn_implementation="eager". With use_mla=false, the DeepseekV2 decoder only registers a mha_eager attention class, so any other backend raises a KeyError.
masked_scatter_ is broken on MPS (torch 2.10 silently writes garbage), which left the image-token embeddings unfilled and made the model emit <eos> immediately. The repo code now injects image features with an index-assignment equivalent on MPS, while CUDA/CPU keep the original masked_scatter_. PYTORCH_ENABLE_MPS_FALLBACK=1 (set by the scripts) lets any op without an MPS kernel fall back to CPU instead of crashing.
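
The first point, that reduced precision can change which experts a top-k router picks, can be seen in a toy example. The block below is an illustration only and is not the model's router: it uses four made-up scores and top-2 instead of top-6-of-64, emulates bfloat16 rounding in pure Python, and breaks ties by original order, which real kernels are not guaranteed to do.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; add a rounding bias first.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def top_k(scores, k):
    """Indices of the k largest scores; ties keep their original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical router scores that differ only in the fourth decimal place.
router_scores = [1.0001, 1.0003, 1.0002, 0.5]

full_precision_pick = top_k(router_scores, 2)
bf16_pick = top_k([to_bf16(s) for s in router_scores], 2)
```

Here `full_precision_pick` is `[1, 2]`, while `bf16_pick` is `[0, 1]`: near 1.0 the spacing between adjacent bfloat16 values is 2^-7, so the three close scores all round to 1.0 and the selection is decided by tie-breaking rather than by the scores.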

SGLang

SGLang (--attention-backend fa3, FlashAttention-3) is CUDA-only and does not run on MPS. This fork does not bundle the SGLang wheel (to keep the repo slim); grab sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl from the original baidu/Unlimited-OCR repo.

Set up the environment (uv-managed virtualenv). Install the SGLang wheel first, then install the pinned kernels package and PyMuPDF (for PDF-to-image conversion) at the versions shown in the commands below:

uv venv --python 3.12
source .venv/bin/activate

uv pip install sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2

Start the SGLang server:

python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --page-size 1 \
    --mem-fraction-static 0.8 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --disable-overlap-schedule \
    --skip-server-warmup \
    --host 0.0.0.0 \
    --port 10000
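
The server needs some time to load the weights before it answers requests, so a client script may want to wait for it. The following is a sketch using only the standard library; it assumes the server exposes the OpenAI-style `GET /v1/models` endpoint, and the helper names `probe_models_endpoint` and `wait_until_ready` are invented for this example.

```python
import time
import urllib.request


def probe_models_endpoint(base_url, timeout=5):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_ready(probe, timeout_s=600, interval_s=5,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s seconds have passed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example:
# ready = wait_until_ready(lambda: probe_models_endpoint("http://127.0.0.1:10000"))
```

Passing the probe in as a callable keeps the waiting logic independent of the endpoint, so a different health check can be swapped in if this one does not fit.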

Send streaming requests to the OpenAI-compatible API:

import base64
import json
import os
import tempfile

import fitz
import requests
from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

server_url = "http://127.0.0.1:10000"

session = requests.Session()
session.trust_env = False


def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths


def encode_image(image_path):
    ext = os.path.splitext(image_path)[1].lower()
    mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}


def build_content(prompt, image_paths):
    return [{"type": "text", "text": prompt}] + [encode_image(path) for path in image_paths]


def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": ngram_window,
        },
        "stream": True,
    }
    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1200,
        stream=True,
    )
    response.raise_for_status()

    chunks = []
    for line in response.iter_lines(chunk_size=1, decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        delta = event["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            chunks.append(delta)
    print()
    return "".join(chunks)


# Single image supports two configs: gundam or base. Example below uses gundam.
generate("document parsing.", ["your_image.jpg"], image_mode="gundam", ngram_window=128)

# Multi image (base only)
generate("Multi page parsing.", ["page1.png", "page2.png"], image_mode="base", ngram_window=1024)

# PDF (base only)
generate("Multi page parsing.", pdf_to_images("your_doc.pdf", dpi=300), image_mode="base", ngram_window=1024)
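
The streaming loop inside `generate` can also be written as a standalone generator, which makes it possible to exercise the parsing without a running server. This sketch mirrors the logic of the loop above; `iter_content_deltas` is a hypothetical name, and the sample lines are hand-written stand-ins for what an OpenAI-style stream looks like.

```python
import json


def iter_content_deltas(lines):
    """Yield the text deltas from OpenAI-style streaming ("data: ...") lines."""
    for line in lines:
        # Skip blank keep-alive lines and anything that is not a data event.
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        delta = event["choices"][0].get("delta", {}).get("content", "")
        if delta:
            yield delta


# Hand-written sample stream for illustration.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
text = "".join(iter_content_deltas(sample_lines))
```

With a live response, the same generator can be fed from `response.iter_lines(decode_unicode=True)`.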

Visualization

Acknowledgement

We would like to thank Deepseek-OCR, Deepseek-OCR-2, and PaddleOCR for their valuable models and ideas.

Citation

Coming soon!

Downloads last month: 423

Safetensors

Model size: 3B params
Tensor type: BF16


Model tree for sabafallah/Unlimited-OCR-Universal

Base model: baidu/Unlimited-OCR
Finetuned (6): this model