PaddleOCR-VL-1.5 · OpenVINO IR

Recognition VLM of the PaddleOCR-VL-1.5 document-parsing pipeline — converted to OpenVINO™ IR for local inference on Intel CPU / GPU / NPU


Introduction

This repository hosts the OpenVINO™ IR build of the PaddleOCR-VL-1.5 recognition model: the ~0.9B multi-task vision-language model (VLM) that performs the actual content recognition in the PaddleOCR-VL-1.5 document-parsing pipeline. The original PaddlePaddle weights have been converted to OpenVINO Intermediate Representation, so the model runs fully locally on Intel CPU, integrated or discrete GPU, and NPU via the OpenVINO runtime. Inference needs no cloud service, no network connection, and no PaddlePaddle runtime.

It recognizes text, tables, formulas and charts (mixed Chinese/English) from cropped document regions and outputs markdown-style content. In an end-to-end pipeline it is paired with a layout model (FionaGu1019/PP-DocLayoutV3-ov) that provides region detection and reading order.

What's in this repo

The model is exported as a multi-stage OpenVINO pipeline:

| Stage | Files | Role |
| --- | --- | --- |
| Vision encoder | `vision_encoder.xml` / `.bin` | Patch embed + transformer encoder over image patches |
| Projector | `projector.xml` / `.bin` | Projects visual features into the text embedding space |
| Text embedding | `text_embed.xml` / `.bin` | Token → embedding |
| Text decoder | `text_decoder.xml` / `.bin` | Autoregressive decoder (stateful KV-cache) |
| LM head | `lm_head.xml` / `.bin` | Hidden state → vocabulary logits |
| Tokenizer | `tokenizer.xml` / `.bin`, `detokenizer.xml` / `.bin` | OpenVINO Tokenizers (no `sentencepiece` needed) |
| Aux | `position_embedding.npy`, `config.json`, `preprocessor_config.json`, `tokenizer_config.json`, `added_tokens.json` | Pos-embed table + configs |

Type: multi-task recognition VLM (~0.9B)
Tasks: ocr (text), table, formula, chart
Tokenization: OpenVINO Tokenizers (the sentencepiece/torch tokenizer dependency is removed)
Decoder: stateful (on-device KV cache) for faster decoding
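
Before loading anything, it can help to confirm that a downloaded copy contains every file listed above. The following is a minimal sketch, not official tooling: the helper names `expected_files` and `missing_files` are hypothetical, and the flat directory layout (each `.xml` next to its `.bin`) is an assumption based on the file table.

```python
from pathlib import Path

# IR stages from the file table; each is assumed to ship as <name>.xml + <name>.bin.
IR_STAGES = [
    "vision_encoder", "projector", "text_embed", "text_decoder",
    "lm_head", "tokenizer", "detokenizer",
]

# Auxiliary files from the file table.
AUX_FILES = [
    "position_embedding.npy", "config.json", "preprocessor_config.json",
    "tokenizer_config.json", "added_tokens.json",
]


def expected_files():
    """Return every file name the table lists, in table order."""
    files = []
    for stage in IR_STAGES:
        files += [f"{stage}.xml", f"{stage}.bin"]
    return files + AUX_FILES


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files() if not (root / name).is_file()]
```

Calling `missing_files` on the downloaded model directory returns an empty list when the copy is complete, and otherwise names the files to re-download.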

Usage

Recommended — the PaddleOCR-VL OpenVINO end-to-end pipeline

This is a multi-stage VLM (vision encode → projector → embed → decode-with-KV → lm-head), so the easiest way to use it is through the end-to-end pipeline, paired with the layout model. Download both:

from modelscope import snapshot_download

vl_dir     = snapshot_download("FionaGu1019/PaddleOCR-VL-1.5-ov")   # this model (recognition)
layout_dir = snapshot_download("FionaGu1019/PP-DocLayoutV3-ov")     # layout detection

The pipeline then runs: layout detection (PP-DocLayoutV3) → crop regions in reading order → recognition (this model) → markdown output. It supports image / multi-page TIFF / PDF / directory / URL inputs and runs on the Intel GPU by default (CPU fallback).
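
The control flow described above (detect, order, crop, recognize, join) can be sketched in a few lines. This is an illustration only and not the reference pipeline: `Region`, `TASK_FOR_LABEL`, `parse_page`, and the three callables are hypothetical stand-ins, and the layout label names are assumed rather than taken from PP-DocLayoutV3.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Region:
    box: Box
    label: str  # layout class; the label names used here are assumptions
    order: int  # reading-order index assigned by the layout model


# Hypothetical mapping from layout label to the recognition tasks listed above.
TASK_FOR_LABEL = {"text": "ocr", "table": "table", "formula": "formula", "chart": "chart"}


def parse_page(image, detect_layout, crop, recognize) -> str:
    """Run layout -> crop -> recognition and join the results as markdown.

    detect_layout(image) -> iterable of Region
    crop(image, box)     -> cropped image
    recognize(crop, task)-> recognized text for that region
    """
    parts = []
    for region in sorted(detect_layout(image), key=lambda r: r.order):
        task = TASK_FOR_LABEL.get(region.label)
        if task is None:
            continue  # labels without a recognition task are skipped in this sketch
        text = recognize(crop(image, region.box), task).strip()
        if text:
            parts.append(text)
    return "\n\n".join(parts)
```

Passing the models in as callables keeps the ordering logic independent of OpenVINO, so it can be exercised with stubs before the compiled stages are wired in.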

Loading the stages with OpenVINO

import openvino as ov
import openvino_tokenizers  # registers custom tokenizer ops; import before Core()

core = ov.Core()
core.set_property({"CACHE_DIR": ".ov_cache"})   # cache compiled kernels (big GPU speedup)
device = "GPU"                                    # "CPU" / "GPU" / "NPU"

md = "PaddleOCR-VL-1.5-ov"
vision_encoder = core.compile_model(f"{md}/vision_encoder.xml", device)
projector      = core.compile_model(f"{md}/projector.xml", device)
text_embed     = core.compile_model(f"{md}/text_embed.xml", device)
text_decoder   = core.compile_model(f"{md}/text_decoder.xml", device)   # stateful KV cache
lm_head        = core.compile_model(f"{md}/lm_head.xml", device)
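# OpenVINO Tokenizers models use string tensors and custom ops that run on CPU,
# so compile them on "CPU" regardless of `device`.
tokenizer      = core.compile_model(f"{md}/tokenizer.xml", "CPU")
detokenizer    = core.compile_model(f"{md}/detokenizer.xml", "CPU")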
# Orchestration (preprocess, pos-embed interpolation, 2x2 spatial merge, greedy decode loop)
# is non-trivial — please reuse the reference pipeline rather than re-implementing it.
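
To make the last item in that comment concrete, here is a generic greedy decode loop. It is a simplified sketch and not the reference implementation: `greedy_decode` and `next_logits` are hypothetical names, and `next_logits` stands in for the whole text-embed, decoder, and LM-head chain. With the stateful decoder, a real loop would feed only the newest token each step and let the on-device KV cache hold the history; the sketch passes the full id list to keep the stand-in stateless.

```python
from typing import Callable, List

import numpy as np


def greedy_decode(
    next_logits: Callable[[List[int]], np.ndarray],
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int = 64,
) -> List[int]:
    """Pick the argmax token at each step until EOS or the length limit.

    next_logits(ids) must return a 1-D array of vocabulary logits for the
    token that follows `ids`.
    """
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = int(np.argmax(next_logits(ids)))
        if token == eos_id:
            break
        generated.append(token)
        ids.append(token)
    return generated
```

The returned ids exclude both the prompt and the EOS token, so they can be passed straight to a detokenizer.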

Notes

This is a format conversion of the official PaddlePaddle model to OpenVINO IR; recognition behaviour is intended to match the original PaddleOCR-VL-1.5. For the original weights see PaddlePaddle/PaddleOCR-VL-1.5.
The tokenizer is provided as OpenVINO Tokenizers IR, so inference needs only openvino + openvino-tokenizers (no sentencepiece).
For best results across Intel hardware, prefer GPU when available and fall back to CPU.
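
The GPU-first, CPU-fallback preference can be expressed as a small helper. This is a sketch under assumptions: `pick_device` is a hypothetical name, and it expects a list of device strings in the form OpenVINO reports them (for example `"GPU"`, or `"GPU.0"` and `"GPU.1"` when several GPUs are present).

```python
from typing import Iterable, Sequence


def pick_device(available: Iterable[str], preferred: Sequence[str] = ("GPU", "CPU")) -> str:
    """Return the first available device matching the preference order.

    A preferred name matches either exactly ("GPU") or as an indexed
    variant ("GPU.0"), in which case the indexed name is returned.
    """
    available = list(available)
    for name in preferred:
        for dev in available:
            if dev == name or dev.startswith(name + "."):
                return dev
    raise RuntimeError(f"none of {list(preferred)} found in {available}")
```

With OpenVINO installed, the list would come from `ov.Core().available_devices`, and the result can be passed as the device name to `core.compile_model`.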

Citation

If you find PaddleOCR-VL helpful, feel free to give the original project a star and citation.

@misc{cui2026paddleocrvl15multitask09bvlm,
      title={PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing},
      author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
      year={2026},
      eprint={2601.21957},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.21957},
}
