PaddleOCR-VL-1.5 · OpenVINO IR
Recognition VLM of the PaddleOCR-VL-1.5 document-parsing pipeline — converted to OpenVINO™ IR for local inference on Intel CPU / GPU / NPU
Introduction
This repository hosts the OpenVINO™ IR build of the PaddleOCR-VL-1.5 recognition model — the ~0.9B multi-task vision-language model (VLM) that performs the actual content recognition in the PaddleOCR-VL-1.5 document-parsing pipeline. The original PaddlePaddle weights have been converted to OpenVINO Intermediate Representation so the model runs fully locally on Intel CPU, integrated/discrete GPU, and NPU via the OpenVINO runtime — no cloud service and no PaddlePaddle runtime required.
It recognizes text, tables, formulas and charts (mixed Chinese/English) from cropped document regions and outputs markdown-style content. In an end-to-end pipeline it is paired with a layout model (FionaGu1019/PP-DocLayoutV3-ov) that provides region detection and reading order.
What's in this repo
The model is exported as a multi-stage OpenVINO pipeline:
| Stage | Files | Role |
|---|---|---|
| Vision encoder | `vision_encoder.xml` / `.bin` | Patch embed + transformer encoder over image patches |
| Projector | `projector.xml` / `.bin` | Projects visual features into the text embedding space |
| Text embedding | `text_embed.xml` / `.bin` | Token → embedding |
| Text decoder | `text_decoder.xml` / `.bin` | Autoregressive decoder (stateful KV-cache) |
| LM head | `lm_head.xml` / `.bin` | Hidden state → vocabulary logits |
| Tokenizer | `tokenizer.xml` / `.bin`, `detokenizer.xml` / `.bin` | OpenVINO Tokenizers (no sentencepiece needed) |
| Aux | `position_embedding.npy`, `config.json`, `preprocessor_config.json`, `tokenizer_config.json`, `added_tokens.json` | Pos-embed table + configs |
- Type: multi-task recognition VLM (~0.9B)
- Tasks: `ocr` (text), `table`, `formula`, `chart`
- Tokenization: OpenVINO Tokenizers (the sentencepiece/torch tokenizer dependency is removed)
- Decoder: stateful (on-device KV cache) for faster decoding
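Because the model is split across many files, a quick completeness check after download can save a confusing load error later. The sketch below only uses the file names listed in the table above; `expected_files` and `missing_files` are hypothetical helper names, not part of any published pipeline.

```python
from pathlib import Path

# Stage names from the table above; each IR stage ships as an .xml/.bin pair.
IR_STAGES = [
    "vision_encoder", "projector", "text_embed", "text_decoder",
    "lm_head", "tokenizer", "detokenizer",
]
# Auxiliary files from the "Aux" row of the table.
AUX_FILES = [
    "position_embedding.npy", "config.json", "preprocessor_config.json",
    "tokenizer_config.json", "added_tokens.json",
]


def expected_files():
    """Return every file name that the stage table lists."""
    ir_files = [f"{stage}{ext}" for stage in IR_STAGES for ext in (".xml", ".bin")]
    return ir_files + AUX_FILES


def missing_files(model_dir):
    """Return the listed files that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    return [name for name in expected_files() if not (root / name).is_file()]
```

Calling `missing_files("PaddleOCR-VL-1.5-ov")` on a complete local copy should return an empty list.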
Usage
Recommended — the PaddleOCR-VL OpenVINO end-to-end pipeline
This is a multi-stage VLM (vision encode → projector → embed → decode-with-KV → lm-head), so the easiest way to use it is through the end-to-end pipeline, paired with the layout model. Download both:
```python
from modelscope import snapshot_download

vl_dir = snapshot_download("FionaGu1019/PaddleOCR-VL-1.5-ov")      # this model (recognition)
layout_dir = snapshot_download("FionaGu1019/PP-DocLayoutV3-ov")    # layout detection
```
The pipeline then runs: layout detection (PP-DocLayoutV3) → crop regions in reading order → recognition (this model) → markdown output. It supports image / multi-page TIFF / PDF / directory / URL inputs and runs on the Intel GPU by default (CPU fallback).
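The control flow described above can be sketched in a few lines. This is an illustration of the data flow only, not the reference pipeline: `Region`, `parse_page`, and the three callables (`detect_layout`, `crop`, `recognize`) are hypothetical names, and it assumes the layout stage already yields a task label and a reading-order index per region.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Region:
    box: Box     # where the region sits on the page
    task: str    # assumed to be one of "ocr", "table", "formula", "chart"
    order: int   # reading-order index supplied by the layout stage


def parse_page(page, detect_layout, crop, recognize) -> str:
    """Layout detection -> crop in reading order -> recognition -> markdown.

    detect_layout(page) returns Regions, crop(page, box) returns the cropped
    region, and recognize(crop, task) returns markdown text. All three are
    stand-ins for the real layout and recognition models.
    """
    regions = sorted(detect_layout(page), key=lambda r: r.order)
    parts = [recognize(crop(page, r.box), r.task) for r in regions]
    # Join non-empty region outputs with blank lines to form the page markdown.
    return "\n\n".join(p for p in parts if p)
```

Passing the models in as callables keeps the sketch independent of any particular runtime; in practice the reference pipeline owns these steps.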
Loading the stages with OpenVINO
```python
import openvino as ov
import openvino_tokenizers  # registers custom tokenizer ops; import before Core()

core = ov.Core()
core.set_property({"CACHE_DIR": ".ov_cache"})  # cache compiled kernels (big GPU speedup)

device = "GPU"              # "CPU" / "GPU" / "NPU"
md = "PaddleOCR-VL-1.5-ov"  # local directory holding the downloaded IR files

vision_encoder = core.compile_model(f"{md}/vision_encoder.xml", device)
projector = core.compile_model(f"{md}/projector.xml", device)
text_embed = core.compile_model(f"{md}/text_embed.xml", device)
text_decoder = core.compile_model(f"{md}/text_decoder.xml", device)  # stateful KV cache
lm_head = core.compile_model(f"{md}/lm_head.xml", device)

# OpenVINO Tokenizers ops run on CPU, so compile these on CPU regardless of `device`.
tokenizer = core.compile_model(f"{md}/tokenizer.xml", "CPU")
detokenizer = core.compile_model(f"{md}/detokenizer.xml", "CPU")

# Orchestration (preprocess, pos-embed interpolation, 2x2 spatial merge, greedy decode loop)
# is non-trivial; please reuse the reference pipeline rather than re-implementing it.
```
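For readers unfamiliar with the last orchestration step, the shape of a greedy decode loop over a stateful decoder looks roughly like this. It is a generic sketch, not the reference implementation: `next_logits` is a hypothetical stand-in for the text_embed → text_decoder → lm_head chain and is assumed to keep the KV cache internally, so only the newest token is fed after the initial prompt.

```python
from typing import Callable, List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(
    prompt_ids: Sequence[int],
    next_logits: Callable[[List[int]], Sequence[float]],
    eos_id: int,
    max_new_tokens: int = 64,
) -> List[int]:
    """Repeatedly pick the highest-scoring token until EOS or the length cap."""
    generated: List[int] = []
    step_input = list(prompt_ids)  # first call: the whole prompt (prefill)
    for _ in range(max_new_tokens):
        token = argmax(next_logits(step_input))
        if token == eos_id:
            break
        generated.append(token)
        step_input = [token]       # later calls: only the new token
    return generated
```

The real pipeline additionally merges visual embeddings into the prompt and manages position ids, which is why reusing it is the safer route.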
Notes
- This is a format conversion of the official PaddlePaddle model to OpenVINO IR; recognition behaviour is intended to match the original PaddleOCR-VL-1.5. For the original weights see PaddlePaddle/PaddleOCR-VL-1.5.
- The tokenizer is provided as OpenVINO Tokenizers IR, so inference needs only `openvino` + `openvino-tokenizers` (no `sentencepiece`).
- For best results across Intel hardware, prefer GPU when available and fall back to CPU.
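The "prefer GPU, fall back to CPU" advice can be expressed as a small selection helper. This is a minimal sketch: `pick_device` is a hypothetical name, and it assumes device names follow the form OpenVINO reports through `Core.available_devices`, such as `"CPU"`, `"GPU"`, or `"GPU.0"`.

```python
from typing import Iterable, Sequence


def pick_device(available: Iterable[str], preferred: Sequence[str] = ("GPU", "CPU")) -> str:
    """Return the first available device matching the preference order.

    A preference such as "GPU" matches both "GPU" and indexed names like "GPU.0".
    """
    names = list(available)
    for want in preferred:
        for name in names:
            if name == want or name.startswith(want + "."):
                return name
    raise RuntimeError(f"none of {list(preferred)} found in {names}")
```

With OpenVINO installed, `pick_device(ov.Core().available_devices)` yields a device string that can be passed to `core.compile_model`.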
Citation
If you find PaddleOCR-VL helpful, feel free to give the original project a star and citation.
@misc{cui2026paddleocrvl15multitask09bvlm,
title={PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing},
author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
year={2026},
eprint={2601.21957},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.21957},
}