✨ Introduction

MinerU-Popo is a lightweight, universal framework for POst-Processing OCR outputs, bridging the gap between page-level OCR parsing and document-level semantic structure. It constructs a document tree using a 4B post-processing model that performs four subtasks: table truncation analysis, text truncation analysis, title hierarchy analysis, and image-text association analysis. We address the challenges of cross-page geometric discontinuity, redundant document parsing, and scalability to long documents via three components:

Task-Oriented Data Engine: Generate representative training data and simplify the task-specific input.
Dynamic Chunking and Synchronization: Process long documents in dynamic chunks and reduce deviations across chunks to preserve global consistency.
Document Enrichment: Construct the document tree structurally, generate summaries semantically, and split long-section nodes.
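
To make the tree-construction step more concrete, here is a minimal sketch of how heading levels (such as those a title hierarchy analysis might produce) can be turned into a document tree. It is an illustration only, not MinerU-Popo's implementation: the `Node` class, the `build_tree` and `outline` functions, and the `(level, title)` input format are all hypothetical names and assumptions introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; not a MinerU-Popo data structure."""
    title: str
    level: int  # 0 is reserved for the synthetic document root
    children: list[Node] = field(default_factory=list)


def build_tree(headings: list[tuple[int, str]]) -> Node:
    """Build a tree from (level, title) pairs given in reading order.

    Assumption: level 1 is the top heading level, larger numbers are deeper.
    """
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently added node
    for level, title in headings:
        if level < 1:
            raise ValueError("heading levels must be >= 1")
        # Pop until the top of the stack is a strictly shallower ancestor.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(title, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node: Node, depth: int = 0) -> list[str]:
    """Return an indented outline of the subtree, one line per node."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up headings for illustration; "Details" skips a level on purpose.
headings = [
    (1, "Introduction"),
    (2, "Background"),
    (2, "Scope"),
    (1, "Method"),
    (3, "Details"),
]
tree = build_tree(headings)
print("\n".join(outline(tree)))
```

The stack keeps the path from the root to the latest heading, so each new heading attaches to its nearest shallower ancestor even when a level is skipped (here, "Details" at level 3 attaches directly to "Method").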

📊 Performance

Better Hierarchy (TEDS) after Post-Processing

Basic OCR	Before	After
MinerU	53.7	90.6
MonkeyOCR	48.9	87.4
Dolphin	60.4	83.5
PaddleOCR	59.3	82.6
GLM-OCR	53.5	81.8

Advantages Compared to Directly Using Pre-trained Model

Model	TEDS	Doc/s
MinerU-Popo	90.6	0.37
Qwen3-VL-2B	21.2	0.22
Qwen3-VL-4B	56.5	0.20
Qwen3-VL-8B	65.9	0.16
Qwen3-VL-32B	78.0	0.04

Benefits for Downstream Retrieval and Analysis (Acc on ViDoRe V3)

Method	C.S.	Fin.	H.R.	Ind.	Phar.
MinerU-Popo	84.4	49.5	66.8	58.7	71.6
Raw RAG	82.3	48.7	63.2	60.4	64.4
Visual RAG	80.7	58.4	64.8	59.7	67.6

⚙️ Setup

Please refer to https://github.com/opendatalab/MinerU-Popo

Model size: 4B params (Safetensors)
Tensor type: F32

Model tree for DreamEternal/MinerU-Popo

Base model: Qwen/Qwen3-VL-4B-Instruct
Finetuned: this model