Update README.md
README.md
CHANGED
|
@@ -1,22 +1,76 @@
|
|
| 1 |
# Infinity-Parser-7B
|
| 2 |
|
| 3 |
-
<
|
| 4 |
-
|
| 5 |
-
<a href="https://
|
| 6 |
|
| 7 |
# Introduction
|
| 8 |
|
| 9 |
We develop Infinity-Parser, an end-to-end scanned document parsing model trained with reinforcement learning. By incorporating verifiable rewards based on layout and content, Infinity-Parser maintains the original document's structure and content with high fidelity. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models while preserving the model’s general multimodal understanding capability.
|
| 10 |
|
| 11 |
# Architecture
|
| 12 |
|
| 13 |
Overview of the Infinity-Parser training framework. Our model is optimized via reinforcement fine-tuning with edit-distance, layout, and order-based rewards.
|
| 14 |
|
| 15 |

|
| 16 |
|
| 17 |
# Quick Start
|
| 18 |
|
| 19 |
-
## Inference
|
| 20 |
|
| 21 |
```python
|
| 22 |
import torch
|
|
@@ -26,7 +80,7 @@ from qwen_vl_utils import process_vision_info
|
|
| 26 |
model_path = "infly/Infinity-Parser-7B"
|
| 27 |
prompt = "Please transform the document’s contents into Markdown format."
|
| 28 |
|
| 29 |
-
print(
|
| 30 |
# Default: Load the model on the available device(s)
|
| 31 |
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
|
| 32 |
# model_path, torch_dtype="auto", device_map="auto"
|
|
@@ -48,7 +102,7 @@ min_pixels = 256 * 28 * 28 # 448 * 448
|
|
| 48 |
max_pixels = 2304 * 28 * 28 # 1344 * 1344
|
| 49 |
processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)
|
| 50 |
|
| 51 |
-
print(
|
| 52 |
messages = [
|
| 53 |
{
|
| 54 |
"role": "user",
|
|
@@ -75,7 +129,7 @@ inputs = processor(
|
|
| 75 |
)
|
| 76 |
inputs = inputs.to("cuda")
|
| 77 |
|
| 78 |
-
print(
|
| 79 |
generated_ids = model.generate(**inputs, max_new_tokens=4096)
|
| 80 |
generated_ids_trimmed = [
|
| 81 |
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
|
@@ -86,9 +140,16 @@ output_text = processor.batch_decode(
|
|
| 86 |
print(output_text)
|
| 87 |
```
|
| 88 |
|
| 89 |
# Citation
|
| 90 |
|
| 91 |
-
```
|
| 92 |
@misc{wang2025infinityparserlayoutaware,
|
| 93 |
title={Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing},
|
| 94 |
author={Baode Wang and Biao Wu and Weizhen Li and Meng Fang and Zuming Huang and Jun Huang and Haozhe Wang and Yanjie Liang and Ling Chen and Wei Chu and Yuan Qi},
|
|
|
|
| 1 |
# Infinity-Parser-7B
|
| 2 |
|
| 3 |
+
<div align="left">
|
| 4 |
+
|
| 5 |
+
💻 <a href="https://github.com/infly-ai/INF-MLLM/tree/main/Infinity-Parser">Model</a> |
|
| 6 |
+
📊 <a href="https://huggingface.co/datasets/infly/Infinity-Doc-55K">Dataset</a> |
|
| 7 |
+
📄 <a href="https://arxiv.org/pdf/2506.03197">Paper</a> |
|
| 8 |
+
🚀 <a href="https://huggingface.co/spaces/infly/Infinity-Parser-Demo">Demo</a>
|
| 9 |
+
|
| 10 |
+
</div>
|
| 11 |
|
| 12 |
# Introduction
|
| 13 |
|
| 14 |
We develop Infinity-Parser, an end-to-end scanned document parsing model trained with reinforcement learning. By incorporating verifiable rewards based on layout and content, Infinity-Parser maintains the original document's structure and content with high fidelity. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models while preserving the model’s general multimodal understanding capability.
|
| 15 |
|
| 16 |
+
## Key Features
|
| 17 |
+
|
| 18 |
+
- LayoutRL Framework: a reinforcement learning framework that explicitly trains models to be layout-aware through verifiable multi-aspect rewards combining edit distance, paragraph accuracy, and reading order preservation.
|
| 19 |
+
|
| 20 |
+
- Infinity-Doc-400K Dataset: a large-scale dataset of 400K scanned documents that integrates high-quality synthetic data with diverse real-world samples, featuring rich layout variations and comprehensive structural annotations.
|
| 21 |
+
|
| 22 |
+
- Infinity-Parser Model: a VLM-based parser that achieves new state-of-the-art performance on OCR, table and formula extraction, and reading-order detection benchmarks in both English and Chinese, while maintaining nearly the same general multimodal understanding capability as the base model.
|
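To make the idea of a multi-aspect reward concrete, here is a minimal sketch of how the three signals named above could be combined into one scalar. It is an illustration only, not the Infinity-Parser implementation: the function names, the equal weights, the use of `difflib` similarity as a stand-in for normalized edit distance, and the exact-match paragraph proxy are all assumptions.

```python
from difflib import SequenceMatcher


def text_similarity(pred, ref):
    """Similarity in [0, 1]; an assumed stand-in for 1 - normalized edit distance."""
    if not pred and not ref:
        return 1.0
    return SequenceMatcher(None, pred, ref).ratio()


def paragraph_reward(pred_paras, ref_paras):
    """Fraction of reference paragraphs reproduced exactly (hypothetical proxy)."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    pred_set = set(pred_paras)
    return sum(p in pred_set for p in ref_paras) / len(ref_paras)


def order_reward(pred_paras, ref_paras):
    """Fraction of matched paragraph pairs that keep their relative order."""
    positions = [pred_paras.index(p) for p in ref_paras if p in pred_paras]
    pairs = [
        (i, j)
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
    ]
    if not pairs:
        return 1.0
    return sum(positions[i] < positions[j] for i, j in pairs) / len(pairs)


def split_paragraphs(markdown):
    """Split Markdown text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in markdown.split("\n\n") if p.strip()]


def layout_reward(pred_md, ref_md, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three signals; the weights are an assumption."""
    pred_paras = split_paragraphs(pred_md)
    ref_paras = split_paragraphs(ref_md)
    scores = (
        text_similarity(pred_md, ref_md),
        paragraph_reward(pred_paras, ref_paras),
        order_reward(pred_paras, ref_paras),
    )
    return sum(w * s for w, s in zip(weights, scores))
```

Because every term is computed directly from the prediction and a reference, the reward is verifiable: no learned judge is needed to score an output.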
| 23 |
+
|
| 24 |
# Architecture
|
| 25 |
|
| 26 |
Overview of the Infinity-Parser training framework. Our model is optimized via reinforcement fine-tuning with edit-distance, layout, and order-based rewards.
|
| 27 |
|
| 28 |

|
| 29 |
|
| 30 |
+
# Performance
|
| 31 |
+
|
| 32 |
+
## olmOCR-bench
|
| 33 |
+

|
| 34 |
+
|
| 35 |
+
## OmniDocBench
|
| 36 |
+

|
| 37 |
+
|
| 38 |
+
## Table Recognition
|
| 39 |
+

|
| 40 |
+
|
| 41 |
# Quick Start
|
| 42 |
|
| 43 |
+
## vLLM Inference
|
| 44 |
+
We recommend using the vLLM backend for accelerated inference.
|
| 45 |
+
It supports image and PDF inputs, automatically parses the document content, and exports the results in Markdown format to a specified directory.
|
| 46 |
+
|
| 47 |
+
Before starting, make sure that **PyTorch** is correctly installed according to the official installation guide at [https://pytorch.org/](https://pytorch.org/).
|
| 48 |
+
|
| 49 |
+
```shell
|
| 50 |
+
pip install .
|
| 51 |
+
|
| 52 |
+
parser --model /path/model --input dir/PDF/Image --output output_folders --batch_size 128 --tp 1
|
| 53 |
+
```
|
| 54 |
+
|
| 55 |
+
Adjust the tensor parallelism (`--tp`) value (1, 2, or 4) and the batch size (`--batch_size`) according to the number of GPUs and the available memory.
|
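As a rough illustration of that adjustment, the sketch below picks the largest listed `tp` value that does not exceed the GPU count and assembles the `parser` command shown above. The helper names and the "largest value that fits" rule are assumptions for this example; only the flags themselves come from the command above.

```python
import shlex

SUPPORTED_TP = (1, 2, 4)  # the tp values listed in this README


def choose_tp(num_gpus):
    """Largest supported tp value not exceeding the GPU count (assumed heuristic)."""
    candidates = [tp for tp in SUPPORTED_TP if tp <= num_gpus]
    if not candidates:
        raise ValueError("at least one GPU is required")
    return max(candidates)


def build_command(model, input_path, output_dir, num_gpus, batch_size=128):
    """Assemble the parser CLI call as an argument list for subprocess.run."""
    return [
        "parser",
        "--model", model,
        "--input", input_path,
        "--output", output_dir,
        "--batch_size", str(batch_size),
        "--tp", str(choose_tp(num_gpus)),
    ]


if __name__ == "__main__":
    # With 3 GPUs the heuristic falls back to tp=2.
    print(shlex.join(build_command("/path/model", "docs/", "output_folders", num_gpus=3)))
```

If a run exhausts GPU memory, lowering `batch_size` is the first thing to try; the default of 128 here simply mirrors the command above.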
| 56 |
+
|
| 57 |
+
<details>
|
| 58 |
+
<summary> [Contents of the result folder] </summary>
|
| 59 |
+
The result folder contains the following contents:
|
| 60 |
+
|
| 61 |
+
```
|
| 62 |
+
output_folders/
|
| 63 |
+
├── <file_name>/output.md
|
| 64 |
+
├── ...
|
| 65 |
+
├── ...
|
| 66 |
+
```
|
| 67 |
+
|
| 68 |
+
</details>
|
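Given the `output_folders/<file_name>/output.md` layout described above, the parsed results can be gathered with a few lines of Python. This is a minimal sketch; `collect_markdown` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def collect_markdown(output_root):
    """Map each <file_name> folder to the text of its output.md."""
    results = {}
    # Only direct subfolders that contain an output.md are picked up.
    for md_path in sorted(Path(output_root).glob("*/output.md")):
        results[md_path.parent.name] = md_path.read_text(encoding="utf-8")
    return results
```

For example, `collect_markdown("output_folders")` returns a dictionary keyed by input file name, which is convenient for downstream evaluation or indexing.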
| 69 |
+
|
| 70 |
+
## Using Transformers for Inference
|
| 71 |
+
|
| 72 |
+
<details>
|
| 73 |
+
<summary> Transformers Inference Example </summary>
|
| 74 |
|
| 75 |
```python
|
| 76 |
import torch
|
|
|
|
| 80 |
model_path = "infly/Infinity-Parser-7B"
|
| 81 |
prompt = "Please transform the document’s contents into Markdown format."
|
| 82 |
|
| 83 |
+
print("Loading model and processor...")
|
| 84 |
# Default: Load the model on the available device(s)
|
| 85 |
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
|
| 86 |
# model_path, torch_dtype="auto", device_map="auto"
|
|
|
|
| 102 |
max_pixels = 2304 * 28 * 28 # 1344 * 1344
|
| 103 |
processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)
|
| 104 |
|
| 105 |
+
print("Preparing messages for inference...")
|
| 106 |
messages = [
|
| 107 |
{
|
| 108 |
"role": "user",
|
|
|
|
| 129 |
)
|
| 130 |
inputs = inputs.to("cuda")
|
| 131 |
|
| 132 |
+
print("Generating results...")
|
| 133 |
generated_ids = model.generate(**inputs, max_new_tokens=4096)
|
| 134 |
generated_ids_trimmed = [
|
| 135 |
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
|
|
|
| 140 |
print(output_text)
|
| 141 |
```
|
| 142 |
|
| 143 |
+
</details>
|
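The `min_pixels` and `max_pixels` comments in the example equate each budget with a square image size. The short check below reproduces that arithmetic, assuming (as the `28 * 28` factors suggest) that the budget is counted in 28 x 28 pixel units; the helper names are hypothetical.

```python
import math

PATCH = 28  # side of one pixel unit, as implied by the `28 * 28` factors


def pixel_budget(units):
    """Total pixel count for a given number of 28 x 28 units."""
    return units * PATCH * PATCH


def square_side(units):
    """Side length of the square image that exactly fills the budget."""
    side_units = math.isqrt(units)
    if side_units * side_units != units:
        raise ValueError("units must be a perfect square")
    return side_units * PATCH


if __name__ == "__main__":
    print(pixel_budget(256), square_side(256))    # 200704 448
    print(pixel_budget(2304), square_side(2304))  # 1806336 1344
```

Raising `max_pixels` lets larger page images through at the cost of more visual tokens and memory, so it is worth tuning together with the batch size.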
| 144 |
+
|
| 145 |
+
# Visualization
|
| 146 |
+
|
| 147 |
+
## Comparison Examples
|
| 148 |
+

|
| 149 |
+
|
| 150 |
# Citation
|
| 151 |
|
| 152 |
+
```
|
| 153 |
@misc{wang2025infinityparserlayoutaware,
|
| 154 |
title={Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing},
|
| 155 |
author={Baode Wang and Biao Wu and Weizhen Li and Meng Fang and Zuming Huang and Jun Huang and Haozhe Wang and Yanjie Liang and Ling Chen and Wei Chu and Yuan Qi},
|