docs: README updated for optimized usage with the transformers library

#60
by sayed99 - opened
Files changed (1)
  1. README.md +99 -16
README.md CHANGED
@@ -72,9 +72,11 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
 
 ## News
- * ```2025.10.16``` 🚀 We release [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), — a multilingual documents parsing via a 0.9B Ultra-Compact Vision-Language Model with SOTA performance.
- * ```2025.10.29``` Supports calling the core module PaddleOCR-VL-0.9B of PaddleOCR-VL via the `transformers` library.
 
 ## Usage
 
@@ -113,15 +115,25 @@ for res in output:
 
 ### Accelerate VLM Inference via Optimized Inference Servers
 
- 1. Start the VLM inference server (the default port is `8080`):
 
- ```bash
- docker run \
-     --rm \
-     --gpus all \
-     --network host \
-     ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server
- ```
 
 2. Call the PaddleOCR CLI or Python API:
 
 ```bash
@@ -130,6 +142,7 @@ for res in output:
     --vl_rec_backend vllm-server \
     --vl_rec_server_url http://127.0.0.1:8080/v1
 ```
 ```python
 from paddleocr import PaddleOCRVL
 pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
@@ -154,9 +167,14 @@ from PIL import Image
 import torch
 from transformers import AutoModelForCausalLM, AutoProcessor
 
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
- CHOSEN_TASK = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
 PROMPTS = {
     "ocr": "OCR:",
     "table": "Table Recognition:",
@@ -164,8 +182,6 @@ PROMPTS = {
     "chart": "Chart Recognition:",
 }
 
- model_path = "PaddlePaddle/PaddleOCR-VL"
- image_path = "test.png"
 image = Image.open(image_path).convert("RGB")
 
 model = AutoModelForCausalLM.from_pretrained(
@@ -177,7 +193,7 @@ messages = [
     {"role": "user",
      "content": [
         {"type": "image", "image": image},
-        {"type": "text", "text": PROMPTS[CHOSEN_TASK]},
     ]
     }
 ]
@@ -186,7 +202,7 @@ inputs = processor.apply_chat_template(
     tokenize=True,
     add_generation_prompt=True,
     return_dict=True,
-    return_tensors="pt"
 ).to(DEVICE)
 
 outputs = model.generate(**inputs, max_new_tokens=1024)
@@ -194,6 +210,73 @@ outputs = processor.batch_decode(outputs, skip_special_tokens=True)[0]
 print(outputs)
 ```
 
 ## Performance
 
 ### Page-Level Document Parsing
@@ -346,4 +429,4 @@ If you find PaddleOCR-VL helpful, feel free to give us a star and citation.
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2510.14528},
 }
- ```
 
 ## News
+ * ```2025.11.07``` 🚀 Enabled `flash-attn` when running PaddleOCR-VL-0.9B via the `transformers` library for faster inference.
+ * ```2025.11.04``` 🌟 PaddleOCR-VL-0.9B is now officially supported in `vLLM`.
+ * ```2025.10.29``` 🤗 The core module of PaddleOCR-VL, PaddleOCR-VL-0.9B, can now be called via the `transformers` library.
+ * ```2025.10.16``` 🚀 We release [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR) — a multilingual document parsing solution built on a 0.9B Ultra-Compact Vision-Language Model with SOTA performance.
 
 ## Usage
 
 
 
 ### Accelerate VLM Inference via Optimized Inference Servers
 
+ 1. Start the VLM inference server:
 
+ You can start the vLLM inference server in one of two ways:
+
+ - Method 1: via the PaddleOCR Docker image
+
+ ```bash
+ docker run \
+     --rm \
+     --gpus all \
+     --network host \
+     ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest \
+     paddleocr genai_server --model_name PaddleOCR-VL-0.9B --host 0.0.0.0 --port 8080 --backend vllm
+ ```
+
+ - Method 2: via vLLM directly, following the official recipe (a rough sketch follows below)
+
+ [vLLM: PaddleOCR-VL Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/PaddlePaddle/PaddleOCR-VL.html)
+
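For Method 2, a minimal sketch of what the direct launch might look like, assuming a recent vLLM release with built-in PaddleOCR-VL support; the exact flags, and whether `--trust-remote-code` is still needed, depend on your vLLM version, so treat the guide above as authoritative:

```bash
# Hypothetical direct launch via vLLM; verify the flags against the usage guide.
vllm serve PaddlePaddle/PaddleOCR-VL \
    --trust-remote-code \
    --port 8080
```

With a vLLM-backed server you can usually sanity-check the endpoint with `curl http://127.0.0.1:8080/v1/models` before wiring it into the pipeline.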
 2. Call the PaddleOCR CLI or Python API:
 
 ```bash
 
     --vl_rec_backend vllm-server \
     --vl_rec_server_url http://127.0.0.1:8080/v1
 ```
+
 ```python
 from paddleocr import PaddleOCRVL
 pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
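For context, a minimal sketch of how the constructed pipeline is typically used; the input path and the `save_to_*` helpers here are assumptions based on the PaddleOCR 3.x result API, and the unchanged part of the README (the `for res in output:` loop referenced in the hunk headers) remains the authoritative version:

```python
# Sketch of typical usage of the pipeline constructed above (assumed API).
output = pipeline.predict("path/to/document.png")
for res in output:
    res.print()                               # print the parsed result
    res.save_to_json(save_path="output")      # structured output
    res.save_to_markdown(save_path="output")  # Markdown rendering
```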
 
 import torch
 from transformers import AutoModelForCausalLM, AutoProcessor
 
+ # ---- Settings ----
+ model_path = "PaddlePaddle/PaddleOCR-VL"
+ image_path = "test.png"
+ task = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
+ # ------------------
+
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
 PROMPTS = {
     "ocr": "OCR:",
     "table": "Table Recognition:",
 
     "chart": "Chart Recognition:",
 }
 
 image = Image.open(image_path).convert("RGB")
 
 model = AutoModelForCausalLM.from_pretrained(
 
     {"role": "user",
      "content": [
         {"type": "image", "image": image},
+        {"type": "text", "text": PROMPTS[task]},
     ]
     }
 ]
 
     tokenize=True,
     add_generation_prompt=True,
     return_dict=True,
+    return_tensors="pt"
 ).to(DEVICE)
 
 outputs = model.generate(**inputs, max_new_tokens=1024)
 
 print(outputs)
 ```
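Note that `batch_decode` on the full `generate` output also decodes the prompt portion. If you only want the newly generated text, a small variation on the last lines, assuming the `inputs` dict produced by `apply_chat_template` above:

```python
# Variation: strip the prompt tokens before decoding (follows the snippet above).
generated_ids = model.generate(**inputs, max_new_tokens=1024)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```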
 
+ <details>
+ <summary>👉 Click to expand: Use flash-attn to boost performance and reduce memory usage</summary>
+
+ ```shell
+ # Ensure FlashAttention-2 is installed
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ from PIL import Image
+
+ # ---- Settings ----
+ model_path = "PaddlePaddle/PaddleOCR-VL"
+ image_path = "test.png"
+ task = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
+ # ------------------
+
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ ).to(dtype=torch.bfloat16, device=DEVICE).eval()
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ PROMPTS = {
+     "ocr": "OCR:",
+     "table": "Table Recognition:",
+     "chart": "Chart Recognition:",
+     "formula": "Formula Recognition:",
+ }
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": Image.open(image_path).convert("RGB")},
+             {"type": "text", "text": PROMPTS[task]},
+         ]
+     }
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt"
+ ).to(DEVICE)
+
+ with torch.inference_mode():
+     out = model.generate(
+         **inputs,
+         max_new_tokens=1024,
+         do_sample=False,
+         use_cache=True,
+     )
+
+ outputs = processor.batch_decode(out, skip_special_tokens=True)[0]
+ print(outputs)
+ ```
+
+ </details>
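If `flash-attn` may not be installed on every machine, one pattern (an editorial sketch, not part of the snippet above) is to request FlashAttention-2 only when the package is importable, so the same script still runs with the model's default attention:

```python
import importlib.util

import torch
from transformers import AutoModelForCausalLM

model_path = "PaddlePaddle/PaddleOCR-VL"

# Only request FlashAttention-2 when the flash_attn package is actually installed.
extra_kwargs = {}
if importlib.util.find_spec("flash_attn") is not None:
    extra_kwargs["attn_implementation"] = "flash_attention_2"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    **extra_kwargs,
).to("cuda" if torch.cuda.is_available() else "cpu").eval()
```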
 
 ## Performance
 
 ### Page-Level Document Parsing
 
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2510.14528},
 }
+ ```