docs: README updated for optimized usage with the transformers library

#60
by sayed99 - opened
Files changed (1)
  1. README.md +99 -16
README.md CHANGED
@@ -72,9 +72,11 @@ PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vi
 
 ## News
- * ```2025.10.16``` 🚀 We release [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), — a multilingual documents parsing via a 0.9B Ultra-Compact Vision-Language Model with SOTA performance.
- * ```2025.10.29``` Supports calling the core module PaddleOCR-VL-0.9B of PaddleOCR-VL via the `transformers` library.
 
 ## Usage
 
@@ -113,15 +115,25 @@ for res in output:
 
 ### Accelerate VLM Inference via Optimized Inference Servers
 
- 1. Start the VLM inference server (the default port is `8080`):
 
- ```bash
- docker run \
-     --rm \
-     --gpus all \
-     --network host \
-     ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server
- ```
 
 2. Call the PaddleOCR CLI or Python API:
 
 ```bash
@@ -130,6 +142,7 @@ for res in output:
     --vl_rec_backend vllm-server \
     --vl_rec_server_url http://127.0.0.1:8080/v1
 ```
 ```python
 from paddleocr import PaddleOCRVL
 pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
@@ -154,9 +167,14 @@ from PIL import Image
 import torch
 from transformers import AutoModelForCausalLM, AutoProcessor
 
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
- CHOSEN_TASK = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
 PROMPTS = {
     "ocr": "OCR:",
     "table": "Table Recognition:",
@@ -164,8 +182,6 @@ PROMPTS = {
     "chart": "Chart Recognition:",
 }
 
- model_path = "PaddlePaddle/PaddleOCR-VL"
- image_path = "test.png"
 image = Image.open(image_path).convert("RGB")
 
 model = AutoModelForCausalLM.from_pretrained(
@@ -177,7 +193,7 @@ messages = [
     {"role": "user",
      "content": [
         {"type": "image", "image": image},
-        {"type": "text", "text": PROMPTS[CHOSEN_TASK]},
     ]
     }
 ]
@@ -186,7 +202,7 @@ inputs = processor.apply_chat_template(
     tokenize=True,
     add_generation_prompt=True,
     return_dict=True,
-    return_tensors="pt"
 ).to(DEVICE)
 
 outputs = model.generate(**inputs, max_new_tokens=1024)
@@ -194,6 +210,73 @@ outputs = processor.batch_decode(outputs, skip_special_tokens=True)[0]
 print(outputs)
 ```
 
 ## Performance
 
 ### Page-Level Document Parsing
@@ -346,4 +429,4 @@ If you find PaddleOCR-VL helpful, feel free to give us a star and citation.
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2510.14528},
 }
- ```
 
 ## News
+ * ```2025.11.07``` 🚀 Enabled `flash-attn` when running PaddleOCR-VL-0.9B via the `transformers` library for faster inference.
+ * ```2025.11.04``` 🌟 PaddleOCR-VL-0.9B is now officially supported in `vLLM`.
+ * ```2025.10.29``` 🤗 The core module of PaddleOCR-VL, PaddleOCR-VL-0.9B, can now be called via the `transformers` library.
+ * ```2025.10.16``` 🚀 We release [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR) — a multilingual document parsing solution built on a 0.9B Ultra-Compact Vision-Language Model with SOTA performance.
 
 ## Usage
 
 
 
 ### Accelerate VLM Inference via Optimized Inference Servers
 
+ 1. Start the VLM inference server:
 
+ You can start the vLLM inference server in one of two ways:
+
+ - Method 1: via the PaddleOCR Docker image
+
+ ```bash
+ docker run \
+     --rm \
+     --gpus all \
+     --network host \
+     ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest \
+     paddleocr genai_server --model_name PaddleOCR-VL-0.9B --host 0.0.0.0 --port 8080 --backend vllm
+ ```
+
+ - Method 2: via vLLM directly, following the official recipe (a rough sketch follows below)
+
+ [vLLM: PaddleOCR-VL Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/PaddlePaddle/PaddleOCR-VL.html)
+
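For Method 2, a minimal sketch of what the direct launch might look like, assuming a recent vLLM release with built-in PaddleOCR-VL support; the exact flags, and whether `--trust-remote-code` is still needed, depend on your vLLM version, so treat the guide above as authoritative:

```bash
# Hypothetical direct launch via vLLM; verify the flags against the usage guide.
vllm serve PaddlePaddle/PaddleOCR-VL \
    --trust-remote-code \
    --port 8080
```

With a vLLM-backed server you can usually sanity-check the endpoint with `curl http://127.0.0.1:8080/v1/models` before wiring it into the pipeline.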
 2. Call the PaddleOCR CLI or Python API:
 
 ```bash
 
     --vl_rec_backend vllm-server \
     --vl_rec_server_url http://127.0.0.1:8080/v1
 ```
+
 ```python
 from paddleocr import PaddleOCRVL
 pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
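For context, a minimal sketch of how the constructed pipeline is typically used; the input path and the `save_to_*` helpers here are assumptions based on the PaddleOCR 3.x result API, and the unchanged part of the README (the `for res in output:` loop referenced in the hunk headers) remains the authoritative version:

```python
# Sketch of typical usage of the pipeline constructed above (assumed API).
output = pipeline.predict("path/to/document.png")
for res in output:
    res.print()                               # print the parsed result
    res.save_to_json(save_path="output")      # structured output
    res.save_to_markdown(save_path="output")  # Markdown rendering
```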
 
 import torch
 from transformers import AutoModelForCausalLM, AutoProcessor
 
+ # ---- Settings ----
+ model_path = "PaddlePaddle/PaddleOCR-VL"
+ image_path = "test.png"
+ task = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
+ # ------------------
+
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
 PROMPTS = {
     "ocr": "OCR:",
     "table": "Table Recognition:",
 
     "chart": "Chart Recognition:",
 }
 
 image = Image.open(image_path).convert("RGB")
 
 model = AutoModelForCausalLM.from_pretrained(
 
     {"role": "user",
      "content": [
         {"type": "image", "image": image},
+        {"type": "text", "text": PROMPTS[task]},
     ]
     }
 ]
 
     tokenize=True,
     add_generation_prompt=True,
     return_dict=True,
+    return_tensors="pt"
 ).to(DEVICE)
 
 outputs = model.generate(**inputs, max_new_tokens=1024)
 
 print(outputs)
 ```
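Note that `batch_decode` on the full `generate` output also decodes the prompt portion. If you only want the newly generated text, a small variation on the last lines, assuming the `inputs` dict produced by `apply_chat_template` above:

```python
# Variation: strip the prompt tokens before decoding (follows the snippet above).
generated_ids = model.generate(**inputs, max_new_tokens=1024)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```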
 
+ <details>
+ <summary>👉 Click to expand: Use flash-attn to boost performance and reduce memory usage</summary>
+
+ ```shell
+ # Ensure FlashAttention-2 is installed
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ from PIL import Image
+
+ # ---- Settings ----
+ model_path = "PaddlePaddle/PaddleOCR-VL"
+ image_path = "test.png"
+ task = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
+ # ------------------
+
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ ).to(dtype=torch.bfloat16, device=DEVICE).eval()
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ PROMPTS = {
+     "ocr": "OCR:",
+     "table": "Table Recognition:",
+     "chart": "Chart Recognition:",
+     "formula": "Formula Recognition:",
+ }
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": Image.open(image_path).convert("RGB")},
+             {"type": "text", "text": PROMPTS[task]},
+         ]
+     }
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt"
+ ).to(DEVICE)
+
+ with torch.inference_mode():
+     out = model.generate(
+         **inputs,
+         max_new_tokens=1024,
+         do_sample=False,
+         use_cache=True,
+     )
+
+ outputs = processor.batch_decode(out, skip_special_tokens=True)[0]
+ print(outputs)
+ ```
+
+ </details>
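If `flash-attn` may not be installed on every machine, one pattern (an editorial sketch, not part of the snippet above) is to request FlashAttention-2 only when the package is importable, so the same script still runs with the model's default attention:

```python
import importlib.util

import torch
from transformers import AutoModelForCausalLM

model_path = "PaddlePaddle/PaddleOCR-VL"

# Only request FlashAttention-2 when the flash_attn package is actually installed.
extra_kwargs = {}
if importlib.util.find_spec("flash_attn") is not None:
    extra_kwargs["attn_implementation"] = "flash_attention_2"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    **extra_kwargs,
).to("cuda" if torch.cuda.is_available() else "cpu").eval()
```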
 
 ## Performance
 
 ### Page-Level Document Parsing
 
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2510.14528},
 }
+ ```