Tags: Image-Text-to-Text, Safetensors, Transformers, English, Chinese, multilingual, dots_ocr, text-generation, image-to-text, ocr, document-parse, layout, table, formula, custom_code, conversational
Adding `transformers` as a library, and also mentioning the `custom_code` tag (#29)
Commit 9becf2f563569f966f1825ef59e0f0a3b46e56c1
Co-authored-by: Aritra Roy Gosthipaty <ariG23498@users.noreply.huggingface.co>
README.md
CHANGED
@@ -9,6 +9,8 @@ tags:
 - layout
 - table
 - formula
+- transformers
+- custom_code
 language:
 - en
 - zh
@@ -49,6 +51,85 @@ dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
 4. **Efficient and Fast Performance:** Built upon a compact 1.7B LLM, **dots.ocr** provides faster inference speeds than many other high-performing models based on larger foundations.


+## Usage with transformers
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+from qwen_vl_utils import process_vision_info
+from dots_ocr.utils import dict_promptmode_to_prompt
+
+model_path = "./weights/DotsOCR"
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    attn_implementation="flash_attention_2",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True
+)
+processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+image_path = "demo/demo_image1.jpg"
+prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+
+1. Bbox format: [x1, y1, x2, y2]
+
+2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].
+
+3. Text Extraction & Formatting Rules:
+    - Picture: For the 'Picture' category, the text field should be omitted.
+    - Formula: Format its text as LaTeX.
+    - Table: Format its text as HTML.
+    - All Others (Text, Title, etc.): Format their text as Markdown.
+
+4. Constraints:
+    - The output text must be the original text from the image, with no translation.
+    - All layout elements must be sorted according to human reading order.
+
+5. Final Output: The entire output must be a single JSON object.
+"""
+
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": image_path
+            },
+            {"type": "text", "text": prompt}
+        ]
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+
+inputs = inputs.to("cuda")
+
+# Inference: Generation of the output
+generated_ids = model.generate(**inputs, max_new_tokens=24000)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+
 ### Performance Comparison: dots.ocr vs. Competing Models
 <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />

@@ -1231,4 +1312,4 @@ We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://g
 - **Performance Bottleneck:** Despite its 1.7B parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.

 We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
-We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [yanqing4@xiaohongshu.com].
+We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [yanqing4@xiaohongshu.com].
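The usage snippet added in this commit ends at `print(output_text)`, while its prompt asks the model for JSON containing a bbox, a category, and text per layout element. A minimal sketch of how that string might be post-processed follows. It assumes the output is a JSON array of objects with the keys `bbox`, `category`, and `text`; the diff shows no real model output, so these key names and the helper `parse_layout` are hypothetical and not part of the dots.ocr API.

```python
import json

# Category names copied from the prompt in the README snippet.
LAYOUT_CATEGORIES = {
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
}


def parse_layout(raw: str) -> list[dict]:
    """Parse one decoded model output into validated layout elements.

    ASSUMPTION: the output is a JSON array of objects with the keys
    "bbox", "category" and (except for pictures) "text". These key
    names are hypothetical; adjust them to the real output.
    """
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if isinstance(data, dict):
        # Tolerate a single element returned as a bare object.
        data = [data]
    if not isinstance(data, list):
        raise ValueError("expected a JSON array or object")

    elements = []
    for item in data:
        if not isinstance(item, dict):
            raise ValueError(f"expected an object, got: {item!r}")
        bbox = item.get("bbox")
        category = item.get("category")
        # The prompt specifies the bbox format [x1, y1, x2, y2].
        if not (isinstance(bbox, list) and len(bbox) == 4):
            raise ValueError(f"bad bbox: {bbox!r}")
        if not isinstance(category, str) or category not in LAYOUT_CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        elements.append({
            "bbox": [float(v) for v in bbox],
            "category": category,
            # The prompt says the text field is omitted for pictures.
            "text": item.get("text", ""),
        })
    return elements


if __name__ == "__main__":
    # Hand-written sample, not real model output.
    sample = (
        '[{"bbox": [10, 20, 300, 60], "category": "Title", "text": "# Demo"},'
        ' {"bbox": [10, 80, 300, 400], "category": "Picture"}]'
    )
    for element in parse_layout(sample):
        print(element["category"], element["bbox"], repr(element["text"]))
```

Because `processor.batch_decode` returns a list with one string per input, the helper would be called as `parse_layout(output_text[0])` for the single-image case in the snippet. Validation errors are raised rather than silently skipped, since a generation truncated at `max_new_tokens` would typically produce malformed JSON.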