hipfire-dots.ocr
License: MIT, under the upstream dots.ocr License Agreement (copied verbatim from rednote-hilab/dots.ocr). The Agreement sits on top of the MIT license; where the two conflict, MIT prevails (Agreement Β§1.6). This is a quantized derivative of a model created and released by rednote / Xiaohongshu (Xingyin Information Technology (Shanghai) Co., Ltd.); all upstream attribution and license terms apply. Built with dots.ocr.
hipfire-native Q8 quantization of rednote-hilab/dots.ocr β a multilingual document layout-parsing vision-language model that unifies layout detection and content recognition in a single VLM while preserving reading order. Despite a compact 1.7B-parameter LLM foundation, dots.ocr is state-of-the-art on OmniDocBench.
The .hfq file runs with the hipfire
inference engine β a Rust + HIP/ROCm-direct runtime for AMD RDNA GPUs with no
Python in the hot path (Ollama-style UX). It is hipfire's HFQ container
format (HFQM-magic manifest): not GGUF or safetensors, and it will not
load in llama.cpp or transformers. All weights are stored at Q8 and
dequantized to f32 on the fly inside the GEMV kernels.
dots.ocr is a first-class supported architecture in hipfire's registry
("Qwen2-VL-family layout-extraction VLM β image β structured OCR"; bring-your-own
via hipfire quantize).
Files
| File | Quant | Size | sha256 |
|---|---|---|---|
dots-ocr.q8.hfq |
Q8 (all tensors) | 4.42 GB | eec256b1β¦b6f5b268 |
HFQ header β embedded architecture / config
The HFQM manifest carries the full dots.ocr config, tokenizer, and generation
config verbatim from the upstream checkpoint:
- architecture:
dots_ocr(DotsOCRForCausalLM) β Qwen2-VL family - LLM: 28 layers, hidden 1536, 12 attention heads / 2 KV heads (GQA), intermediate 8960, vocab 151936, RoPE ΞΈ = 1e6, max position 131072, SwiGLU
- Vision tower: 42 layers, embed_dim 1536, 12 heads, patch 14, spatial-merge 2, 3-channel; optimal under ~11.3M px
- generation:
eos_token_id=[151643, 151673],max_length32768 - source dtype: bfloat16 β re-quantized to Q8
About dots.ocr
dots.ocr is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
- Powerful Performance: dots.ocr achieves SOTA for text, tables, and reading order on OmniDocBench, while delivering formula recognition comparable to much larger models like Doubao-1.5 and Gemini 2.5-Pro.
- Multilingual Support: robust parsing for low-resource languages, with decisive advantages across both layout detection and content recognition on dots.ocr's in-house 100-language benchmark.
- Unified and Simple Architecture: a single VLM is far more streamlined than conventional multi-model pipelines. Switching tasks is just changing the input prompt β competitive with dedicated detectors like DocLayout-YOLO.
- Efficient and Fast: built on a compact 1.7B LLM, it is faster than larger foundation-based parsers.
Performance highlights (upstream dots.ocr, bfloat16)
These are the upstream model's published numbers. See the original card and the dots.ocr repo for the full benchmark tables (OmniDocBench, dots.ocr-bench, olmOCR-bench, layout detection).
OmniDocBench β end-to-end (Edit distance β is better; TEDS β is better). Lower is better except Table TEDS.
| Method | Overall Edit β (EN / ZH) | Text Edit β (EN / ZH) | Table TEDS β (EN / ZH) | Read Order Edit β (EN / ZH) |
|---|---|---|---|---|
| MinerU 2 | 0.139 / 0.240 | 0.047 / 0.109 | 82.5 / 79.0 | 0.069 / 0.118 |
| MonkeyOCR-pro-3B | 0.138 / 0.206 | 0.067 / 0.107 | 81.5 / 87.5 | 0.100 / 0.185 |
| Mistral OCR | 0.268 / 0.439 | 0.072 / 0.325 | 75.8 / 63.6 | 0.083 / 0.284 |
| Gemini 2.5-Pro | 0.148 / 0.212 | 0.055 / 0.168 | 85.8 / 86.4 | 0.049 / 0.121 |
| doubao-1.5-thinking-vision-pro | 0.140 / 0.162 | 0.043 / 0.085 | 83.3 / 89.3 | 0.058 / 0.094 |
| dots.ocr | 0.125 / 0.160 | 0.032 / 0.066 | 88.6 / 89.0 | 0.040 / 0.067 |
dots.ocr also leads olmOCR-bench overall (79.1 Β± 1.0, best on Tables 88.3, Multi-column 82.4, Base 99.5) and dots.ocr-bench (100 languages, Overall Edit 0.177), and its layout-detection-only mode reaches F1@IoU .50 = 0.930 overall.
Usage (hipfire)
# install hipfire (Linux + ROCm 6+)
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash
# download this Q8 file directly
hf download hipfire-models/hipfire-dots.ocr dots-ocr.q8.hfq --local-dir ~/.hipfire/models
# serve it (OpenAI-compatible HTTP API on 0.0.0.0:11435)
hipfire serve --model ~/.hipfire/models/dots-ocr.q8.hfq
# or one-shot an image
hipfire run --model dots-ocr.q8.hfq --image page.png --prompt "$(cat prompt.txt)"
Layout-parsing prompt
dots.ocr is prompt-driven β the same VLM does full parse, detection-only, text-only, or grounding-OCR depending on the input prompt. The default "parse all" prompt (see the upstream prompts.py) asks for one JSON object with each layout element's bbox, category, and text:
Please output the layout information from the PDF image, including each layout
element's bbox, its category, and the corresponding text content within the bbox.
1. Bbox format: [x1, y1, x2, y2]
2. Layout Categories: ['Caption','Footnote','Formula','List-item','Page-footer',
'Page-header','Picture','Section-header','Table','Text','Title'].
3. Text Extraction & Formatting Rules:
- Picture: omit the text field.
- Formula: format its text as LaTeX.
- Table: format its text as HTML.
- All Others (Text, Title, etc.): format their text as Markdown.
4. Constraints:
- The output text must be the original text from the image, with no translation.
- All layout elements must be sorted according to human reading order.
5. Final Output: The entire output must be a single JSON object.
Switch tasks by swapping the prompt: prompt_layout_only_en (detection only),
prompt_ocr (text only, skipping headers/footers), or prompt_grounding_ocr
(parse a single bbox).
Quantization & validation
Quantized from the upstream bfloat16 safetensors to Q8 (8-bit per-tensor)
through hipfire's hipfire quantize path; weights are stored at Q8 and
dequantized to f32 inside the GEMV kernels β the math is f32, the bit-width is a
storage/bandwidth dial. The upstream tokenizer, chat_template, config, and
generation config are embedded in the HFQM manifest.
architecture dots_ocr in the hipfire HFQ header.
Limitations (from the upstream card)
- Complex elements: not yet perfect on high-complexity tables and formula extraction; pictures in documents are currently not parsed.
- Parsing failures: may fail when the character-to-pixel ratio is very high
(enlarge the image, or raise PDF DPI β 200 is recommended; the model is optimal
under ~11.3M px). Runs of special characters (
...,___) can cause output to repeat β in that case switch toprompt_layout_only_en,prompt_ocr, orprompt_grounding_ocr. - Throughput: despite the 1.7B LLM, not yet optimized for high-throughput bulk-PDF processing.
Citation
If you use this work, please cite the original dots.ocr model and team:
@misc{dotsocr,
title = {dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model},
author = {rednote-hilab},
year = {2025},
url = {https://huggingface.co/rednote-hilab/dots.ocr}
}
License & attribution
This repository distributes a quantized Derivative Work of rednote-hilab/dots.ocr. The model is Β© Xingyin Information Technology (Shanghai) Co., Ltd. and is licensed under the dots.ocr License Agreement (MIT-based; copied verbatim into this repo from the upstream repository).
Modification notice: the upstream bfloat16 safetensors weights were re-quantized into hipfire's Q8 / HFQ container format, and the upstream tokenizer / chat template / config were embedded into the HFQ metadata. No other changes to the model. Per the Agreement Β§7, redistribution includes a copy of the Agreement and retains all attribution notices; modified-weight releases should display "Built with dots.ocr."
hipfire itself is dual-licensed MIT / Apache-2.0 β see the hipfire repo.
Model tree for hipfire-models/hipfire-dots.ocr
Base model
rednote-hilab/dots.ocr