hipfire-dots.ocr

License: MIT, under the upstream dots.ocr License Agreement (copied verbatim from rednote-hilab/dots.ocr). The Agreement sits on top of the MIT license; where the two conflict, MIT prevails (Agreement §1.6). This is a quantized derivative of a model created and released by rednote / Xiaohongshu (Xingyin Information Technology (Shanghai) Co., Ltd.); all upstream attribution and license terms apply. Built with dots.ocr.

hipfire-native Q8 quantization of rednote-hilab/dots.ocr — a multilingual document layout-parsing vision-language model that unifies layout detection and content recognition in a single VLM while preserving reading order. Despite a compact 1.7B-parameter LLM foundation, dots.ocr is state-of-the-art on OmniDocBench.

The .hfq file runs with the hipfire inference engine — a Rust + HIP/ROCm-direct runtime for AMD RDNA GPUs with no Python in the hot path (Ollama-style UX). It is hipfire's HFQ container format (HFQM-magic manifest): not GGUF or safetensors, and it will not load in llama.cpp or transformers. All weights are stored at Q8 and dequantized to f32 on the fly inside the GEMV kernels.

dots.ocr is a first-class supported architecture in hipfire's registry ("Qwen2-VL-family layout-extraction VLM — image → structured OCR"; bring-your-own via hipfire quantize).

Files

File	Quant	Size	sha256
`dots-ocr.q8.hfq`	Q8 (all tensors)	4.42 GB	`eec256b1…b6f5b268`

HFQ header — embedded architecture / config

The HFQM manifest carries the full dots.ocr config, tokenizer, and generation config verbatim from the upstream checkpoint:

architecture: dots_ocr (DotsOCRForCausalLM) — Qwen2-VL family
LLM: 28 layers, hidden 1536, 12 attention heads / 2 KV heads (GQA), intermediate 8960, vocab 151936, RoPE θ = 1e6, max position 131072, SwiGLU
Vision tower: 42 layers, embed_dim 1536, 12 heads, patch 14, spatial-merge 2, 3-channel; optimal under ~11.3M px
generation: eos_token_id = [151643, 151673], max_length 32768
source dtype: bfloat16 → re-quantized to Q8

About dots.ocr

dots.ocr is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.

Powerful Performance: dots.ocr achieves SOTA for text, tables, and reading order on OmniDocBench, while delivering formula recognition comparable to much larger models like Doubao-1.5 and Gemini 2.5-Pro.
Multilingual Support: robust parsing for low-resource languages, with decisive advantages across both layout detection and content recognition on dots.ocr's in-house 100-language benchmark.
Unified and Simple Architecture: a single VLM is far more streamlined than conventional multi-model pipelines. Switching tasks is just changing the input prompt — competitive with dedicated detectors like DocLayout-YOLO.
Efficient and Fast: built on a compact 1.7B LLM, it is faster than larger foundation-based parsers.

Performance highlights (upstream dots.ocr, bfloat16)

These are the upstream model's published numbers. See the original card and the dots.ocr repo for the full benchmark tables (OmniDocBench, dots.ocr-bench, olmOCR-bench, layout detection).

OmniDocBench — end-to-end (Edit distance ↓ is better; TEDS ↑ is better). Lower is better except Table TEDS.

Method	Overall Edit ↓ (EN / ZH)	Text Edit ↓ (EN / ZH)	Table TEDS ↑ (EN / ZH)	Read Order Edit ↓ (EN / ZH)
MinerU 2	0.139 / 0.240	0.047 / 0.109	82.5 / 79.0	0.069 / 0.118
MonkeyOCR-pro-3B	0.138 / 0.206	0.067 / 0.107	81.5 / 87.5	0.100 / 0.185
Mistral OCR	0.268 / 0.439	0.072 / 0.325	75.8 / 63.6	0.083 / 0.284
Gemini 2.5-Pro	0.148 / 0.212	0.055 / 0.168	85.8 / 86.4	0.049 / 0.121
doubao-1.5-thinking-vision-pro	0.140 / 0.162	0.043 / 0.085	83.3 / 89.3	0.058 / 0.094
dots.ocr	0.125 / 0.160	0.032 / 0.066	88.6 / 89.0	0.040 / 0.067

dots.ocr also leads olmOCR-bench overall (79.1 ± 1.0, best on Tables 88.3, Multi-column 82.4, Base 99.5) and dots.ocr-bench (100 languages, Overall Edit 0.177), and its layout-detection-only mode reaches F1@IoU .50 = 0.930 overall.

Usage (hipfire)

# install hipfire (Linux + ROCm 6+)
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# download this Q8 file directly
hf download hipfire-models/hipfire-dots.ocr dots-ocr.q8.hfq --local-dir ~/.hipfire/models

# serve it (OpenAI-compatible HTTP API on 0.0.0.0:11435)
hipfire serve --model ~/.hipfire/models/dots-ocr.q8.hfq

# or one-shot an image
hipfire run --model dots-ocr.q8.hfq --image page.png --prompt "$(cat prompt.txt)"

Layout-parsing prompt

dots.ocr is prompt-driven — the same VLM does full parse, detection-only, text-only, or grounding-OCR depending on the input prompt. The default "parse all" prompt (see the upstream prompts.py) asks for one JSON object with each layout element's bbox, category, and text:

Please output the layout information from the PDF image, including each layout
element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]
2. Layout Categories: ['Caption','Footnote','Formula','List-item','Page-footer',
   'Page-header','Picture','Section-header','Table','Text','Title'].
3. Text Extraction & Formatting Rules:
   - Picture: omit the text field.
   - Formula: format its text as LaTeX.
   - Table: format its text as HTML.
   - All Others (Text, Title, etc.): format their text as Markdown.
4. Constraints:
   - The output text must be the original text from the image, with no translation.
   - All layout elements must be sorted according to human reading order.
5. Final Output: The entire output must be a single JSON object.

Switch tasks by swapping the prompt: prompt_layout_only_en (detection only), prompt_ocr (text only, skipping headers/footers), or prompt_grounding_ocr (parse a single bbox).

Quantization & validation

Quantized from the upstream bfloat16 safetensors to Q8 (8-bit per-tensor) through hipfire's hipfire quantize path; weights are stored at Q8 and dequantized to f32 inside the GEMV kernels — the math is f32, the bit-width is a storage/bandwidth dial. The upstream tokenizer, chat_template, config, and generation config are embedded in the HFQM manifest.

architecture dots_ocr in the hipfire HFQ header.

Limitations (from the upstream card)

Complex elements: not yet perfect on high-complexity tables and formula extraction; pictures in documents are currently not parsed.
Parsing failures: may fail when the character-to-pixel ratio is very high (enlarge the image, or raise PDF DPI — 200 is recommended; the model is optimal under ~11.3M px). Runs of special characters (..., ___) can cause output to repeat — in that case switch to prompt_layout_only_en, prompt_ocr, or prompt_grounding_ocr.
Throughput: despite the 1.7B LLM, not yet optimized for high-throughput bulk-PDF processing.

Citation

If you use this work, please cite the original dots.ocr model and team:

@misc{dotsocr,
  title  = {dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model},
  author = {rednote-hilab},
  year   = {2025},
  url    = {https://huggingface.co/rednote-hilab/dots.ocr}
}

License & attribution

This repository distributes a quantized Derivative Work of rednote-hilab/dots.ocr. The model is © Xingyin Information Technology (Shanghai) Co., Ltd. and is licensed under the dots.ocr License Agreement (MIT-based; copied verbatim into this repo from the upstream repository).

Modification notice: the upstream bfloat16 safetensors weights were re-quantized into hipfire's Q8 / HFQ container format, and the upstream tokenizer / chat template / config were embedded into the HFQ metadata. No other changes to the model. Per the Agreement §7, redistribution includes a copy of the Agreement and retains all attribution notices; modified-weight releases should display "Built with dots.ocr."

hipfire itself is dual-licensed MIT / Apache-2.0 — see the hipfire repo.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for hipfire-models/hipfire-dots.ocr

Base model

rednote-hilab/dots.ocr

Quantized

(9)

this model