surya-ocr-2-poneglyph-bbox

Surya OCR 2 fine-tuned for One Piece manga bubble text plus bounding boxes

This model reads a full manga page and emits one line per dialogue bubble:

Text content [x1,y1,x2,y2]

Coordinates are normalized to [0, 1000] on the resized page image.


Why Surya For BBox

The upstream Surya OCR 2 card documents three output paths that carry bounding boxes:

  • OCR output includes per-block polygon, axis-aligned bbox, confidence, and reading order.
  • surya_detect returns text-line bboxes and polygons.
  • surya_layout returns layout boxes, labels, reading order, and bbox values.

This fine-tune uses the Hugging Face image-text-to-text Surya OCR 2 model and teaches the generated text stream to match the existing Poneglyph bbox contract.


Benchmark: Surya vs LightOn BBox Poneglyph

| Metric | Surya OCR 2 fine-tuned | LightOn bbox Poneglyph | Winner |
| --- | --- | --- | --- |
| CER | 2.62% | 0.64% | LightOn |
| WER | 4.70% | 1.80% | LightOn |
| Mean IoU | 92.03% | 73.55% | Surya |
| Median IoU | 93.65% | 74.43% | Surya |
| F1 @ IoU=0.5 | 95.92% | 77.71% | Surya |
| Precision @ 0.5 | 95.96% | 77.31% | Surya |
| Recall @ 0.5 | 96.60% | 78.68% | Surya |
| Detection Rate | 97.57% | 98.85% | LightOn |
| Combined Score | 0.959 | 0.877 | Surya |
| Avg Inference | 9.38s/page | 4.62s/page | LightOn |
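
CER and WER are conventionally defined as edit distance divided by reference length, counted over characters and over whitespace-separated words respectively. The sketch below follows that convention. It is an assumption that the benchmark uses the same definition, and any text normalization it applies (casing, punctuation, whitespace) is not stated here. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so they are not strictly percentages of "wrong" characters.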

Surya Fine-Tuned Snapshot

| Metric | Score |
| --- | --- |
| CER | 2.62% |
| WER | 4.70% |
| Mean IoU | 92.03% |
| Median IoU | 93.65% |
| F1 @ IoU=0.3 | 96.21% |
| F1 @ IoU=0.5 | 95.92% |
| F1 @ IoU=0.75 | 93.57% |
| Detection Rate | 97.57% |
| Combined Score | 0.959 |
| Avg Inference | 9.38s/page |
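
To make the IoU-based rows concrete, here is a minimal sketch of box IoU and F1 at a given IoU threshold. The matching strategy is an assumption: this version pairs predicted and reference boxes greedily, one to one, in descending IoU order. The benchmark script may match or aggregate differently. For instance, the reported F1 @ 0.5 sits slightly below both precision and recall, which would be consistent with per-page averaging, though that is only a guess.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(predicted, reference, threshold=0.5):
    """Greedy one-to-one matching (assumed), then precision, recall, F1."""
    pairs = sorted(
        ((iou(p, r), pi, ri)
         for pi, p in enumerate(predicted)
         for ri, r in enumerate(reference)),
        reverse=True,
    )
    used_pred, used_ref, true_pos = set(), set(), 0
    for score, pi, ri in pairs:
        if score < threshold:
            break  # remaining pairs are all below the threshold
        if pi in used_pred or ri in used_ref:
            continue
        used_pred.add(pi)
        used_ref.add(ri)
        true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Calling `f1_at_iou` with thresholds 0.3, 0.5, and 0.75 gives the three F1 variants listed in the table.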

Combined score:

0.4 * (1 - CER) + 0.3 * F1@0.5 + 0.2 * MeanIoU + 0.1 * DetectionRate

Dataset

Source data comes from the Poneglyph Supabase bulles table, filtered to validated annotations, grouped at page level, and split by id_page to prevent page leakage.

| Split | Pages | Bubbles |
| --- | --- | --- |
| train | 599 | 5415 |
| val | 128 | 1201 |
| test | 129 | 1141 |
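
Splitting by id_page means whole pages, not individual bubbles, are assigned to a split. The sketch below illustrates the idea. Only the `id_page` key comes from the description above. The 70/15/15 ratios are inferred from the page counts in the table (856 pages in total), and the seed, the shuffling, and the function name are assumptions rather than the actual export code.

```python
import random
from collections import defaultdict


def split_by_page(bubbles, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign whole pages to train/val/test so no page spans two splits."""
    pages = defaultdict(list)
    for bubble in bubbles:
        pages[bubble["id_page"]].append(bubble)

    page_ids = sorted(pages)
    random.Random(seed).shuffle(page_ids)  # deterministic for a fixed seed

    n_train = round(len(page_ids) * ratios[0])
    n_val = round(len(page_ids) * ratios[1])
    groups = {
        "train": page_ids[:n_train],
        "val": page_ids[n_train:n_train + n_val],
        "test": page_ids[n_train + n_val:],  # remainder goes to test
    }
    return {name: [b for pid in ids for b in pages[pid]]
            for name, ids in groups.items()}
```

Because every bubble of a page lands in the same split, the model is never evaluated on a page it saw partially during training.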

Preprocessing:

  • Full page image resized to 1540px longest side.
  • JPEG quality 95.
  • Bubble boxes normalized to [0, 1000].
  • Target order follows the stored manga reading order.
  • Target text uses one strict line per bubble.
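
The geometry and target-formatting steps above can be sketched in plain Python. This is an illustration under assumptions, not the export script: the `text` and `bbox` keys and the function names are hypothetical, collapsing internal newlines is one guess at how "one strict line per bubble" is enforced, and the resize never upscales, mirroring the `thumbnail` call in the usage snippet.

```python
def resized_size(width, height, longest_side=1540):
    """Size after shrinking so the longest side is at most `longest_side`."""
    scale = min(1.0, longest_side / max(width, height))  # assumed: no upscaling
    return round(width * scale), round(height * scale)


def normalize_bbox(box_px, width, height, scale=1000):
    """Map a pixel box [x1, y1, x2, y2] to integers in [0, scale]."""
    x1, y1, x2, y2 = box_px
    fractions = [x1 / width, y1 / height, x2 / width, y2 / height]
    return [min(scale, max(0, round(v * scale))) for v in fractions]


def build_target(bubbles, width, height):
    """One `Text [x1,y1,x2,y2]` line per bubble, in the order given."""
    lines = []
    for bubble in bubbles:
        box = normalize_bbox(bubble["bbox"], width, height)
        text = " ".join(bubble["text"].split())  # assumed: collapse newlines
        lines.append(f"{text} [{','.join(map(str, box))}]")
    return "\n".join(lines)
```

Normalization divides by the image size, so the result is the same before and after resizing, provided the box and the dimensions are expressed in the same pixel space.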

How To Use

pip install torch pillow transformers accelerate

import re
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Remidesbois/surya-ocr-2-poneglyph-bbox"
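# The prompt is in French, like the annotations. In English: "Extract the bubble
# text from this manga page in Japanese reading order, with their bboxes
# normalized between 0 and 1000. Strict format: Text [x1,y1,x2,y2]."
PROMPT = "Extrais le texte des bulles de cette page de manga dans l'ordre de lecture japonais, avec leurs bbox normalisees entre 0 et 1000. Format strict: Texte [x1,y1,x2,y2]."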

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()

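image = Image.open("page.jpg").convert("RGB")
# Shrink in place so the longest side is at most 1540 px, as in preprocessing.
# thumbnail() keeps the aspect ratio and never enlarges a smaller image.
image.thumbnail((1540, 1540), Image.Resampling.LANCZOS)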

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "page.jpg"},
            {"type": "text", "text": PROMPT},
        ],
    }
]

prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
inputs = {
    k: v.to(model.device, dtype=torch.bfloat16) if v.is_floating_point() else v.to(model.device)
    for k, v in inputs.items()
}

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

generated = output_ids[0, inputs["input_ids"].shape[1]:]
text = processor.decode(generated, skip_special_tokens=True).strip()
print(text)

# Parse lines of the form `Texte [x1,y1,x2,y2]`; lines that do not match are skipped.
pattern = re.compile(r"(.+?)\s*\[(\d+),(\d+),(\d+),(\d+)\]")
bubbles = [
    {"text": m.group(1).strip(), "bbox": [int(m.group(i)) for i in range(2, 6)]}
    for line in text.splitlines()
    if (m := pattern.match(line.strip()))
]
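
The parsed boxes are still in the [0, 1000] range. A small helper, sketched here with a hypothetical name, maps them to pixel coordinates for cropping or drawing. The page size in the example is made up.

```python
def to_pixels(bbox, width, height, scale=1000):
    """Convert a [0, scale]-normalized box to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [
        round(x1 / scale * width),
        round(y1 / scale * height),
        round(x2 / scale * width),
        round(y2 / scale * height),
    ]


# Example with a hypothetical 1080x1540 page.
print(to_pixels([100, 200, 500, 1000], 1080, 1540))  # prints: [108, 308, 540, 1540]
```

Because the coordinates are relative, passing the width and height of the original full-resolution page instead of the resized one yields boxes in the original pixel space.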

Training

The training package used for this model lives in:

docker_scripts/finetune_surya_ocr_bbox

Pipeline:

python run_pipeline.py --dry-run --check-remote
python run_pipeline.py

The run exports the dataset, fine-tunes Surya OCR 2 with LoRA/DoRA, benchmarks the held-out test split, benchmarks Remidesbois/LightonOCR-2-1b-poneglyph-bbox on the same pages, writes this README, and uploads the final merged model when HF_TOKEN is available.


Limitations

  • Domain-specific: trained for One Piece manga pages.
  • Text language: French annotations.
  • Output is a generated text contract, so malformed lines are possible and should be parsed defensively.
  • The model returns normalized bbox coordinates, not pixel coordinates.
  • The LightOn comparison is only valid when both models are evaluated on the same exported test split.
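
One way to parse the output defensively is to validate each line and keep the rejects for inspection rather than dropping them silently. The checks below (anchored pattern, coordinates within [0, 1000], non-empty box) are a suggested sketch, not part of the model's contract, and `parse_bubbles` is a hypothetical name.

```python
import re

# Anchored at both ends so trailing junk after the bbox rejects the line.
LINE_RE = re.compile(
    r"^(?P<text>.+?)\s*\[(\d{1,4}),\s*(\d{1,4}),\s*(\d{1,4}),\s*(\d{1,4})\]\s*$"
)


def parse_bubbles(generated, scale=1000):
    """Return (valid_bubbles, rejected_lines) from raw model output."""
    valid, rejected = [], []
    for raw in generated.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if not m:
            rejected.append(line)  # truncated or malformed line
            continue
        x1, y1, x2, y2 = (int(m.group(i)) for i in range(2, 6))
        in_range = all(0 <= v <= scale for v in (x1, y1, x2, y2))
        if not in_range or x1 >= x2 or y1 >= y2:
            rejected.append(line)  # out-of-range or degenerate box
            continue
        valid.append({"text": m.group("text").strip(), "bbox": [x1, y1, x2, y2]})
    return valid, rejected
```

A non-empty `rejected` list is a useful signal that generation was truncated, for example by `max_new_tokens`, or that the page falls outside the training domain.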

Base Model

Fine-tuned from datalab-to/surya-ocr-2. The base model uses the Surya OCR 2 / Qwen3.5 image-text-to-text architecture.


Fine-tuned by Remidesbois.

Model size: 0.7B params, stored as Safetensors in BF16.