surya-ocr-2-poneglyph-bbox
Surya OCR 2 fine-tuned for One Piece manga bubble text plus bounding boxes
This model reads a full manga page and emits one line per dialogue bubble:
Text content [x1,y1,x2,y2]
Coordinates are normalized to [0, 1000] on the resized page image.
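To draw or crop a bubble, the normalized values have to be mapped back to pixels. The following is a minimal sketch, assuming x values scale with image width and y values with image height; `to_pixel_bbox` is a hypothetical helper and not part of the model's code.

```python
def to_pixel_bbox(bbox, width, height, scale=1000):
    """Map a normalized [x1, y1, x2, y2] box to pixel coordinates.

    Assumption: x is normalized by image width and y by image height.
    """
    x1, y1, x2, y2 = bbox
    return [
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    ]


# Example page size is illustrative only.
print(to_pixel_bbox([100, 250, 500, 750], width=1080, height=1540))
```

Because the coordinates are relative, the same call works on the original scan as long as its aspect ratio matches the resized page.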
Why Surya For BBox
The upstream Surya OCR 2 card documents bbox-capable outputs in three relevant paths:
- OCR output includes per-block `polygon`, axis-aligned `bbox`, confidence, and reading order.
- `surya_detect` returns text-line bboxes and polygons.
- `surya_layout` returns layout boxes, labels, reading order, and bbox values.
This fine-tune uses the Hugging Face image-text-to-text Surya OCR 2 model and
teaches the generated text stream to match the existing Poneglyph bbox contract.
Benchmark: Surya vs LightOn BBox Poneglyph
| Metric | Surya OCR 2 fine-tuned | LightOn bbox Poneglyph | Winner |
|---|---|---|---|
| CER | 2.62% | 0.64% | LightOn |
| WER | 4.70% | 1.80% | LightOn |
| Mean IoU | 92.03% | 73.55% | Surya |
| Median IoU | 93.65% | 74.43% | Surya |
| F1 @ IoU=0.5 | 95.92% | 77.71% | Surya |
| Precision @ 0.5 | 95.96% | 77.31% | Surya |
| Recall @ 0.5 | 96.60% | 78.68% | Surya |
| Detection Rate | 97.57% | 98.85% | LightOn |
| Combined Score | 0.959 | 0.877 | Surya |
| Avg Inference | 9.38s/page | 4.62s/page | LightOn |
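The IoU-based rows compare predicted and reference boxes by intersection over union. The sketch below shows the standard IoU computation for two axis-aligned boxes. How the benchmark pairs predictions with references is not described in this card, so the helper name and the example boxes are illustrative only.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield no intersection.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


pred = [100, 100, 300, 300]
ref = [200, 100, 400, 300]
print(iou(pred, ref))  # overlap of 100x200 over a union of 300x200
```

A prediction counts as a hit at threshold 0.5 when its IoU with the matched reference box is at least 0.5.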
Surya Fine-Tuned Snapshot
| Metric | Score |
|---|---|
| CER | 2.62% |
| WER | 4.70% |
| Mean IoU | 92.03% |
| Median IoU | 93.65% |
| F1 @ IoU=0.3 | 96.21% |
| F1 @ IoU=0.5 | 95.92% |
| F1 @ IoU=0.75 | 93.57% |
| Detection Rate | 97.57% |
| Combined Score | 0.959 |
| Avg Inference | 9.38s/page |
Combined score:
0.4 * (1 - CER) + 0.3 * F1@0.5 + 0.2 * MeanIoU + 0.1 * DetectionRate
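As a quick check, the formula can be evaluated on the table values. This is a small sketch with all metrics expressed as fractions; the function name is made up for illustration.

```python
def combined_score(cer, f1_at_05, mean_iou, detection_rate):
    """Weighted score from the formula above; inputs are fractions in [0, 1]."""
    return (
        0.4 * (1 - cer)
        + 0.3 * f1_at_05
        + 0.2 * mean_iou
        + 0.1 * detection_rate
    )


surya = combined_score(0.0262, 0.9592, 0.9203, 0.9757)
lighton = combined_score(0.0064, 0.7771, 0.7355, 0.9885)
print(f"{surya:.3f} {lighton:.3f}")  # 0.959 0.877
```

Both results agree with the Combined Score row of the benchmark table. Since 40% of the weight sits on text accuracy and 50% on localization (F1 and mean IoU), the large IoU gap outweighs the CER gap.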
Dataset
Source data comes from the Poneglyph Supabase bulles table, filtered to
validated annotations, grouped at page level, and split by id_page to prevent
page leakage.
| Split | Pages | Bubbles |
|---|---|---|
| train | 599 | 5415 |
| val | 128 | 1201 |
| test | 129 | 1141 |
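Splitting by page rather than by bubble keeps every bubble of a page in the same split. The sketch below illustrates that idea, assuming rows are dicts with an `id_page` key. The 70/15/15 ratios (roughly what the table suggests), the seed, and the helper name are assumptions, not the pipeline's actual settings.

```python
import random
from collections import defaultdict


def split_by_page(rows, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign whole pages to train/val/test so no page spans two splits."""
    pages = defaultdict(list)
    for row in rows:
        pages[row["id_page"]].append(row)

    # Shuffle page ids, not bubbles, so the unit of assignment is the page.
    page_ids = sorted(pages)
    random.Random(seed).shuffle(page_ids)

    n_train = int(len(page_ids) * ratios[0])
    n_val = int(len(page_ids) * ratios[1])
    groups = {
        "train": page_ids[:n_train],
        "val": page_ids[n_train:n_train + n_val],
        "test": page_ids[n_train + n_val:],
    }
    return {
        name: [row for pid in ids for row in pages[pid]]
        for name, ids in groups.items()
    }


# Toy data: 20 pages with 3 bubbles each.
rows = [{"id_page": p, "text": f"bubble {i}"} for p in range(20) for i in range(3)]
splits = split_by_page(rows)
print({name: len(items) for name, items in splits.items()})
```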
Preprocessing:
- Full page image resized to 1540px longest side.
- JPEG quality 95.
- Bubble boxes normalized to `[0, 1000]`.
- Target order follows the stored manga reading order.
- Target text uses one strict line per bubble.
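The preprocessing steps above can be sketched in plain Python. This is an illustration under assumptions, not the export script: the helper names are hypothetical, rounding and clamping are guesses, and collapsing newlines inside a bubble is one possible reading of "one strict line per bubble".

```python
def resized_size(width, height, longest=1540):
    """Size after scaling so the longest side equals `longest` (aspect ratio kept)."""
    scale = longest / max(width, height)
    return round(width * scale), round(height * scale)


def normalize_bbox(bbox, width, height, scale=1000):
    """Map a pixel [x1, y1, x2, y2] box to integers in [0, scale]."""
    x1, y1, x2, y2 = bbox
    values = (x1 / width, y1 / height, x2 / width, y2 / height)
    # Clamp so annotation boxes slightly outside the page stay in range.
    return [min(scale, max(0, round(v * scale))) for v in values]


def target_line(text, bbox):
    """Format one bubble as text followed by its bracketed box."""
    flat = " ".join(text.split())  # assumption: newlines in a bubble become spaces
    return f"{flat} [{','.join(str(v) for v in bbox)}]"


print(resized_size(2160, 3080))
box = normalize_bbox([216, 770, 1080, 1540], width=2160, height=3080)
print(target_line("Je serai le roi\ndes pirates !", box))
```

Because normalization divides by the page dimensions, the normalized boxes are the same before and after an aspect-preserving resize.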
How To Use
```bash
pip install torch pillow transformers accelerate
```

```python
import re

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Remidesbois/surya-ocr-2-poneglyph-bbox"
PROMPT = "Extrais le texte des bulles de cette page de manga dans l'ordre de lecture japonais, avec leurs bbox normalisees entre 0 et 1000. Format strict: Texte [x1,y1,x2,y2]."

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()

# Load the page and shrink it in place so the longest side is at most 1540px.
image = Image.open("page.jpg").convert("RGB")
image.thumbnail((1540, 1540), Image.Resampling.LANCZOS)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "page.jpg"},
            {"type": "text", "text": PROMPT},
        ],
    }
]

# Render the chat template to a string; the resized image is passed separately below.
prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")

# Move tensors to the model device; only floating-point tensors are cast to bfloat16.
inputs = {
    k: v.to(model.device, dtype=torch.bfloat16) if v.is_floating_point() else v.to(model.device)
    for k, v in inputs.items()
}

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Drop the prompt tokens and decode only the newly generated part.
generated = output_ids[0, inputs["input_ids"].shape[1]:]
text = processor.decode(generated, skip_special_tokens=True).strip()
print(text)

# Parse each "Texte [x1,y1,x2,y2]" line; lines that do not match are skipped.
pattern = re.compile(r"(.+?)\s*\[(\d+),(\d+),(\d+),(\d+)\]")
bubbles = [
    {"text": m.group(1).strip(), "bbox": [int(m.group(i)) for i in range(2, 6)]}
    for line in text.splitlines()
    if (m := pattern.match(line.strip()))
]
```
Training
The training package used for this model lives in `docker_scripts/finetune_surya_ocr_bbox`.
Pipeline:
```bash
python run_pipeline.py --dry-run --check-remote
python run_pipeline.py
```
The run exports the dataset, fine-tunes Surya OCR 2 with LoRA/DoRA, benchmarks
the held-out test split, benchmarks Remidesbois/LightonOCR-2-1b-poneglyph-bbox on the same pages, writes
this README, and uploads the final merged model when HF_TOKEN is available.
Limitations
- Domain-specific: trained for One Piece manga pages.
- Text language: French annotations.
- Output is a generated text contract, so malformed lines are possible and should be parsed defensively.
- The model returns normalized bbox coordinates, not pixel coordinates.
- The LightOn comparison is only valid when both models are evaluated on the same exported test split.
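Since malformed lines are possible, a parser can validate each line instead of trusting it. The following sketch is one way to do that, assuming the `Text [x1,y1,x2,y2]` contract and the `[0, 1000]` range described in this card; `parse_bubbles` and the sample string are hypothetical.

```python
import re

# Anchored on both ends so trailing garbage after the box rejects the line.
LINE_RE = re.compile(r"^(.+?)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\s*$")


def parse_bubbles(output, scale=1000):
    """Split model output into valid bubbles and rejected lines."""
    bubbles, rejected = [], []
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if not m:
            rejected.append(line)
            continue
        x1, y1, x2, y2 = (int(g) for g in m.groups()[1:])
        in_range = all(0 <= v <= scale for v in (x1, y1, x2, y2))
        # Keep only boxes inside the range with positive width and height.
        if in_range and x1 < x2 and y1 < y2:
            bubbles.append({"text": m.group(1).strip(), "bbox": [x1, y1, x2, y2]})
        else:
            rejected.append(line)
    return bubbles, rejected


sample = "Bonjour ! [120,80,340,210]\nligne tronquee [12,40\nHors cadre [900,10,1200,90]"
bubbles, rejected = parse_bubbles(sample)
print(len(bubbles), len(rejected))
```

Keeping the rejected lines makes it possible to log them or count them as a quality signal rather than dropping them silently.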
Base Model
Fine-tuned from datalab-to/surya-ocr-2.
The base model uses the Surya OCR 2 / Qwen3.5 image-text-to-text architecture.
Fine-tuned by Remidesbois.