YOLO26s Manga Panel, Text, and Balloon Segmentation

This is a YOLO26s segmentation model for manga page layout analysis. It detects and segments three region types needed by manga OCR and translation pipelines:

  • panels / frames,
  • text regions,
  • speech or narration balloons.

The model is intended for manga document-understanding workflows where page regions must be located before OCR, reading-order reconstruction, translation, inpainting, or human review.

Model Details

Model Description

This model is an Ultralytics-compatible YOLO26s instance segmentation model trained on Manga109-derived segmentation data. It predicts bounding boxes, class IDs, confidence scores, and pixel masks for manga page regions.

  • Developed by: ShadowB / Abdelhadi Marjane
  • Model type: Image segmentation / instance segmentation
  • Architecture: YOLO26s segmentation model (yolo26s-seg.yaml)
  • Base checkpoint: yolo26s-seg.pt
  • Library: Ultralytics
  • Task: Manga region instance segmentation
  • Primary domain: Manga/comic page images
  • Languages: Japanese manga pages. The model detects page regions visually; it does not read or translate text.
  • License: MIT for this model repository. Dataset licenses and access rules may differ.
  • Number of classes: 3
  • Parameters: 11,436,269
  • Stride: 8, 16, 32
  • Checkpoint size: about 23.4 MB for best.pt
  • Training date recorded in checkpoint: 2026-04-29
  • Ultralytics version recorded in checkpoint: 8.4.43
  • Related project: this model is part of CuratorML (github:sadowb/CuratorML), a translation workspace that uses it for region detection.

Label Schema

The labels are stored in the model checkpoint and match the expected YOLO dataset names mapping:

| Class ID | Label | Description |
|---|---|---|
| 0 | frame | Manga page panel/frame regions, including bordered or visually separated panels |
| 1 | text | Visible text regions, usually the regions passed to OCR or translation post-processing |
| 2 | balloon | Speech balloons, thought bubbles, narration bubbles, or similar text containers |

Notes:

  • Background is not an explicit class.
  • text is the visual text region, not the OCR transcription.
  • balloon is the container region around dialogue or narration text.
  • frame is the panel/layout region, not necessarily a semantic scene label.
  • Keep this class order unchanged in data.yaml, inference code, and downstream post-processing.

Recommended dataset config:

names:
  0: frame
  1: text
  2: balloon
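
A small guard can catch class-order drift before it reaches downstream code. The sketch below is illustrative only: `EXPECTED_NAMES` mirrors the mapping above, and `check_class_order` is a hypothetical helper, not part of Ultralytics. In practice it could be called with the `names` mapping of a loaded model or with the `names` section parsed from data.yaml.

```python
EXPECTED_NAMES = {0: "frame", 1: "text", 2: "balloon"}


def check_class_order(names):
    """Return the normalized mapping, or raise ValueError if it differs from the schema."""
    # Normalize keys to int so a mapping loaded from YAML/JSON with string keys also passes.
    normalized = {int(key): value for key, value in names.items()}
    if normalized != EXPECTED_NAMES:
        raise ValueError(f"class mapping mismatch: {normalized} != {EXPECTED_NAMES}")
    return normalized


# Hypothetical usage with a plain dict.
check_class_order({"0": "frame", "1": "text", "2": "balloon"})
```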

Uses

Direct Use

Use this model to segment manga page regions from a page image. Direct outputs can be used to locate:

  • panels/frames,
  • text regions,
  • speech or narration balloons.

Downstream Use

This model is designed to be one component in a larger manga translation or document-understanding system. A typical downstream flow is:

  1. segment panels, balloons, and text regions,
  2. associate text regions with balloons and panels,
  3. run OCR on text regions,
  4. reconstruct reading order,
  5. translate text with surrounding visual/layout context,
  6. inpaint or clean original text,
  7. render translated text back into the page.
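
Step 2 of the flow above can be sketched with plain bounding-box overlap. This is a minimal illustration, not the method used by any particular pipeline: the helper names and the `min_coverage=0.5` threshold are assumptions, and a real system may prefer mask overlap over box overlap.

```python
def intersection_area(a, b):
    """Overlap area of two [x1, y1, x2, y2] boxes (0 if they do not intersect)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def box_area(box):
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def assign_text_to_balloons(regions, min_coverage=0.5):
    """Map each text region id to the balloon id that covers most of its box.

    Text covered by less than `min_coverage` of its own area maps to None,
    for example sound effects drawn outside any balloon.
    """
    balloons = [r for r in regions if r["label"] == "balloon"]
    assignment = {}
    for text in (r for r in regions if r["label"] == "text"):
        best_id, best_coverage = None, min_coverage
        area = box_area(text["bbox"])
        for balloon in balloons:
            if area == 0:
                break  # degenerate text box, leave it unassigned
            coverage = intersection_area(text["bbox"], balloon["bbox"]) / area
            if coverage >= best_coverage:
                best_id, best_coverage = balloon["id"], coverage
        assignment[text["id"]] = best_id
    return assignment


# Hypothetical regions in the structured format used by this card.
example_regions = [
    {"id": 1, "label": "balloon", "bbox": [0, 0, 100, 100]},
    {"id": 2, "label": "text", "bbox": [10, 10, 50, 50]},
    {"id": 3, "label": "text", "bbox": [200, 200, 240, 240]},
    {"id": 4, "label": "frame", "bbox": [0, 0, 300, 300]},
]
example_assignment = assign_text_to_balloons(example_regions)
```

The same containment test can be reused to attach balloons to frames before reading-order reconstruction.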

Example structured output expected by a downstream pipeline:

{
  "page": "example_page.jpg",
  "regions": [
    {
      "id": 1,
      "class_id": 0,
      "label": "frame",
      "confidence": 0.94,
      "bbox": [x1, y1, x2, y2],
      "mask": "..."
    },
    {
      "id": 2,
      "class_id": 2,
      "label": "balloon",
      "confidence": 0.91,
      "bbox": [x1, y1, x2, y2],
      "mask": "..."
    },
    {
      "id": 3,
      "class_id": 1,
      "label": "text",
      "confidence": 0.89,
      "bbox": [x1, y1, x2, y2],
      "mask": "..."
    }
  ]
}
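
One possible way to produce this structure from raw detections is sketched below. `build_page_record` is a hypothetical helper, the box coordinates are made-up values, and the mask field is left empty because its encoding (polygon, RLE, file path) is a pipeline decision that this card does not specify.

```python
import json

CLASS_NAMES = {0: "frame", 1: "text", 2: "balloon"}


def build_page_record(page, detections):
    """Convert (class_id, confidence, bbox) tuples into the region schema shown above."""
    regions = []
    for region_id, (class_id, confidence, bbox) in enumerate(detections, start=1):
        regions.append({
            "id": region_id,
            "class_id": int(class_id),
            "label": CLASS_NAMES.get(int(class_id), str(class_id)),
            "confidence": round(float(confidence), 2),
            "bbox": [float(value) for value in bbox],
            # Mask encoding is intentionally left to the downstream pipeline.
            "mask": None,
        })
    return {"page": page, "regions": regions}


# Hypothetical detections: (class_id, confidence, [x1, y1, x2, y2]).
record = build_page_record(
    "example_page.jpg",
    [(0, 0.94, [12, 20, 480, 350]), (2, 0.91, [300, 40, 420, 160])],
)
print(json.dumps(record, indent=2))
```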

Out-of-Scope Use

This model should not be treated as:

  • an OCR model,
  • a translation model,
  • a reading-order model by itself,
  • a general natural-image segmentation model,
  • a legal/copyright analysis tool,
  • a safety-critical segmentation system,
  • a perfect layout parser for every comic style.

It only segments visible page regions. It does not understand text content, speaker identity, story context, or translation quality.

How to Get Started with the Model

Install dependencies:

pip install ultralytics pillow opencv-python

Run inference:

from ultralytics import YOLO

# Replace with your local path or the Hugging Face model ID after upload.
model = YOLO("best.pt")

results = model.predict(
    source="example_manga_page.jpg",
    imgsz=1280,
    conf=0.25,
    iou=0.7,
    retina_masks=True,
)

class_names = {
    0: "frame",
    1: "text",
    2: "balloon",
}

for result in results:
    if result.boxes is None or len(result.boxes) == 0:
        print("No regions detected.")
        continue

    for i, box in enumerate(result.boxes):
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        label = class_names.get(class_id, str(class_id))

        print({
            "index": i,
            "class_id": class_id,
            "label": label,
            "confidence": confidence,
            "bbox": bbox,
        })

    # Saves an annotated image with boxes/masks.
    result.save(filename="segmented_output.jpg")

For high-quality mask extraction in a manga translation pipeline, use retina_masks=True during inference so masks are returned at the original image size rather than at the smaller inference size.

Training Details

Training Data

This model uses a merged Manga109-derived segmentation dataset with three region classes: frame, text, and balloon.

| Dataset | Hugging Face ID | Use | Notes |
|---|---|---|---|
| MangaSegmentation | MS92/MangaSegmentation | Segmentation annotations for manga regions | Dataset card references “Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset.” |
| Manga109 Region-Level Text Segmentation | ShadowB/Manga109_RegionLevelTextSegmentation | Region-level text masks | Used to support the text class and downstream OCR/translation needs. |

Dataset Composition

The provided split audit records a book-level split across 109 manga groups:

| Split | Books / Groups | Images |
|---|---|---|
| Train | 83 | 7,174 |
| Validation | 12 | 1,468 |
| Test | 14 | 1,488 |
| Total | 109 | 10,130 |

The book-level split is important because random page-level splits can overestimate performance by leaking manga-specific style, art, and layout patterns between train and validation data.
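
A book-level split can be sketched as follows. This is not the training script used for this model: the function name, the seed, and the random selection of books are assumptions, and only the default split sizes (12 validation and 14 test books) are taken from the table above.

```python
import random
from collections import defaultdict


def split_by_book(pages, val_books=12, test_books=14, seed=0):
    """Split (book_title, image_path) pairs so that no book appears in two splits."""
    by_book = defaultdict(list)
    for book, path in pages:
        by_book[book].append(path)

    # Shuffle book titles, not pages, so every page of a book lands in one split.
    books = sorted(by_book)
    random.Random(seed).shuffle(books)
    val = set(books[:val_books])
    test = set(books[val_books:val_books + test_books])

    splits = {"train": [], "val": [], "test": []}
    for book in books:
        key = "val" if book in val else "test" if book in test else "train"
        splits[key].extend(by_book[book])
    return splits


# Hypothetical input: 109 books with two pages each.
demo_pages = [(f"book{i:03d}", f"book{i:03d}/{p}.jpg") for i in range(109) for p in range(2)]
demo_splits = split_by_book(demo_pages)
```

A page-level shuffle would instead scatter pages of the same title across train and validation, which is the leakage the paragraph above describes.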

Preprocessing

The training data was normalized into a YOLO-compatible segmentation layout with the following class mapping:

0: frame
1: text
2: balloon

Known preprocessing goals:

  • merge Manga109-derived annotations into a common three-class schema,
  • preserve separate panel/frame, text-region, and balloon masks,
  • use a book-level split to better evaluate generalization across manga titles,
  • train in an Ultralytics segmentation format compatible with yolo segment train.

Training Procedure

The model was trained on Kaggle with Ultralytics YOLO26s segmentation. The training script builds a book-level train/validation/test split, maps the labels to three classes (frame, text, balloon), and keeps overlap_mask=False because manga regions can sit inside each other.
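
The reason nesting matters can be shown with a toy example. The sketch below does not call Ultralytics; it is a simplified stand-in for packing all instances of an image into one index map, under the assumption that larger instances are painted first and smaller ones on top. In that encoding each pixel can belong to only one instance, so a text region inside a balloon punches a hole in the balloon mask.

```python
def rect_mask(height, width, box):
    """Binary mask (list of rows) that is 1 inside the half-open box [x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [[1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(width)]
            for y in range(height)]


def mask_area(mask):
    return sum(sum(row) for row in mask)


def to_index_map(masks):
    """Paint masks into one map: pixel value is the 1-based instance index, 0 is background."""
    height, width = len(masks[0]), len(masks[0][0])
    index_map = [[0] * width for _ in range(height)]
    # Larger instances first, so smaller nested instances overwrite them.
    order = sorted(range(len(masks)), key=lambda i: -mask_area(masks[i]))
    for i in order:
        for y in range(height):
            for x in range(width):
                if masks[i][y][x]:
                    index_map[y][x] = i + 1
    return index_map


balloon = rect_mask(6, 6, (0, 0, 6, 6))  # 36 pixels
text = rect_mask(6, 6, (2, 2, 4, 4))     # 4 pixels, fully inside the balloon
index_map = to_index_map([balloon, text])

# Recover the balloon from the shared map: the text pixels are no longer part of it.
recovered_balloon = [[1 if value == 1 else 0 for value in row] for row in index_map]
print(mask_area(balloon), mask_area(recovered_balloon))  # 36 32
```

Keeping one mask per instance avoids this loss, which fits a schema where text sits inside balloons and balloons sit inside frames.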

Training started from the yolo26s-seg.pt checkpoint with image size 1280, a batch size of 8 across two GPUs, and the MuSGD optimizer. The run completed 41 epochs and took 11h 13m overall. The checkpoint stores an Ultralytics training time value of 11.0054 hours, which reflects the active training budget rather than the full notebook runtime.

Training Hyperparameters

Hyperparameter Value
Architecture YOLO26s segmentation
Base checkpoint yolo26s-seg.pt
Image size 1280
Batch size 8
Epochs completed 41
Overall run time 11h 13m
Optimizer MuSGD
Learning rate 0.01 initial (lr0), 0.01 final factor (lrf, so the final learning rate is lr0 × lrf)
Momentum 0.937
Weight decay 0.0005
Warmup epochs 3.0
Cosine LR True
AMP True
Device 0,1
Overlap mask False
Main augmentations mosaic 0.3, copy-paste 0.1, HSV-V 0.04, no flips/rotation
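
For readers who want to reuse these settings, the table can be written as Ultralytics training arguments. This is a sketch under stated assumptions rather than the original training script: "no flips/rotation" is interpreted here as `fliplr=0.0`, `flipud=0.0`, and `degrees=0.0`, the optimizer name is passed as written in the table, and settings not listed above (such as the epoch or time budget) are deliberately left out. The dataset path is a placeholder.

```python
# Training arguments transcribed from the hyperparameter table above.
train_args = dict(
    imgsz=1280,
    batch=8,
    optimizer="MuSGD",
    lr0=0.01,             # initial learning rate
    lrf=0.01,             # final learning rate factor
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3.0,
    cos_lr=True,
    amp=True,
    device="0,1",
    overlap_mask=False,
    mosaic=0.3,
    copy_paste=0.1,
    hsv_v=0.04,
    fliplr=0.0,           # assumption: "no flips" means both flip probabilities are 0
    flipud=0.0,
    degrees=0.0,          # assumption: "no rotation" means 0 degrees
)

# Uncomment to launch a run once Ultralytics and a dataset config are available.
# from ultralytics import YOLO
# YOLO("yolo26s-seg.pt").train(data="/path/to/data.yaml", **train_args)
```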

Model Size

Item Value
Checkpoint size, best.pt 23,439,133 bytes
Parameters 11,436,269

Evaluation

Testing Data, Factors & Metrics

The available metrics are from the validation run recorded in best.pt, results.csv, and the validation artifacts in this repository.

  • Validation split: book-level validation split
  • Validation groups/books: 12
  • Validation images: 1,468
  • Test split available in audit: 14 groups/books, 1,488 images
  • Metrics reported: box precision, box recall, box mAP, mask precision, mask recall, mask mAP
  • Artifact source: /validationResultsOfMangaModel/results.csv and checkpoint train_metrics
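
The two mAP variants in the list above differ in how strict the overlap requirement is: mAP@0.5 counts a prediction as a match at IoU 0.50 or higher, while mAP@0.5:0.95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below illustrates this with box IoU on made-up coordinates; it is not the evaluation code behind the reported numbers.

```python
def box_iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds used by mAP@0.5:0.95.
IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

# Hypothetical ground truth and a prediction shifted by 5 pixels in both axes.
ground_truth = [100, 100, 200, 200]
prediction = [105, 105, 205, 205]

iou = box_iou(prediction, ground_truth)
passed = [round(t, 2) for t in IOU_THRESHOLDS if iou >= t]
print(f"IoU = {iou:.3f}, matched at {len(passed)} of {len(IOU_THRESHOLDS)} thresholds")
```

A small boundary shift that still passes at IoU 0.5 can fail at the stricter thresholds, which is why the 0.5:0.95 values are lower and more sensitive to mask and box edge quality.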

Recommended factors for further evaluation:

  • unseen manga titles/books,
  • dense vs sparse pages,
  • bordered vs borderless panels,
  • large vs small balloons,
  • small or dense text,
  • heavy screentone regions,
  • low-resolution or compressed pages,
  • overlapping text/balloon/frame regions.

Results

The following values come from the best.pt checkpoint train_metrics field:

Metric Value
Box Precision 0.96521
Box Recall 0.95165
Box mAP@0.5 0.97494
Box mAP@0.5:0.95 0.89988
Mask Precision 0.96564
Mask Recall 0.95026
Mask mAP@0.5 0.97013
Mask mAP@0.5:0.95 0.84573
Validation box loss 0.43638
Validation segmentation loss 0.59429
Validation classification loss 0.26392
Validation DFL loss 0.00241
Fitness 1.74561
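
As a consistency check on the table above, the recorded fitness value equals the sum of the box and mask mAP@0.5:0.95 values. This is an observation about the reported numbers only; how fitness is defined may differ between Ultralytics versions.

```python
box_map_50_95 = 0.89988
mask_map_50_95 = 0.84573
recorded_fitness = 1.74561

# The two mAP@0.5:0.95 values add up to the recorded fitness.
fitness_sum = round(box_map_50_95 + mask_map_50_95, 5)
print(fitness_sum, fitness_sum == recorded_fitness)
```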

The final row in results.csv, epoch 41, records very similar overall metrics:

Metric Epoch 41 Value
Box Precision 0.96489
Box Recall 0.95021
Box mAP@0.5 0.97432
Box mAP@0.5:0.95 0.89907
Mask Precision 0.96627
Mask Recall 0.94811
Mask mAP@0.5 0.96986
Mask mAP@0.5:0.95 0.84459

Per-Class Results

The local artifacts provided here include overall metrics, PR/F1/P/R curves, labels visualization, and confusion matrices. A per-class numeric mAP table was not present in results.csv.

To add per-class metrics, run a validation command that prints or exports per-class results, then update this table:

| Class | Box mAP@0.5 | Box mAP@0.5:0.95 | Mask mAP@0.5 | Mask mAP@0.5:0.95 |
|---|---|---|---|---|
| frame | TODO | TODO | TODO | TODO |
| text | TODO | TODO | TODO | TODO |
| balloon | TODO | TODO | TODO | TODO |

Suggested command when the dataset is available:

yolo segment val \
  model=best.pt \
  data=/path/to/data.yaml \
  imgsz=1280 \
  split=val \
  plots=True

Evaluation Artifacts

If the files are uploaded with this model repository, the following artifacts document the run:

Artifact Purpose
results.csv epoch-by-epoch training and validation metrics
results.png metric curves over training
labels.jpg label distribution visualization
confusion_matrix.png confusion matrix
confusion_matrix_normalized.png normalized confusion matrix
BoxPR_curve.png box precision-recall curve
MaskPR_curve.png mask precision-recall curve
BoxF1_curve.png, MaskF1_curve.png F1 curves

Curves and Visual Results

  • Training curves: results.png (training and validation results)
  • Labels: labels.jpg (label distribution)
  • Confusion matrix: confusion_matrix.png
  • Normalized confusion matrix: confusion_matrix_normalized.png
  • Mask PR curve: MaskPR_curve.png
  • Box PR curve: BoxPR_curve.png

Summary

The model achieves strong validation performance on the book-level validation split, with mask mAP@0.5 of 0.97013 and mask mAP@0.5:0.95 of 0.84573. The model is suitable as a practical manga layout segmentation component, especially for pipelines that need panel, text, and balloon masks before OCR or translation.

For production use, visual inspection is still recommended because manga segmentation quality depends heavily on small text, dense screentones, borderless panels, overlapping regions, and unusual page layouts.

Bias, Risks, and Limitations

This model is specialized for Manga109-style manga pages. It may not generalize well to:

  • Western comics,
  • colored comics,
  • vertical webtoons,
  • very low-resolution scans,
  • pages with unusual layouts,
  • handwritten or highly stylized text,
  • heavily compressed images,
  • non-manga documents,
  • very small text or very thin panel borders.

Known technical limitations:

  • text masks can be sensitive to small font size, dense screentones, and low contrast.
  • balloon masks may be imperfect for irregular balloons, overlapping balloons, or narration boxes.
  • frame predictions can confuse panel borders with artwork lines on complex pages.
  • Validation metrics may not fully capture mask-boundary quality needed for inpainting or redrawing.
  • Even book-level splits may not cover every real-world manga style.

Recommendations

Users should:

  • visually inspect masks before using them in a production translation pipeline,
  • evaluate on their own manga pages before deployment,
  • prefer book/title-level splits for new evaluations,
  • tune confidence and IoU thresholds for their use case,
  • use retina_masks=True when precise masks are needed,
  • combine this model with OCR, reading-order logic, and human review.
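
Threshold tuning does not have to be global. One option is to run inference with a low confidence threshold and then filter per class, since a missed text region and a spurious frame have different costs downstream. The sketch below assumes regions in the structured format shown earlier in this card; the helper name and the threshold values are hypothetical starting points, not tuned recommendations.

```python
# Hypothetical per-class confidence thresholds; tune them on your own pages.
DEFAULT_THRESHOLDS = {"frame": 0.5, "text": 0.25, "balloon": 0.4}


def filter_regions(regions, thresholds=None, fallback=0.25):
    """Keep regions whose confidence meets the threshold for their label."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    return [
        region for region in regions
        if region["confidence"] >= thresholds.get(region["label"], fallback)
    ]


# Hypothetical detections.
candidates = [
    {"id": 1, "label": "frame", "confidence": 0.45},
    {"id": 2, "label": "text", "confidence": 0.30},
    {"id": 3, "label": "balloon", "confidence": 0.41},
]
kept = filter_regions(candidates)
```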

Environmental Impact

Carbon emissions were not measured for this run.

  • Hardware Type: multi-GPU Kaggle environment recorded as device=0,1; exact GPU type not recorded in the checkpoint
  • Hours used: about 11.0 hours of active training (11h 13m overall run time)
  • Cloud Provider: Kaggle
  • Compute Region: Not recorded
  • Carbon Emitted: Not measured

Carbon emissions can be estimated using the Machine Learning Impact calculator: https://mlco2.github.io/impact

Technical Specifications

Model Architecture and Objective

This is a YOLO26s segmentation model with a YOLO-style detection backbone/head and segmentation mask output. The training objective optimizes object detection and instance segmentation losses to predict:

  • bounding boxes,
  • class probabilities,
  • instance segmentation masks.

The checkpoint records:

nc: 3
scale: s
yaml_file: yolo26s-seg.yaml
head: Segment26
stride: 8, 16, 32

Compute Infrastructure

Hardware

  • Training environment: Kaggle
  • Device setting: 0,1
  • Exact GPU model: not recorded in the checkpoint

Software

  • Python: TODO: add exact version if known
  • PyTorch: TODO: add exact training version if known
  • Ultralytics: 8.4.43 recorded in checkpoint

Data and License Notes

The model repository is licensed under MIT. This does not override the licenses, access restrictions, attribution requirements, or redistribution rules of the datasets used to train/evaluate the model.

Users are responsible for checking and following the terms for:

  • MS92/MangaSegmentation
  • ShadowB/Manga109_RegionLevelTextSegmentation

Important notes:

  • Dataset redistribution may be restricted.
  • MangaSegmentation has its own license/citation requirements.
  • The MIT license applies to this model card/model repository content, not necessarily to the original manga images or dataset annotations.

Citation

If you use this model, cite the relevant datasets and papers.

MangaSegmentation

@inproceedings{xie2025advancing,
  title={Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset},
  author={Minshan Xie and Jian Lin and Hanyuan Liu and Chengze Li and Tien-Tsin Wong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Manga109

@article{aizawa2020building,
  title={Building a Manga Dataset "Manga109" with Annotations for Multimedia Applications},
  author={Aizawa, Kiyoharu and Fujimoto, Azuma and Otsubo, Atsushi and Ogawa, Toru and Matsui, Yusuke and Tsubota, Koki and Ikuta, Hikaru},
  journal={IEEE MultiMedia},
  year={2020}
}

This Model

@misc{shadowb_yolo26s_manga_region_segmentation,
  title={YOLO26s Manga Panel, Text, and Balloon Segmentation},
  author={Abdelhadi Marjane},
  year={2026},
  publisher={Hugging Face},
  howpublished={ShadowB/Manga109-panel-Balloon-text-yoloV26-segmentation}
}

Glossary

  • Frame / Panel: A visual manga page region containing a scene or layout unit.
  • Text region: The visible text area, usually passed to OCR.
  • Balloon: A speech bubble, thought bubble, narration bubble, or similar text container.
  • Instance segmentation: A task that detects individual objects and predicts a separate mask for each object instance.
  • mAP: Mean Average Precision, a standard detection/segmentation metric.
  • Book-level split: A split where entire manga titles/books are held out, reducing leakage between train and validation data.

More Information

Model Card Authors

Abdelhadi Marjane

Model Card Contact

Abdelhadi Marjane. CuratorML is the translation workspace this model was trained for; issues and PRs are open.
