mini-text-detection: Khmer & English Text Detection

A YOLO11n-based text detection model fine-tuned to locate and classify text regions in images containing Khmer and English content.
It detects three types of text block (subject, reference, content) and can serve as the first stage of a pipeline that passes the cropped regions to an OCR model (e.g. phonsobon/mini-ocr).


Model Details

| Property | Value |
|---|---|
| Architecture | YOLO11n (nano) |
| Task | Object detection, 3 classes |
| Weights file | khmer-text-detection-mini.pt |
| Framework | Ultralytics / PyTorch |
| Input | RGB image, any size (auto-resized internally) |

Classes

| ID | Name | Khmer | Description |
|---|---|---|---|
| 0 | subject | កម្មវត្ថុ | Title or subject heading |
| 1 | reference | យោង | Reference or citation |
| 2 | content | អត្ថបទ | Main body / paragraph text |

Files

| File | Description |
|---|---|
| khmer-text-detection-mini.pt | Full Ultralytics YOLO model (weights + config) |

Quick Start

Install dependencies

pip install ultralytics huggingface_hub

Run inference

from ultralytics import YOLO
from huggingface_hub import hf_hub_download

# ── Download model ────────────────────────────────────────────────────────────
model_path = hf_hub_download(
    repo_id="phonsobon/mini-text-detection",
    filename="khmer-text-detection-mini.pt",
)

# ── Class names ───────────────────────────────────────────────────────────────
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}

# ── Load & predict ────────────────────────────────────────────────────────────
model = YOLO(model_path)

results = model.predict(
    source="your_image.jpg",   # path, URL, or numpy array
    conf=0.25,                 # confidence threshold
    iou=0.45,                  # NMS IoU threshold
    imgsz=640,
)

# ── Print results ─────────────────────────────────────────────────────────────
for r in results:
    r.show()                                        # display with bounding boxes
    for box in r.boxes:
        cls_id = int(box.cls)
        label  = CLASS_NAMES[cls_id]
        conf   = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"[{label}] conf={conf:.2f}  box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

Filter by class

# Get only subject (heading) boxes
subject_boxes = [b for b in results[0].boxes if int(b.cls) == 0]

# Get only content (body) boxes
content_boxes = [b for b in results[0].boxes if int(b.cls) == 2]
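
When all three classes are needed at once, a single pass that groups detections by class name avoids one list comprehension per class. The following is a minimal sketch, not part of the model's API: `group_by_class` is a hypothetical helper, and the sample detections are made-up values standing in for real model output.

```python
from collections import defaultdict

CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}


def group_by_class(detections):
    """Group (cls_id, conf, (x1, y1, x2, y2)) tuples by class name."""
    groups = defaultdict(list)
    for cls_id, conf, xyxy in detections:
        groups[CLASS_NAMES[cls_id]].append((conf, xyxy))
    return dict(groups)


# Made-up detections for illustration. With a real result they could be built as:
# [(int(b.cls), float(b.conf), tuple(b.xyxy[0].tolist())) for b in results[0].boxes]
detections = [
    (2, 0.91, (40, 200, 600, 520)),
    (0, 0.88, (40, 30, 600, 90)),
    (2, 0.64, (40, 540, 600, 700)),
]

groups = group_by_class(detections)
print({name: len(items) for name, items in groups.items()})
```

Classes with no detections are simply absent from the returned dictionary, so use `groups.get("reference", [])` when a class may be missing.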

Save annotated images

results = model.predict(source="your_image.jpg", save=True, project="runs/detect")
# Saved to runs/detect/predict/ (later runs get a numbered folder such as predict2/)

Batch inference on a folder

results = model.predict(source="path/to/images/", conf=0.25, imgsz=640)
for r in results:
    counts = {name: 0 for name in CLASS_NAMES.values()}
    for box in r.boxes:
        counts[CLASS_NAMES[int(box.cls)]] += 1
    print(r.path, "→", counts)
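
For a whole folder it can also be useful to total the per-class counts across images. A minimal sketch under stated assumptions: `total_counts` is a hypothetical helper, and the class-ID lists below are invented sample data rather than real predictions.

```python
from collections import Counter

CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}


def total_counts(per_image_class_ids):
    """Sum class counts over many images.

    per_image_class_ids: iterable of lists of class IDs, one list per image.
    """
    # Start every class at zero so classes that never appear are still reported.
    totals = Counter({name: 0 for name in CLASS_NAMES.values()})
    for class_ids in per_image_class_ids:
        totals.update(CLASS_NAMES[c] for c in class_ids)
    return dict(totals)


# Invented sample data. With real results this could be built as:
# [[int(b.cls) for b in r.boxes] for r in results]
per_image = [[0, 2, 2], [2], []]
totals = total_counts(per_image)
print(totals)
```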

Crop + OCR Pipeline

Combine this model with phonsobon/mini-ocr for full end-to-end document reading, with each region labelled by type:

from ultralytics import YOLO
from huggingface_hub import hf_hub_download
from PIL import Image

CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}

# ── Load detection model ──────────────────────────────────────────────────────
det_path = hf_hub_download("phonsobon/mini-text-detection", "khmer-text-detection-mini.pt")
detector = YOLO(det_path)

# ── Detect text regions ───────────────────────────────────────────────────────
image_path = "your_image.jpg"
results = detector.predict(source=image_path, conf=0.25, imgsz=640)

img = Image.open(image_path).convert("RGB")

# ── Crop each detected region and tag the file name with its class ───────────
for i, box in enumerate(results[0].boxes):
    cls_id = int(box.cls)
    label = CLASS_NAMES[cls_id]
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())

    crop = img.crop((x1, y1, x2, y2))
    crop.save(f"crop_{i}_{label}.png")
    print(f"Saved crop {i} → class: {label}")
    # → feed each crop to phonsobon/mini-ocr for text recognition
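
The detector does not necessarily return boxes in reading order, so a document-reading pipeline may want to sort the regions before running OCR. The sketch below is one simple heuristic for single-column pages (top to bottom, then left to right within a row). It is not part of this model or of phonsobon/mini-ocr; `reading_order` and `row_tol` are hypothetical names, and the sample boxes are made up.

```python
def reading_order(boxes, row_tol=0.5):
    """Sort (x1, y1, x2, y2, label) boxes top-to-bottom, then left-to-right.

    A box joins the current row when its vertical centre lies within
    row_tol * (height of the row's first box) of that first box's centre.
    """
    by_centre = sorted(boxes, key=lambda b: (b[1] + b[3]) / 2)
    rows = []
    for box in by_centre:
        cy = (box[1] + box[3]) / 2
        if rows:
            first = rows[-1][0]
            first_cy = (first[1] + first[3]) / 2
            first_h = first[3] - first[1]
            if abs(cy - first_cy) <= row_tol * first_h:
                rows[-1].append(box)
                continue
        rows.append([box])

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0]))  # left to right
    return ordered


# Made-up boxes: a subject and a reference on the same line, content below.
boxes = [
    (320, 100, 600, 140, "reference"),
    (40, 300, 600, 500, "content"),
    (40, 95, 300, 135, "subject"),
]
ordered = reading_order(boxes)
print([b[4] for b in ordered])
```

Multi-column layouts would need a column-aware ordering instead; this heuristic only handles rows.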

Input Tips

  • Works on any image size; YOLO resizes internally to 640 px by default.
  • Best results on document photos, screenshots, and scanned pages.
  • Adjust conf (0.1 – 0.5) to trade recall against precision: lower values keep more boxes (higher recall), higher values keep only the more confident ones (higher precision).
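
Because of the internal resize, text in a large scan reaches the model much smaller than it is in the original file. The sketch below estimates that effect. It assumes the resize preserves aspect ratio and maps the longer image side to `imgsz`, and it ignores padding; `resize_scale` and `height_after_resize` are hypothetical helpers, and the page size and text height are illustrative values.

```python
def resize_scale(width, height, imgsz=640):
    """Scale factor that maps the longer image side to imgsz (assumed behaviour)."""
    return imgsz / max(width, height)


def height_after_resize(text_height_px, width, height, imgsz=640):
    """Approximate height of a text line after the internal resize."""
    return text_height_px * resize_scale(width, height, imgsz)


# Illustrative case: a 2480 x 3508 px page with text lines 40 px tall.
at_640 = height_after_resize(40, 2480, 3508, imgsz=640)
at_1280 = height_after_resize(40, 2480, 3508, imgsz=1280)
print(f"imgsz=640:  {at_640:.1f} px")   # 7.3 px
print(f"imgsz=1280: {at_1280:.1f} px")  # 14.6 px
```

If small text is being missed on large pages, passing a larger `imgsz` to `model.predict` is one option to try, at the cost of slower inference.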

Limitations

  • May miss very small text (< ~8 px height in the original image).
  • Not designed for handwritten or heavily stylised/artistic fonts.
  • Performance is best on document-style layouts similar to training data.

Related Model

| Model | Task |
|---|---|
| phonsobon/mini-ocr | Text recognition (CRNN + CTC) for Khmer & English |

License

MIT
