CNN-CTC OCR for North Sámi (sme)

Lightweight CNN + CTC line-level OCR model for North Sámi (sme). Trained from scratch on the Språkbanken synthetic North Sámi corpus.

Architecture: 7-layer SimpleCNN backbone → adaptive column pooling → linear projection → CTC head. No RNN encoder.
Parameters: 5,785,100 (~23 MB checkpoint)
Vocabulary: 395 characters + CTC blank
Input: grayscale line image, 32 × 800 px
Training data: 276,649 lines (Språkbanken synthetic, validation 30,738)
Optimiser: AdamW, lr=1e-4, weight_decay=1e-4, batch=32, 100 epochs, ReduceLROnPlateau, grad-clip 5.0
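
As a rough sanity check on the checkpoint size listed above, the arithmetic below assumes the weights are stored as 32-bit floats and ignores the vocabulary and config that are stored alongside them. The float32 assumption is not stated in this card.

```python
# Back-of-the-envelope checkpoint size, assuming float32 weights (4 bytes each).
# The parameter count comes from the card; the dtype is an assumption.
num_params = 5_785_100
bytes_per_param = 4

size_mb = num_params * bytes_per_param / 1e6  # decimal megabytes
print(f"{size_mb:.1f} MB")  # 23.1 MB
```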

Results

Split	CER	WER	Char acc
Språkbanken synthetic (val, 30,738 lines)	2.04%	7.60%	80.76%
Benchmark test set (1,048 lines)	12.24%	37.35%	87.72%

The 12.24% CER on the benchmark test set is the headline number reported in the accompanying paper(s). The lower 2.04% CER is the in-distribution result on the synthetic validation split, measured during training.
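
For readers who want to reproduce this kind of figure on their own data, here is a minimal sketch of CER and WER as edit distance divided by reference length. It assumes corpus-level aggregation (total edits over total reference length) and whitespace tokenisation for words. The card does not say how the reported numbers were aggregated, so the evaluation code in the source repo may differ. The names `edit_distance` and `error_rate` are illustrative and not taken from the repo.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Corpus-level error rate: total edits / total reference length.

    unit="char" gives CER, unit="word" gives WER (whitespace-split tokens).
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            ref, hyp = ref.split(), hyp.split()
        total_edits += edit_distance(ref, hyp)
        total_len += len(ref)
    return total_edits / total_len if total_len else 0.0


# Toy example: one missing diacritic in a three-word line.
refs = ["Sámegiella lea čáppat"]
hyps = ["Samegiella lea čáppat"]
print(f"CER: {error_rate(refs, hyps):.4f}")
print(f"WER: {error_rate(refs, hyps, unit='word'):.4f}")
```

A single character error turns a whole word into a word error, which is why WER is several times higher than CER in the table above.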

Usage

The checkpoint is a plain torch.save pickle containing the state dict, character vocabulary, and architecture config. Loading requires the model code from the source repo: https://github.com/magwrap/sami-ocr-translate.

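import torch
from huggingface_hub import hf_hub_download
# Model code lives in the source repo; run from the repo root so that `src` is importable.
from src.ocr.models import load_checkpoint

# Download the checkpoint from the Hugging Face Hub (cached locally after the first call).
ckpt_path = hf_hub_download(
    repo_id="magwrap/cnn-ctc-ocr-sme",
    filename="checkpoint_best.pt",
)
# Rebuild the model and recover the vocabulary mappings and architecture config.
model, char_to_idx, idx_to_char, config = load_checkpoint(ckpt_path, device="cpu")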

For preprocessing and greedy CTC decoding see src/ocr/pipeline.py in the source repo.
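
For orientation, greedy CTC decoding can be sketched in a few lines of plain Python. This is an illustration of the general technique and not the code in src/ocr/pipeline.py. It assumes the per-column argmax indices have already been computed and that the CTC blank has index 0. The blank index used by this checkpoint is not stated here, so check the repo before relying on it. The function name `greedy_ctc_decode` is hypothetical.

```python
def greedy_ctc_decode(frame_ids, idx_to_char, blank=0):
    """Collapse a per-frame argmax sequence into text.

    Standard greedy CTC rule: merge consecutive repeats, then drop blanks.
    `blank=0` is an assumption, not taken from the checkpoint.
    """
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit only when the label changes and is not the blank symbol.
        if idx != prev and idx != blank:
            chars.append(idx_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary and frame sequence (not the model's real vocabulary).
toy_vocab = {1: "a", 2: "b"}
frames = [0, 1, 1, 0, 1, 2, 2, 0]
print(greedy_ctc_decode(frames, toy_vocab))  # aab
```

The blank between the two runs of index 1 is what lets the decoder output a doubled character such as "aa". Without it the repeats would be merged into one.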

Training

Full per-epoch loss, CER, and WER curves are in train.log. The training command is in config.json. The best checkpoint was saved at epoch 97 (val CER 0.0204, i.e. 2.04%).

Intended use and limitations

Citation

If you use this model, please cite the accompanying paper(s) — see https://github.com/magwrap/sami-ocr-translate for the current reference.

License

Apache-2.0. Training data: Språkbanken synthetic North Sámi corpus — see source dataset for its own terms.
