CNN-CTC OCR for North Sámi (sme)
Lightweight CNN + CTC line-level OCR model for North Sámi (sme). Trained from scratch on the Språkbanken synthetic North Sámi corpus.
- Architecture: 7-layer SimpleCNN backbone → adaptive column pooling → linear projection → CTC head. No RNN encoder.
- Parameters: 5,785,100 (~23 MB checkpoint)
- Vocabulary: 395 characters + CTC blank
- Input: grayscale line image, 32 × 800 px
- Training data: 276,649 lines of the Språkbanken synthetic corpus; validation: 30,738 lines
- Optimiser: AdamW, lr=1e-4, weight_decay=1e-4, batch=32, 100 epochs, ReduceLROnPlateau, grad-clip 5.0
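As a quick sanity check on the figures listed above, the sketch below recomputes the approximate checkpoint size, the number of CTC output classes, and the validation share. It assumes weights are stored as 32-bit floats and that the 276,649 training lines do not include the 30,738 validation lines; neither is stated explicitly in this card, and the variable names are illustrative only.

```python
# Figures taken from the model summary above.
num_params = 5_785_100
vocab_size = 395
train_lines = 276_649
val_lines = 30_738

# Assumption: weights are stored as float32 (4 bytes per parameter).
BYTES_PER_PARAM = 4
checkpoint_mb = num_params * BYTES_PER_PARAM / 1e6

# The CTC head emits one score per character plus one for the blank symbol.
num_classes = vocab_size + 1

# Assumption: training and validation lines are disjoint sets.
val_fraction = val_lines / (train_lines + val_lines)

print(f"approx. checkpoint size: {checkpoint_mb:.1f} MB")
print(f"CTC output classes: {num_classes}")
print(f"validation share: {val_fraction:.1%}")
```

Under these assumptions the size works out to roughly 23 MB, in line with the checkpoint size listed above, and the validation lines make up about a tenth of the data.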
Results
| Split | CER | WER | Char acc |
|---|---|---|---|
| Språkbanken synthetic (val, 30,738 lines) | 2.04% | 7.60% | 80.76% |
| Benchmark test set (1,048 lines) | 12.24% | 37.35% | 87.72% |
The 12.24% CER on the benchmark test set is the headline number reported in the accompanying papers. The lower 2.04% CER is the in-distribution result on the synthetic validation split, measured during training.
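For readers unfamiliar with the metrics, here is a minimal sketch of how CER and WER are commonly computed: Levenshtein edit distance divided by reference length, at character and word level respectively. It is the standard definition rather than the evaluation code behind the table; the function names and example strings are illustrative, and the source repo may aggregate or normalise text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total character edits / total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def wer(refs, hyps):
    """Corpus-level WER: total word edits / total reference words."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return edits / sum(len(r.split()) for r in refs)


# Illustrative strings only, not taken from the benchmark.
refs = ["Sámegiella lea čáppa giella"]
hyps = ["Samegiella lea čáppa giela"]
print(f"CER={cer(refs, hyps):.3f}  WER={wer(refs, hyps):.3f}")
```

In this toy example two of the four words contain an error, so the WER is far higher than the CER. The same effect explains why the WER column in the table is several times larger than the CER column.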
Usage
The checkpoint is a plain torch.save pickle containing the state dict, character vocabulary, and architecture config. Loading requires the model code from the source repo: https://github.com/magwrap/sami-ocr-translate.
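```python
import torch
from huggingface_hub import hf_hub_download

# load_checkpoint comes from the source repo, which must be importable.
from src.ocr.models import load_checkpoint

# Download the best checkpoint from the Hugging Face Hub.
ckpt_path = hf_hub_download(
    repo_id="magwrap/cnn-ctc-ocr-sme",
    filename="checkpoint_best.pt",
)

# Returns the model together with the vocabulary mappings and architecture config.
model, char_to_idx, idx_to_char, config = load_checkpoint(ckpt_path, device="cpu")
```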
For preprocessing and greedy CTC decoding see src/ocr/pipeline.py in the source repo.
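To illustrate what greedy CTC decoding does, the sketch below collapses consecutive repeated predictions and then drops blanks. It is not the code from src/ocr/pipeline.py: the function name, the toy vocabulary, and the choice of index 0 for the blank symbol are assumptions made for this example, and the input is taken to be the per-column argmax of the model's class scores.

```python
def greedy_ctc_decode(frame_ids, idx_to_char, blank_idx=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank_idx:
            chars.append(idx_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary (hypothetical); the real mapping is stored in the checkpoint.
idx_to_char = {1: "s", 2: "á", 3: "m", 4: "i"}

# Hypothetical per-column argmax indices; 0 is assumed to be the blank.
frame_ids = [0, 1, 1, 0, 2, 2, 2, 3, 0, 0, 4, 4]
print(greedy_ctc_decode(frame_ids, idx_to_char))  # prints "sámi"
```

A blank between two identical indices keeps them separate, which is how CTC represents doubled letters: `[1, 0, 1]` decodes to "ss", while `[1, 1]` decodes to "s".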
Training
Full per-epoch loss / CER / WER curves are in train.log. The training command is in config.json. Best checkpoint was saved at epoch 97 (val CER 0.0204).
Intended use and limitations
Citation
If you use this model, please cite the accompanying paper(s) — see https://github.com/magwrap/sami-ocr-translate for the current reference.
License
Apache-2.0. Training data: Språkbanken synthetic North Sámi corpus — see source dataset for its own terms.