diacnet-1.0

diacnet-1.0 restores diacritics and accents to text that was typed or scraped without them, across 10 languages. It is fine-tuned from google/byt5-small, which works at the character/byte level rather than the word level, so it handles Yoruba tone marks, Vietnamese combining diacritics, and Polish/Turkish special characters through the same mechanism, with no per-language vocabulary needed.

  • Single joint model for all 10 languages: a language tag prefix (<yor>, <vie>, etc.) tells the model which diacritic inventory to apply, so no separate models or upstream language-ID step are required.
  • Median CER of ~0.02 across most languages (see Benchmarks): near-perfect restoration on well-formed input.
  • Fully self-supervised training with no manual annotation. Clean, already-diacritized text is the target; diacritics are deterministically stripped to create the training input.
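
The self-supervised pairing described in the last bullet can be sketched as follows. This is a minimal illustration, not the training code behind diacnet-1.0: the function names, the `EXTRA_MAP` table, and the choice of NFD decomposition are assumptions, and letters that are arguably distinct base letters (for example the Hausa hooked consonants) are left untouched.

```python
import unicodedata

# Letters with no canonical decomposition. This table is an illustrative
# assumption, not the mapping used to train diacnet-1.0.
EXTRA_MAP = str.maketrans({
    "ł": "l", "Ł": "L",   # Polish
    "đ": "d", "Đ": "D",   # Vietnamese
    "ı": "i",             # Turkish dotless i
})


def strip_diacritics(text: str) -> str:
    """Deterministically remove combining marks from `text`."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero canonical combining class.
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", no_marks.translate(EXTRA_MAP))


def make_pair(tag: str, sentence: str) -> tuple[str, str]:
    """Return (model input, target) for one clean, diacritized sentence."""
    return f"<{tag}> {strip_diacritics(sentence)}", sentence
```

Because the stripping is a pure function of the clean sentence, any amount of already-diacritized text yields training pairs without annotation.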

๐Ÿ—’๏ธ Model Details

Base model: google/byt5-small
Architecture: Byte-level seq2seq (T5)
Max sequence length: 256 bytes (trained on sentence-level examples)
Languages: 10 (see below)
Training data: olaverse/qg-passages-multi, split into sentences
Training: 3 epochs, batch size 16 × grad-accum 2, lr 1e-4

Languages: yo vi ig ha pl tr pt es fr it (ISO 639-1; ISO 639-3: yor vie ibo hau pol tur por spa fra ita)

Scoped deliberately to languages where diacritics are lexically meaningful. The model is not applied to the other 15 languages in the source corpus (e.g. Swahili, Zulu, Amharic, Japanese), where diacritic restoration either doesn't apply or isn't the right frame for the script.

๐Ÿƒ Usage

from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("olaverse/diacnet-1.0")
model = T5ForConditionalGeneration.from_pretrained("olaverse/diacnet-1.0")

text = "<yor> se eranko naa si gbo o?"
inputs = tok(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(output_ids[0], skip_special_tokens=True))
# ṣé ẹranko náà sì gbọ́ ọ?

Prefix input text with the target language tag (<yor>, <vie>, <ibo>, <hau>, <pol>, <tur>, <por>, <spa>, <fra>, <ita>). Works best on single sentences or short passages; see Known limitations for longer text.
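
If upstream code carries two-letter ISO 639-1 codes, a small helper can build the tagged input. This is a sketch: `ISO1_TO_TAG` and `with_language_tag` are hypothetical names, and the `<tag> text` layout with a single space is taken from the usage example above.

```python
# ISO 639-1 -> ISO 639-3 tag, following the Languages line of this card.
ISO1_TO_TAG = {
    "yo": "yor", "vi": "vie", "ig": "ibo", "ha": "hau", "pl": "pol",
    "tr": "tur", "pt": "por", "es": "spa", "fr": "fra", "it": "ita",
}


def with_language_tag(text: str, lang: str) -> str:
    """Prefix `text` with the model's language tag, e.g. '<yor> ...'."""
    code = lang.strip().lower()
    code = ISO1_TO_TAG.get(code, code)  # accept either code length
    if code not in ISO1_TO_TAG.values():
        raise ValueError(f"unsupported language: {lang!r}")
    return f"<{code}> {text.strip()}"
```

The returned string can be passed to `tok(...)` exactly as `text` is in the example above. Rejecting unknown codes early avoids silently sending an untrained tag to the model.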

📊 Benchmarks

Character error rate (CER, lower is better) on a held-out validation split (5% of training data, not seen during training). Sentence-level examples.

Language  CER    n
por       0.013  750
spa       0.013  845
fra       0.016  587
pol       0.016  1,137
ita       0.022  269
ibo       0.030  1,672
tur       0.033  1,206
hau       0.038  15
vie       0.063  1,832
yor       0.110  1,583
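
For readers who want to compute a comparable score on their own data, CER can be sketched as Levenshtein distance divided by reference length. This card does not state the exact evaluation script, so the code-point granularity and the NFC normalization below are assumptions, and `edit_distance` and `cer` are illustrative names.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over Unicode code points (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    # Normalize so precomposed and decomposed spellings compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition a single wrong tone mark in a short word already contributes a visible error, which is worth keeping in mind when reading the per-language scores.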

โš ๏ธ Known limitations

  • Yoruba CER is notably higher than the other 9 languages: about 1.7x the next-highest score (Vietnamese, 0.063) and nearly 3x the third-highest (Hausa, 0.038). Qualitative inspection shows this is driven almost entirely by genuine tonal ambiguity, not model weakness: Yoruba diacritics mark tone (low/mid/high pitch), and the same base letter sequence can correspond to multiple valid tone patterns depending on word sense or context, sometimes unrecoverable from text alone. Example from validation: target ọ́fíìsì (high tone) vs. predicted ọ̀fíìsì (low tone) on an English loanword ("office"); everything else in the same sentence, including several other tone-marked words, was restored correctly. Most Yoruba errors are single-diacritic misses like this on an otherwise correctly restored sentence, not systematic failure.
  • Hausa's 0.038 CER is based on only 15 validation examples, too small a sample to treat as a reliable estimate. Hausa was underrepresented in the source corpus relative to the other 9 languages; treat this number as indicative at best until evaluated on more data.
  • Trained on machine/teacher-generated text (Aya-Collection-derived passages), not human-authored or casually typed text, so accuracy on real-world messy input (mixed scripts, typos, non-standard spelling) is untested.
  • Trained and evaluated on sentence-length input (median 58 bytes, p90 162 bytes). Longer multi-sentence passages should be split into sentences before inference for best results, rather than passed in as one long input.
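
The sentence-splitting advice in the last bullet can be sketched as below. The regex splitter is deliberately naive (it will mis-split abbreviations), `split_sentences`, `prepare_inputs`, and `oversized` are hypothetical helpers, and counting the language tag toward the 256-byte limit is an assumption.

```python
import re

MAX_BYTES = 256  # max sequence length listed under Model Details

# Split on whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(passage: str) -> list[str]:
    """Naive sentence splitter; swap in a proper segmenter if available."""
    parts = _SENTENCE_END.split(passage.strip())
    return [p.strip() for p in parts if p.strip()]


def prepare_inputs(passage: str, tag: str) -> list[str]:
    """Return one tagged model input per sentence."""
    return [f"<{tag}> {s}" for s in split_sentences(passage)]


def oversized(inputs: list[str]) -> list[str]:
    """Inputs whose UTF-8 encoding exceeds MAX_BYTES."""
    return [s for s in inputs if len(s.encode("utf-8")) > MAX_BYTES]
```

Each returned string can then be restored independently and the outputs joined with spaces; anything reported by `oversized` probably needs a finer split first.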

Training data & licensing

Fine-tuned from google/byt5-small (Apache-2.0) on olaverse/qg-passages-multi (Apache-2.0), split into sentences and paired with a diacritic-stripped copy of each sentence as the self-supervised training input. Released under Apache-2.0.

Citation

@misc{diacnet-1.0,
  title  = {diacnet-1.0},
  author = {Olaverse},
  year   = {2026},
  url    = {https://huggingface.co/olaverse/diacnet-1.0}
}