diacnet-1.0

diacnet-1.0 restores diacritics/accents to text that's been typed or scraped without them, across 10 languages. It is fine-tuned from google/byt5-small, which operates on bytes rather than word-level tokens, so it handles Yoruba tone marks, Vietnamese combining diacritics, and Polish/Turkish special characters through the same mechanism, with no per-language vocabulary needed.

Single joint model for all 10 languages: a language tag prefix (<yor>, <vie>, etc.) tells the model which diacritic inventory to apply, so no separate models or upstream language-ID step are required.
CER between 0.013 and 0.038 for 8 of the 10 languages (see Benchmarks), which is near-perfect restoration on well-formed input.
Fully self-supervised training, with no manual annotation. Clean, already-diacritized text is the target; diacritics are deterministically stripped to create the training input.
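
The card does not publish the stripping function used to build the training inputs, so the following is only a minimal sketch of one deterministic approach: decompose to NFD, drop combining marks, and map a few letters that have no canonical decomposition. The `strip_diacritics` name and the contents of the `_NO_DECOMPOSITION` table are assumptions for illustration, not the project's actual code.

```python
import unicodedata

# Hypothetical, non-exhaustive table: these letters carry no combining mark,
# so NFD decomposition alone leaves them untouched.
_NO_DECOMPOSITION = str.maketrans({
    "ł": "l", "Ł": "L",   # Polish
    "đ": "d", "Đ": "D",   # Vietnamese
    "ı": "i",             # Turkish dotless i
})


def strip_diacritics(text: str) -> str:
    """Return `text` with combining marks removed (assumed stripping rule)."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers accents, tone marks and dots below.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Handle letters that NFD cannot decompose, then recompose what is left.
    return unicodedata.normalize("NFC", without_marks.translate(_NO_DECOMPOSITION))
```

Under this rule, each clean sentence would serve as the target and `strip_diacritics(sentence)` as the basis for the model input.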

🗒️ Model Details


Base model	`google/byt5-small`
Architecture	Byte-level seq2seq (T5)
Max sequence length	256 bytes (trained on sentence-level examples)
Languages	10 (see below)
Training data	`olaverse/qg-passages-multi`, split into sentences
Training	3 epochs, batch size 16 × grad-accum 2, lr 1e-4
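
The training script is not part of this card. As a rough sketch, the hyperparameters in the table could be written as the dictionary below, whose keys follow the field names of `transformers.Seq2SeqTrainingArguments`. Treating the batch size as per-device on a single device is an assumption, and so is the resulting effective batch size.

```python
# Hyperparameters taken from the Model Details table.
training_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
}

# Assumption: a single device, so the effective batch size is
# per-device batch size times gradient accumulation steps.
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
```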

Languages: yo vi ig ha pl tr pt es fr it (ISO 639-1; ISO 639-3: yor vie ibo hau pol tur por spa fra ita)

Scoped deliberately to languages where diacritics are lexically meaningful — not applied to the other 15 languages in the source corpus (e.g. Swahili, Zulu, Amharic, Japanese), where diacritic restoration either doesn't apply or isn't the right frame for the script.

🏃 Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the byte-level tokenizer and the fine-tuned model from the Hub
tok = AutoTokenizer.from_pretrained("olaverse/diacnet-1.0")
model = T5ForConditionalGeneration.from_pretrained("olaverse/diacnet-1.0")

# The language tag prefix selects the diacritic inventory (here: Yoruba)
text = "<yor> se eranko naa si gbo o?"
inputs = tok(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(output_ids[0], skip_special_tokens=True))
# ṣé ẹranko náà sì gbọ́ ọ?
```

Prefix input text with the target language tag (<yor>, <vie>, <ibo>, <hau>, <pol>, <tur>, <por>, <spa>, <fra>, <ita>). Works best on single sentences or short passages — see Known Limitations for longer text.
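
A small helper can make the tagging step harder to get wrong. The sketch below is hypothetical (the `tag_input` name and the `LANG_TAGS` table are not part of the released model); it accepts either the ISO 639-1 or the ISO 639-3 code listed under Model Details and produces the `<tag> text` format used in the example above.

```python
# ISO 639-1 -> ISO 639-3 codes, as listed under Model Details.
LANG_TAGS = {
    "yo": "yor", "vi": "vie", "ig": "ibo", "ha": "hau", "pl": "pol",
    "tr": "tur", "pt": "por", "es": "spa", "fr": "fra", "it": "ita",
}


def tag_input(text: str, lang: str) -> str:
    """Prefix `text` with the language tag for `lang` (hypothetical helper)."""
    # Accept a two-letter code, or a three-letter code passed through unchanged.
    code = LANG_TAGS.get(lang, lang)
    if code not in LANG_TAGS.values():
        raise ValueError(f"unsupported language: {lang!r}")
    return f"<{code}> {text.strip()}"
```

The returned string can be passed to the tokenizer in place of the hand-written `text` value.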

📊 Benchmarks

Character error rate (CER, lower is better) on a held-out validation split: 5% of the sentence-level examples, not seen during training. n is the number of validation examples per language.

Language	CER	n
por	0.013	750
spa	0.013	845
fra	0.016	587
pol	0.016	1,137
ita	0.022	269
ibo	0.030	1,672
tur	0.033	1,206
hau	0.038	15
vie	0.063	1,832
yor	0.110	1,583
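
The card does not say which CER implementation produced these numbers. For readers who want to score their own outputs, here is a minimal sketch assuming the common definition: Levenshtein distance between prediction and reference, divided by the reference length. Comparing NFC-normalized code points is also an assumption; a different normalization would shift the values slightly.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance, keeping only one row of the table at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` against `reference`."""
    # Normalize both sides so equivalent encodings of a letter compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

With this definition, a single wrong tone mark in a 60-character sentence costs roughly 0.017, which helps put the per-language scores in context.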

⚠️ Known Limitations

Yoruba CER is notably higher than the other 9 languages: about 1.7x the next-highest score (Vietnamese, 0.063). Qualitative inspection shows this is driven almost entirely by genuine tonal ambiguity, not model weakness. Yoruba diacritics mark tone (low/mid/high pitch), and the same base letter sequence can correspond to multiple valid tone patterns depending on word sense or context, sometimes unrecoverable from text alone. Example from validation: target ọ́fíìsì (high tone) vs. predicted ọ̀fíìsì (low tone) on an English loanword ("office"); everything else in the same sentence, including several other tone-marked words, was restored correctly. Most Yoruba errors are single-diacritic misses like this on an otherwise correctly restored sentence, not systematic failure.
Hausa's 0.038 CER is based on only 15 validation examples — too small a sample to treat as a reliable estimate. Hausa was underrepresented in the source corpus relative to the other 9 languages; treat this number as indicative at best until evaluated on more data.
Trained on machine/teacher-generated text (Aya-Collection-derived passages), not human-authored or casually-typed text — accuracy on real-world messy input (mixed scripts, typos, non-standard spelling) is untested.
Trained and evaluated on sentence-length input (median 58 bytes, p90 162 bytes). Longer multi-sentence passages should be split into sentences before inference for best results, rather than passed in as one long input.
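
The sentence-splitting advice above can be sketched as follows. This is a naive, hypothetical pre-processing step, not something shipped with the model: `split_sentences` breaks on sentence-final punctuation followed by whitespace, and `chunk_by_bytes` further divides any sentence whose UTF-8 encoding exceeds a byte budget. Both function names and the regex are assumptions; a language-aware sentence segmenter would likely do better.

```python
import re

MAX_BYTES = 256  # max sequence length from Model Details


def split_sentences(text: str) -> list[str]:
    """Naively split after '.', '!', '?' or '…' followed by whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def chunk_by_bytes(sentence: str, max_bytes: int = MAX_BYTES) -> list[str]:
    """Break an over-long sentence at spaces so chunks fit `max_bytes`.

    A single word longer than `max_bytes` is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be tagged and restored separately, and the outputs joined with spaces. Because the language tag and the end-of-sequence token also consume sequence length, a budget somewhat below 256 may be the safer choice.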

Training data & licensing

Fine-tuned from google/byt5-small (Apache-2.0) on olaverse/qg-passages-multi (Apache-2.0), split into sentences and paired with a diacritic-stripped copy of each sentence as the self-supervised training input. Released under Apache-2.0.

Citation

@misc{diacnet-1.0,
  title  = {diacnet-1.0},
  author = {Olaverse},
  year   = {2026},
  url    = {https://huggingface.co/olaverse/diacnet-1.0}
}

Model size: 0.3B params (Safetensors, tensor type F32)
