DiacNetYorX

GitHub Repository

DiacNetYorX is a state-of-the-art transformer-based sequence classifier fine-tuned on top of castorini/afriberta_large for Yoruba tonal diacritization.

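Rather than predicting over the full vocabulary, the model predicts, for each plain (undiacritized) word, an index from 0 to 7 into that word's list of candidate diacritized forms. Restricting the output to at most eight classes per word shrinks the search space, reduces the risk of overfitting, and lets rare words be handled through their candidate lists.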
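The candidate-index idea can be sketched as follows. This is only an illustration under stated assumptions, not the model's actual decoding code: the candidate table, the function name pick_candidate, and the fallback for unknown words are all hypothetical, and the scores stand in for whatever per-word logits the classifier produces.

```python
from typing import Dict, List, Sequence

MAX_CANDIDATES = 8  # candidate indices run from 0 to 7

# Hypothetical candidate table: plain word -> diacritized forms (illustrative only).
CANDIDATES: Dict[str, List[str]] = {
    "ojo": ["ojo", "òjò", "ọjọ́"],
    "oja": ["ọjà", "ọ̀já"],
}


def pick_candidate(
    plain_word: str,
    scores: Sequence[float],
    candidates: Dict[str, List[str]] = CANDIDATES,
) -> str:
    """Return the candidate form with the highest score among valid indices."""
    options = candidates.get(plain_word, [])[:MAX_CANDIDATES]
    if not options or not scores:
        # Assumed fallback: leave unknown words unchanged.
        return plain_word
    # Ignore scores for indices that have no candidate behind them.
    valid_scores = scores[: len(options)]
    best = max(range(len(valid_scores)), key=lambda i: valid_scores[i])
    return options[best]
```

Note that a high score at an index beyond the end of a word's candidate list is ignored, so the prediction always maps back to a real candidate form.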

Model Details

  • Base Model: castorini/afriberta_large (125M parameters)
  • Model Type: Transformer Sequence Classification (Candidate Index Ranking)
  • File Size: 503.56 MB (diacnet_yor_x.pt)
  • Metrics:
    • Validation Word Accuracy: 82.46%
    • Test Word Accuracy: 78.26%
  • Dependencies: PyTorch, Transformers
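The card does not describe the evaluation protocol behind the word accuracy figures. A common definition is the fraction of words whose predicted form exactly matches the reference, and the sketch below assumes that definition. The function name word_accuracy and the NFC normalization step are assumptions added for illustration; normalization matters because the same Yoruba character can be encoded in composed or decomposed form.

```python
import unicodedata


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace-separated words that match exactly after NFC normalization."""
    pred_words = unicodedata.normalize("NFC", predicted).split()
    ref_words = unicodedata.normalize("NFC", reference).split()
    if len(pred_words) != len(ref_words):
        raise ValueError("predicted and reference must contain the same number of words")
    if not ref_words:
        return 0.0
    matches = sum(p == r for p, r in zip(pred_words, ref_words))
    return matches / len(ref_words)
```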

Usage

The model is loaded and used through the unified olaverse SDK wrapper, which automatically downloads the weights and loads the Transformer model in the background:

from olaverse.nlp.diacritizer import Diacritizer

diacritizer = Diacritizer(model="diacnet-yor-x")
text = "Ojo lo si oja lana"
print(diacritizer.restore(text))
# Output: "Ọjọ́ ló sí ọjà lànà"

Files

  • diacnet_yor_x.pt: PyTorch model weights.
  • diacnet_yor_x_vocab.json: The word candidate list mapping.
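The schema of diacnet_yor_x_vocab.json is not documented here. Assuming it is a JSON object that maps each plain word to a list of its candidate forms, a loader could validate that every list fits within the eight candidate slots. The sample data and the function name load_candidates below are hypothetical, and the real file may be structured differently.

```python
import json
from typing import Dict, List

# Hypothetical sample in the assumed schema: plain word -> list of candidate forms.
SAMPLE_JSON = '{"oja": ["oja", "ọjà"], "lana": ["lana", "lànà"]}'


def load_candidates(raw: str, max_candidates: int = 8) -> Dict[str, List[str]]:
    """Parse the mapping and check that each word has 1 to max_candidates forms."""
    table = json.loads(raw)
    for word, forms in table.items():
        if not isinstance(forms, list) or not 1 <= len(forms) <= max_candidates:
            raise ValueError(f"invalid candidate list for {word!r}")
    return table


table = load_candidates(SAMPLE_JSON)
```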
