tatar-morph-rubert / README.md
metadata
language:
  - tt
  - ru
license: apache-2.0
library_name: transformers
tags:
  - tatar
  - morphology
  - token-classification
  - rubert
  - turkic-languages
  - seqeval
datasets:
  - TatarNLPWorld/tatar-morphological-corpus
metrics:
  - accuracy
  - f1
  - precision
  - recall
widget:
  - text: Мин татарча сөйләшәм
    example_title: Simple sentence
  - text: Кичә мин дусларым белән паркка бардым
    example_title: Complex sentence
  - text: Татарстан — Россия Федерациясе составындагы республика
    example_title: Definition
model-index:
  - name: tatar-morph-rubert
    results:
      - task:
          type: token-classification
          name: Morphological Analysis
        dataset:
          name: TatarNLPWorld/tatar-morphological-corpus
          type: TatarNLPWorld/tatar-morphological-corpus
          split: test
          revision: main
        metrics:
          - type: accuracy
            value: 0.9813
            name: Token Accuracy
          - type: f1
            value: 0.9813
            name: F1-micro
          - type: f1
            value: 0.4737
            name: F1-macro
          - type: precision
            value: 0.9813
            name: Precision (micro)
          - type: recall
            value: 0.9813
            name: Recall (micro)

Model Card for tatar-morph-rubert

RuBERT (Russian BERT) fine‑tuned for morphological analysis of the Tatar language – token‑level prediction of full morphological tags (including part‑of‑speech, number, case, possession, etc.). This model is part of the TatarNLPWorld collection of Turkic and low‑resource language tools.

Model Details

Model Description

  • Developed by: Arabov Mullosharaf Kurbonovich (TatarNLPWorld community)
  • Model type: Transformer‑based token classification (fine‑tuned RuBERT)
  • Language(s) (NLP): Tatar (tt), with some residual Russian influence from the base model
  • License: Apache 2.0
  • Finetuned from model: DeepPavlov/rubert-base-cased
  • Original repository: TatarNLPWorld/tatar-morph-rubert

Uses

Direct Use

The model performs token‑level morphological tagging of Tatar sentences. Given a raw sentence, it returns a list of tokens with the predicted full morphological tags (e.g., N+Sg+Nom, V+Past+3, PUNCT).
Example use cases:

  • Linguistic research and corpus annotation
  • Preprocessing for downstream Tatar NLP tasks (machine translation, information extraction)
  • Educational tools for learning Tatar morphology

Downstream Use

The predicted tags can be used as features in higher‑level systems:

  • Dependency parsing
  • Named entity recognition
  • Text‑to‑speech (grapheme‑to‑phoneme conversion)

Out-of-Scope Use

The model is not intended for:

  • Languages other than Tatar (it will still emit tags for non‑Tatar input, but they are unreliable)
  • Grammatical error correction (it only labels existing tokens)
  • Dialectal or historical forms not present in the training corpus

Bias, Risks, and Limitations

  • Training data bias: The model was fine‑tuned on a 60k‑sentence subset of the Tatar morphological corpus, which may under‑represent certain genres (e.g., spoken language, very informal texts) and rare morphological phenomena.
  • Tokenization mismatch: Because the base tokenizer (RuBERT) is not specifically trained on Tatar, some rare words may be split into subwords in a linguistically suboptimal way, potentially affecting tag prediction.
  • Computational cost: The model is a full‑size BERT (~180M parameters) and may be too heavy for real‑time applications on CPU. Consider using the DistilBERT version for faster inference.

Recommendations

  • Users should evaluate the model on their own domain data before deployment.
  • For highly infrequent word forms, manual verification of predictions is advised.
  • The model may reflect social biases present in the training corpus; use responsibly.

How to Get Started with the Model

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model_name = "TatarNLPWorld/tatar-morph-rubert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Option A: using the token-classification pipeline
pipe = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")
sentence = "Мин татарча сөйләшәм."
predictions = pipe(sentence)
for pred in predictions:
    print(f"{pred['word']}: {pred['entity']}")

# Option B: manual inference (see full example in the repository)
```

For a full inference example with proper word alignment, check the model card appendix or the demo space.
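The manual path (Option B) hinges on mapping subword predictions back to words. Below is a minimal sketch of that alignment step, assuming a fast tokenizer's word_ids() mapping; `align_to_words` and the toy inputs are illustrative, not part of the repository:

```python
# Collapse subword predictions to one label per word, taking the
# label of each word from its FIRST subword (matching the training
# scheme described in this card).

def align_to_words(word_ids, pred_label_ids, id2label):
    labels = []
    prev = None
    for wid, pid in zip(word_ids, pred_label_ids):
        if wid is None or wid == prev:   # special token or continuation subword
            prev = wid
            continue
        labels.append(id2label[pid])
        prev = wid
    return labels

# With the real model (requires download), the pieces would be:
#   enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
#   pred_ids = model(**enc).logits.argmax(-1)[0].tolist()
#   tags = align_to_words(enc.word_ids(), pred_ids, model.config.id2label)

# Toy illustration: four words, the third split into two subwords.
word_ids = [None, 0, 1, 2, 2, 3, None]          # [CLS] ... [SEP]
pred_ids = [0, 1, 2, 3, 3, 4, 0]
id2label = {0: "X", 1: "PRON", 2: "ADV", 3: "V+Pres+1Sg", 4: "PUNCT"}
print(align_to_words(word_ids, pred_ids, id2label))
```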

Training Details

Training Data

The model was fine‑tuned on a 60,000‑sentence subset of the TatarNLPWorld/tatar-morphological-corpus.

  • Total sentences (after filtering empty): 59,992
  • Train / validation / test split: 47,993 / 5,999 / 6,000 sentences
  • Tag set size: 1,181 unique morphological tags (full tag sequences, e.g., N+Sg+Nom, V+Past+3, PUNCT)
  • Sampling: Shuffled with seed 42; no further filtering was applied.
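The split above can be reproduced with a plain seeded shuffle (a sketch; the order in which the test and validation slices are carved off is an assumption, only the sizes come from this card):

```python
import random

# Stand-in for the 59,992 filtered sentences.
sents = list(range(59_992))
random.Random(42).shuffle(sents)        # seed 42, as stated above

test  = sents[:6_000]
val   = sents[6_000:6_000 + 5_999]
train = sents[6_000 + 5_999:]
print(len(train), len(val), len(test))  # 47993 5999 6000
```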

Training Procedure

Preprocessing

  • Sentences and their token‑level tags were extracted from the corpus using the official processing script.
  • For transformer models, we used the tokenizer’s is_split_into_words=True mode and aligned labels to the first subword token of each word (-100 for other subwords).
  • Maximum sequence length: 128 tokens (longer sentences were truncated; the median sentence length in the corpus is 6 tokens, so truncation rarely occurs).
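The first-subword labeling rule can be sketched as follows (`align_labels` is an illustrative helper, not the official processing script):

```python
# The first subword of each word keeps the word's tag id; every other
# subword (and special tokens like [CLS]/[SEP]) gets -100 so the
# cross-entropy loss ignores it.

IGNORE = -100

def align_labels(word_ids, word_label_ids):
    out, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            out.append(IGNORE)
        else:
            out.append(word_label_ids[wid])
        prev = wid
    return out

# Example: three words, the second split into two subwords.
print(align_labels([None, 0, 1, 1, 2, None], [7, 8, 9]))
```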

Training Hyperparameters

  • Model: DeepPavlov/rubert-base-cased
  • Batch size: 16 (per device) × 1 gradient accumulation step
  • Learning rate: 2e-5
  • Optimizer: AdamW (weight decay 0.01)
  • Warmup steps: 500
  • Number of epochs: 4
  • Mixed precision: FP16 (enabled on GPU)
  • Evaluation strategy: per epoch
  • Save strategy: per epoch, keep best model based on validation token accuracy
  • Early stopping: not used (full 4 epochs)
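The hyperparameters above map onto a transformers TrainingArguments configuration roughly as follows (a sketch, not the exact training script; argument names follow recent transformers releases, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tatar-morph-rubert",      # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    weight_decay=0.01,                    # AdamW is the default optimizer
    warmup_steps=500,
    num_train_epochs=4,
    fp16=True,                            # mixed precision on GPU
    eval_strategy="epoch",                # `evaluation_strategy` in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",     # validation token accuracy
    seed=42,
)
```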

Speeds, Sizes, Times

  • Hardware: 1× NVIDIA Tesla V100 32GB
  • Training time: ~6.5 hours (for 4 epochs)
  • Model size: ~680 MB (PyTorch checkpoint)
  • Inference speed: ~150 sentences/sec on V100 (batch size 16)

Evaluation

Testing Data, Factors & Metrics

Testing Data

The test set consists of 6,000 sentences (held‑out, not seen during training) from the same corpus. It contains 47,335 tokens that are present in the tag vocabulary (i.e., evaluable tokens).

Metrics

We report standard token‑level classification metrics computed only on tokens that belong to the model’s tag set (others are ignored):

  • Token Accuracy – proportion of correctly predicted tags.
  • Precision / Recall / F1 (micro) – micro‑averaged over all tags.
  • F1 (macro) – macro‑average over tags (treats each tag equally, irrespective of frequency).
  • Confidence intervals – 95% bootstrap intervals (1,000 iterations).

Detailed per‑POS accuracies are available in the results/pos_accuracy.csv file of this repository.
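The bootstrap interval can be sketched as resampling per-token correctness indicators (`bootstrap_ci` is illustrative; the repository's exact procedure may differ):

```python
import random

def bootstrap_ci(correct, n_boot=1000, alpha=0.05, seed=42):
    """95% percentile bootstrap interval for token accuracy.

    `correct` is a list of 0/1 indicators, one per evaluable token.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: 98 of 100 tokens tagged correctly.
lo, hi = bootstrap_ci([1] * 98 + [0] * 2)
print(f"[{lo:.3f}, {hi:.3f}]")
```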

Results

| Metric            | Value  | 95% CI             |
|-------------------|--------|--------------------|
| Token Accuracy    | 0.9813 | [0.9801, 0.9825]   |
| F1 (micro)        | 0.9813 | [0.9802, 0.9825]   |
| F1 (macro)        | 0.4737 | [0.4524, 0.4945]   |
| Precision (micro) | 0.9813 | same as F1 (micro) |
| Recall (micro)    | 0.9813 | same as F1 (micro) |

Performance by part‑of‑speech (top 5 frequent POS):

| POS   | Accuracy |
|-------|----------|
| PUNCT | 1.0000   |
| NOUN  | 0.9820   |
| VERB  | 0.9759   |
| ADP   | 0.9951   |
| ADJ   | 0.9635   |

Full POS breakdown is available in results/pos_accuracy.csv.

Summary

RuBERT achieves near‑perfect accuracy on punctuation and very high accuracy on content words, demonstrating strong transfer from Russian to Tatar morphology. The macro F1 is lower because rare tags (e.g., certain combinations of affixes) are harder to predict. Overall, this model is among the best in our series (see comparison in results/model_comparison.png).
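The micro/macro gap has a simple mechanical explanation: micro averaging is dominated by frequent tags, while macro averaging weights every tag equally. A toy reproduction (`f1_scores` is an illustrative helper; note that with exactly one label per token, micro F1 equals token accuracy, which is why those two numbers coincide above):

```python
# One frequent tag predicted perfectly, one rare tag never predicted:
# micro F1 stays high, macro F1 collapses toward 0.5.

def f1_scores(gold, pred):
    tags = sorted(set(gold) | set(pred))
    per_tag = []
    for t in tags:
        tp = sum(g == p == t for g, p in zip(gold, pred))
        fp = sum(p == t != g for g, p in zip(gold, pred))
        fn = sum(g == t != p for g, p in zip(gold, pred))
        per_tag.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    micro = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # == accuracy
    macro = sum(per_tag) / len(per_tag)
    return micro, macro

gold = ["N"] * 98 + ["RARE"] * 2   # rare tag: 2% of tokens
pred = ["N"] * 100                 # model never predicts RARE
micro, macro = f1_scores(gold, pred)
print(micro, round(macro, 4))
```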

Model Examination

We performed a manual error analysis on 100 randomly selected errors. The main error categories:

  1. Rare tag combinations – e.g., verbal forms with multiple affixes that appear only a handful of times.
  2. Ambiguous segmentation – cases where the tokenizer splits a word into subwords and the label alignment fails (rare, <1% of errors).
  3. Out‑of‑vocabulary stems – words unseen during training, leading to guessing.

The model rarely confuses major POS categories; most errors are subtle distinctions within the same POS (e.g., wrong case/number).

Citation

BibTeX:

@misc{tatar-morph-rubert,
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
  title = {RuBERT for Tatar Morphological Analysis},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/TatarNLPWorld/tatar-morph-rubert}}
}

APA:

Arabov, M. K., & TatarNLPWorld. (2026). RuBERT for Tatar Morphological Analysis [Model]. Hugging Face. https://huggingface.co/TatarNLPWorld/tatar-morph-rubert

Model Card Authors

Arabov Mullosharaf Kurbonovich (TatarNLPWorld)

Model Card Contact

https://huggingface.co/TatarNLPWorld