DistilBERT Kabyle/Tachelhit Dialect Classifier

This model is a fine-tuned version of distilbert-base-multilingual-cased that serves as a high-precision dialect filter. Its primary purpose is to remove mislabeled Moroccan Tachelhit (Tashelhit/Tacelḥit) text blocks from web-scraped Algerian Kabyle (Teqbaylit) corpora, targeting data-cleaning pipelines such as HPLT 3.0.

Model Description

Developed by: boffire
Language(s) (NLP): Kabyle (kab_Latn), Tachelhit (shi_Latn)
Model Type: Sequence Classification (Binary)
Base Model: distilbert-base-multilingual-cased
License: MIT

Intended Uses & Limitations

Intended Use

This model is designed to automate dataset sanitization. Given raw text, it relies on lexical and grammatical markers (e.g., Kabyle deg/di/nneɣ vs. Tachelhit ɣ/tmdint) to separate the two Amazigh variants, so that each text block can be kept or filtered out.
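
A minimal sketch of such a sanitization step is shown below. The helper name keep_kabyle is hypothetical and not part of this model or of transformers; the label mapping (LABEL_1 = Kabyle, LABEL_0 = Tachelhit) is taken from the Quickstart outputs in this card. The classifier argument is assumed to be any callable that behaves like a transformers text-classification pipeline, i.e. one that returns a dict with "label" and "score" for each input string.

```python
from typing import Callable, Iterable, List

# Label mapping taken from the Quickstart outputs in this card:
# LABEL_1 = Kabyle (keep), LABEL_0 = Tachelhit (filter out).
KABYLE_LABEL = "LABEL_1"


def keep_kabyle(texts: Iterable[str], classifier: Callable) -> List[str]:
    """Return only the texts that the classifier labels as Kabyle.

    Hypothetical helper: `classifier` is assumed to take a list of strings
    and return one {"label": ..., "score": ...} dict per string.
    """
    texts = list(texts)
    if not texts:
        return []
    predictions = classifier(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == KABYLE_LABEL
    ]
```

In practice, the classifier object created in the Quickstart section could be passed as the second argument.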

Known Limitations & Boundary Constraints

Although the model is highly precise on well-formed text, developers should account for the following constraints when deploying it at scale:

Context-Length Dependency: The classification boundary relies heavily on contextual density and adjacent particle distribution (e.g., detecting deg vs ɣ). Consequently, accuracy may degrade significantly when evaluating short, isolated text phrases, web navigation tokens, or single-word inputs.
Deterministic Labeling Biases: Because the training data was labeled through weak supervision (automated heuristic rules), the model may have blind spots for highly non-standardized orthographies, colloquial social media variants, or regional typos that lack the target anchor tokens.
Binary Forcing of Out-of-Domain Dialects: This model is strictly a binary classifier (num_labels=2) trained to separate Algerian Kabyle from Moroccan Tachelhit. If it is exposed to other Amazigh regional variants (such as Chaoui, Mozabite, or Rifian), it will still assign them to one of the two target classes, potentially introducing false positives into downstream datasets.
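
One way to soften these constraints is to send short or low-confidence inputs to manual review instead of trusting the binary decision. The sketch below is illustrative only: the function name triage and the thresholds min_words=5 and min_score=0.9 are assumptions, not values recommended by the model author, and a confidence threshold is only a coarse guard because an out-of-domain dialect can still receive a high score.

```python
from typing import Callable, Dict, Iterable, List

KABYLE_LABEL = "LABEL_1"  # per the Quickstart outputs in this card


def triage(
    texts: Iterable[str],
    classifier: Callable,
    min_words: int = 5,      # assumed cutoff for "too short to classify"
    min_score: float = 0.9,  # assumed confidence cutoff
) -> Dict[str, List[str]]:
    """Split texts into keep / drop / review buckets.

    `classifier` is assumed to behave like a transformers
    text-classification pipeline called on a list of strings.
    """
    buckets: Dict[str, List[str]] = {"keep": [], "drop": [], "review": []}

    # Short inputs are never classified; they go straight to review.
    long_enough: List[str] = []
    for text in texts:
        if len(text.split()) < min_words:
            buckets["review"].append(text)
        else:
            long_enough.append(text)

    if long_enough:
        for text, pred in zip(long_enough, classifier(long_enough)):
            if pred["score"] < min_score:
                buckets["review"].append(text)   # model is unsure
            elif pred["label"] == KABYLE_LABEL:
                buckets["keep"].append(text)     # confident Kabyle
            else:
                buckets["drop"].append(text)     # confident Tachelhit
    return buckets
```

Note that the Tachelhit example in the Quickstart scores about 0.81, so it would land in the review bucket under the assumed 0.9 cutoff; thresholds should be tuned on held-out data.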

Training Performance Baselines

The model was trained on a class-balanced, weakly supervised dataset on a T4 GPU and converged rapidly, yielding the following validation metrics:

Validation Accuracy: 96.10%
Precision: 97.38%
Recall: 96.75%
F1-Score: 97.06%
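
As a quick consistency check, the reported F1-Score can be reproduced from the reported precision and recall using the standard harmonic-mean definition. The card does not state which class is treated as positive, so this only verifies that the three numbers agree with each other.

```python
# Reported validation metrics from this card, as fractions.
precision = 0.9738
recall = 0.9675

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2%}")  # prints "F1 = 97.06%"
```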

Quickstart Usage

You can load and test this model in Python with the Hugging Face transformers pipeline:

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="boffire/distilbert-kabyle-tachelhit-classifier")

# Test Phrase 1: Kabyle Text
print(classifier("Tutlayt n yemma d babba d ayen akk i d-neǧǧa i warraw-nneɣ."))
# Output: [{'label': 'LABEL_1', 'score': 0.9527423977851868}] -> True Kabyle (Keep)

# Test Phrase 2: Tachelhit Text
print(classifier("Rad darnɣ yili yan unmuggar ɣ tgmmi nns taggʷat ad."))
# Output: [{'label': 'LABEL_0', 'score': 0.8063703775405884}] -> Mislabeled Tachelhit (Filter Out)

Model Size: 0.1B params (Safetensors, F32)