DistilBERT Kabyle/Tachelhit Dialect Classifier
This model is a fine-tuned version of distilbert-base-multilingual-cased that acts as a high-precision language-identification filter. Its primary purpose is to remove mislabeled Moroccan Tachelhit (Tashelhit/Tacelḥit) text blocks from web-scraped Algerian Kabyle (Teqbaylit) corpora, specifically in data cleaning pipelines such as HPLT 3.0.
Model Description
- Developed by: boffire
- Language(s) (NLP): Kabyle (kab_Latn), Tachelhit (shi_Latn)
- Model Type: Sequence Classification (Binary)
- Base Model: distilbert-base-multilingual-cased
- License: MIT
Intended Uses & Limitations
Intended Use
This model is designed to automate dataset sanitization. Given raw text, it picks up on lexical and grammatical cues (e.g., Kabyle deg/di/nneɣ vs. Tachelhit ɣ/tmdint) to separate the two Amazigh variants, so that Tachelhit blocks can be isolated from a Kabyle corpus.
Known Limitations & Boundary Constraints
While this model exhibits high precision on structured text corpora, developers should account for the following constraints when deploying it at scale:
- Context-Length Dependency: The classification boundary relies heavily on contextual density and adjacent particle distribution (e.g., detecting deg vs. ɣ). Consequently, accuracy may degrade significantly when evaluating short, isolated text phrases, web navigation tokens, or single-word inputs.
- Deterministic Labeling Biases: Because the training pipeline utilized weak supervision (automated heuristic rule-bases) to collect data, the model may manifest minor blind spots for highly non-standardized orthographies, colloquial social media variants, or regional typos that lack the target anchor tokens.
- Binary Forcing of Out-of-Domain Dialects: This model operates strictly as a binary classifier (num_labels=2) trained to separate Algerian Kabyle from Moroccan Tachelhit. If it is exposed to other Amazigh varieties (such as Chaoui, Mozabite, or Rifian), it will still assign them to one of the two target classes, potentially introducing false positives into downstream datasets.
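One way to soften the limitations above is to send short or low-confidence inputs to manual review instead of trusting the forced binary label. The sketch below is illustrative only: the function name `route` and both thresholds (`min_tokens`, `min_score`) are hypothetical and were not tuned against this model. A low softmax score is also only a rough signal, so this does not reliably detect out-of-domain varieties.

```python
def route(text, prediction, min_tokens=5, min_score=0.90):
    """Return 'keep', 'drop', or 'review' for one classified text block.

    `prediction` is a dict shaped like the pipeline output,
    e.g. {"label": "LABEL_1", "score": 0.95}.
    Both thresholds are hypothetical defaults, not tuned values.
    """
    # Short inputs carry too few contextual cues to trust either label.
    if len(text.split()) < min_tokens:
        return "review"
    # Low-confidence predictions are set aside rather than forced into a class.
    if prediction["score"] < min_score:
        return "review"
    # LABEL_1 is Kabyle and LABEL_0 is Tachelhit, per the quickstart outputs.
    return "keep" if prediction["label"] == "LABEL_1" else "drop"
```

With these defaults, a single-word input is always routed to review regardless of its score, which matches the context-length caveat above.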
Training Performance Baselines
The model converged quickly on a T4 GPU when trained on a balanced, weakly supervised dataset, yielding the following validation metrics:
- Validation Accuracy: 96.10%
- Precision: 97.38%
- Recall: 96.75%
- F1-Score: 97.06%
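As a quick sanity check, the reported F1 score can be recomputed from the precision and recall listed above, since F1 is their harmonic mean. The snippet uses only the figures from this card.

```python
# Precision and recall as reported in the validation metrics above.
precision = 0.9738
recall = 0.9675

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2%}")  # prints "F1 = 97.06%"
```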
Quickstart Usage
You can load and test this model in Python with the Hugging Face transformers pipeline:
from transformers import pipeline
# Load the classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="boffire/distilbert-kabyle-tachelhit-classifier")
# Test Phrase 1: Kabyle Text
print(classifier("Tutlayt n yemma d babba d ayen akk i d-neǧǧa i warraw-nneɣ."))
# Output: [{'label': 'LABEL_1', 'score': 0.9527423977851868}] -> True Kabyle (Keep)
# Test Phrase 2: Tachelhit Text
print(classifier("Rad darnɣ yili yan unmuggar ɣ tgmmi nns taggʷat ad."))
# Output: [{'label': 'LABEL_0', 'score': 0.8063703775405884}] -> Mislabeled Tachelhit (Filter Out)
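For corpus cleaning, the same pipeline can be applied to a list of text blocks, keeping only those labeled as Kabyle. The following is a minimal sketch, not the author's cleaning pipeline: the helper names `load_classifier` and `keep_kabyle` are hypothetical, and the LABEL_1 = Kabyle mapping is taken from the quickstart outputs above.

```python
from typing import Callable, Iterable, List

# Mapping inferred from the quickstart outputs: LABEL_1 = Kabyle, LABEL_0 = Tachelhit.
KABYLE_LABEL = "LABEL_1"


def load_classifier():
    """Load the published checkpoint (downloads the weights on first call)."""
    # Imported lazily so keep_kabyle can be exercised without transformers installed.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model="boffire/distilbert-kabyle-tachelhit-classifier",
    )


def keep_kabyle(texts: Iterable[str], classify: Callable) -> List[str]:
    """Return only the texts that the classifier labels as Kabyle.

    `classify` takes a list of strings and returns one
    {"label": ..., "score": ...} dict per input, as the pipeline does.
    """
    texts = list(texts)
    if not texts:
        return []
    predictions = classify(texts)
    return [t for t, p in zip(texts, predictions) if p["label"] == KABYLE_LABEL]
```

A typical call would be `kept = keep_kabyle(corpus, load_classifier())`, where `corpus` is any iterable of strings. Passing the classifier in as an argument keeps the filtering logic easy to test with a stub.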