Training a GPT-2 Language Model from Scratch for Moroccan Darija: An Educational Experiment in Low-Resource NLP
Affiliation: Typica.ai – AI Research for Underserved Languages
Abstract
This paper reports on a research experiment conducted in October 2024 to train a GPT-2 language model entirely from scratch for Moroccan Darija, a spoken Arabic dialect lacking standardization, resources, and NLP tools. Using a single NVIDIA A100 GPU, we trained the model on a monolingual dataset of ~14.5M examples. While this approach is not optimal for producing state-of-the-art results compared to transfer learning, we present it as an educational and exploratory investigation into what training from scratch can reveal about data quality, linguistic representation, and LLM behavior in low-resource settings. We outline the methodology, decoding strategies, evaluation results, and lessons learned, with a focus on typological insights, prompt sensitivity, and cultural adaptation.
1. Introduction
The rise of large language models (LLMs) has opened exciting possibilities for multilingual and cross-cultural AI. However, most LLMs are trained on high-resource languages and standardized language varieties. Moroccan Darija, spoken by over 35 million people, remains almost entirely absent from major language model benchmarks.
This paper presents an educational experiment in which a GPT-2 model was trained from scratch on Darija using limited but real-world constraints. We intentionally avoided using pre-trained Arabic or multilingual models to investigate the fundamentals of language modeling when no base model exists and to understand how much a model can learn purely from native data, however limited or noisy.
Our research goal was not to compete with large-scale LLMs, but to:
- Explore the boundaries of representation with Darija-only pretraining
- Assess the feasibility of building dialect-specific LMs with minimal infrastructure
- Gain insight into how language models internalize informal, non-standardized languages
This effort is aligned with Typica.ai's mission to develop culturally grounded AI solutions for underserved linguistic communities.
2. Motivation for Training from Scratch
In modern NLP, pretraining from scratch is often discouraged because of its compute costs, data requirements, and performance disadvantages compared to fine-tuning large base models (e.g., LLaMA, Mistral, AraGPT2).
However, training from scratch offers several research benefits, especially in educational and exploratory contexts:
- It provides full control over tokenizer, vocabulary, and initialization, which is critical for dialects like Darija with unique orthographic and phonetic patterns.
- It avoids bias inherited from other dialects or formal Arabic (MSA) that might pollute downstream behavior.
- It surfaces issues in data quality, overfitting, and convergence more clearly, giving researchers a transparent view of the modeling dynamics.
Moreover, the lack of high-quality Darija base models meant that even fine-tuning was of limited value. Our goal was to produce a minimum viable pre-trained model capable of learning Darija grammar, lexical patterns, and cultural cues, even if imperfectly.
3. Dataset and Preprocessing
3.1 Data Collection
We curated a custom corpus of ~14.5 million Darija text samples, drawn from informal Moroccan forums, social platforms, and public websites. The data reflects a wide spectrum of domains (politics, religion, lifestyle), though with noticeable style redundancy.
3.2 Data Characteristics
- Highly informal and inconsistent orthography
- Frequent code-switching with French and Modern Standard Arabic (MSA)
- Repetitive structures in news and social content
3.3 Dataset Statistics
DatasetDict({
train: 14,557,876 examples,
validation: 574,653 examples,
test: 191,551 examples
})
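For concreteness, the split can be expressed with the Hugging Face datasets library. The following is a minimal sketch, not the exact pipeline used in the experiment: the file name is illustrative, and the split ratios are chosen only to approximate the proportions reported above.

```python
from datasets import load_dataset, DatasetDict

# Minimal sketch (file name is illustrative): load the cleaned Darija corpus and
# carve out held-out validation and test sets in roughly the proportions above.
raw = load_dataset("text", data_files={"train": "darija_corpus_clean.txt"})["train"]

splits = raw.train_test_split(test_size=0.05, seed=42)               # ~5% held out
heldout = splits["test"].train_test_split(test_size=0.25, seed=42)   # 25% of held-out -> test

dataset = DatasetDict({
    "train": splits["train"],
    "validation": heldout["train"],
    "test": heldout["test"],
})
```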
4. Model Architecture and Training Setup
4.1 Configuration
- Base Architecture: GPT-2 (117M parameter equivalent)
- Tokenizer: GPT2TokenizerFast, trained from scratch on the Darija dataset
- Block size: 1024
- Vocabulary size: 49,152
- Batch size: 32
- Precision: FP16 (mixed)
- Max training steps: 520,000
- Eval/Save frequency: every 5,000 steps
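The configuration above maps onto the Hugging Face API roughly as follows. This is an illustrative sketch, not the exact training script: the save path is an assumption, and `dataset` refers to the splits sketched in Section 3.3.

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

# Learn a new 49,152-entry byte-level BPE vocabulary on the Darija corpus,
# reusing the GPT-2 tokenizer class but none of its original merges.
base_tok = AutoTokenizer.from_pretrained("gpt2")

def corpus_iterator(batch_size=1000):
    for i in range(0, len(dataset["train"]), batch_size):
        yield dataset["train"][i : i + batch_size]["text"]

tokenizer = base_tok.train_new_from_iterator(corpus_iterator(), vocab_size=49_152)
tokenizer.save_pretrained("gpt2-darija-tokenizer")  # illustrative path

# GPT-2 small configuration (the ~117M-parameter class mentioned above),
# randomly initialized, with a 1024-token context window.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=1024,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
```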
4.2 Hardware and Environment
- GPU: NVIDIA A100 (40GB VRAM)
- Compute: Single-GPU setup (no model/data parallelism)
- Framework: Hugging Face Transformers (Trainer API)
Training duration: ~3 weeks (non-continuous sessions with checkpointing).
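A hedged sketch of the corresponding Trainer setup is shown below. The output directory is an assumption, the packing function is illustrative, and the exact arguments used in the experiment may differ.

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Tokenize and pack the corpus into contiguous 1024-token blocks for causal-LM training.
def tokenize_and_pack(examples):
    ids = tokenizer(examples["text"])["input_ids"]
    concat = [tok for seq in ids for tok in seq + [tokenizer.eos_token_id]]
    total = (len(concat) // 1024) * 1024
    return {"input_ids": [concat[i : i + 1024] for i in range(0, total, 1024)]}

tokenized = dataset.map(tokenize_and_pack, batched=True, remove_columns=["text"])

tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-darija",        # illustrative path
    per_device_train_batch_size=32,
    max_steps=520_000,
    evaluation_strategy="steps",
    eval_steps=5_000,
    save_steps=5_000,
    fp16=True,                       # mixed precision on the A100
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()  # pass resume_from_checkpoint=True to resume later sessions
```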
5. Evaluation Methodology
5.1 Inference Parameters
- Beam Search (num_beams=5)
- Top-k Sampling (top_k=50)
- Top-p Sampling (top_p=0.9)
- Temperature variation (0.6 – 1.0)
- Best-of-N generation with multiple return sequences
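These decoding settings translate into generate() calls roughly as follows; a minimal sketch, assuming a trained checkpoint exists at the illustrative paths below, with the prompt left as a placeholder.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the trained checkpoint (paths are illustrative).
tokenizer = AutoTokenizer.from_pretrained("gpt2-darija-tokenizer")
model = GPT2LMHeadModel.from_pretrained("gpt2-darija/checkpoint-520000").eval()

prompt = "..."  # e.g., the Darija prompt shown in Section 5.2
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search: deterministic, favours globally high-probability continuations.
beam_out = model.generate(**inputs, max_new_tokens=120, num_beams=5, early_stopping=True)

# Top-k / nucleus sampling with moderate temperature; best-of-N via num_return_sequences.
sample_out = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.8,
    num_return_sequences=5,
)

for seq in sample_out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```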
5.2 Prompt Design
Prompts were manually designed to reflect culturally grounded narratives and common Darija phrasing.
Prompt:
الطقس اليوم فالمغرب كان مشمس ودافي، والناس قررو يخرجو للحديقة باش يدوزو وقت زوين مع العائلة. فاش وصلو،
(English translation: "The weather today in Morocco was sunny and warm, and people decided to go out to the park to spend a nice time with the family. When they arrived,")
Generated Output (Beam Search):
لقاو الجو بارد بزاف، خصوصا أن الشتا كانت خيط من السما، والنهار بدا كيطيح، وبداو الناس كيخرجو من ديورهم، حتى وصلو للشارع الرئيسي ديال المدينة، وبقاو غير الناس قلال.
"بالصور. الشتا بدات كتطيح.. الشتا مجهدة بزاف"
(English translation: "They found the weather very cold, especially since the rain was coming down from the sky in a steady stream, and the daylight was starting to fade; people began coming out of their houses until they reached the city's main street, and only a few people remained." The quoted line that follows, "In pictures: the rain has started falling.. the rain is very heavy," is a headline-style fragment produced by the model.)
5.3 Human Evaluation
Evaluated based on:
- Fluency (linguistic smoothness)
- Coherence (semantic continuity)
- Cultural relevance (local references, idiomatic accuracy)
6. Key Insights
✅ Successes
- Learned contextual narrative structure and local expressions
- Beam search produced surprisingly strong coherence
- Avoided contamination from MSA-style overfitting
- Reflected social and cultural cues from Moroccan daily life
❌ Limitations
This experiment, while encouraging in terms of cultural grounding and surface-level fluency, revealed several limitations that are characteristic of small, from-scratch LLMs trained in low-resource settings:
Narrative Coherence Breakdown
The model is able to follow short-term context, but often loses coherence beyond 60–80 tokens. Transitions between events can become abrupt or logically inconsistent, especially in open-ended prompts with multiple actors or locations.
Repetitive and Redundant Phrasing
Generated outputs frequently repeat ideas or phrases, suggesting overfitting to high-frequency patterns in the training data. This is a common issue when datasets lack stylistic and structural variety.
Template Leakage from Training Sources
The model shows signs of "template leakage," especially from news-style or headline-driven data. While this mimicking behavior may enhance fluency, it reduces contextual appropriateness in informal or narrative settings.
Shallow World Modeling
Without curated or diverse training examples, the model struggles to simulate plausible real-world events. It tends to generate surface-level sequences rather than imaginative or causally consistent progressions.
Orthographic Sensitivity
Due to the lack of standardized spelling in Darija, tokenization inefficiencies were observed. The model’s understanding of semantically equivalent variants is weak, leading to fragmented token representations and reduced generalization.
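One way to illustrate this effect is to compare how the from-scratch tokenizer splits common spelling variants of the same word. The variants and tokenizer path below are illustrative examples, not measurements from the experiment.

```python
from transformers import AutoTokenizer

# Illustrative sketch: load the from-scratch Darija tokenizer (hypothetical path).
tokenizer = AutoTokenizer.from_pretrained("gpt2-darija-tokenizer")

# The same word ("a lot") written in different common ways can be split into
# unrelated subword sequences, so the model never learns they are equivalent.
for variant in ["بزاف", "بزااف", "bzaf"]:
    print(variant, "->", tokenizer.tokenize(variant))
```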
Evaluation Scope
Current evaluation is primarily qualitative and based on human judgment. While this provides insights into fluency and cultural relevance, further quantitative evaluation (e.g., perplexity, coverage metrics) would be needed for benchmark comparison.
7. Conclusion and Future Work
This experiment demonstrates that training LLMs from scratch for dialectal, low-resource languages is feasible with a single A100 GPU and a clean training pipeline. While the model is not ready for deployment, it serves as a valuable research artifact.
Future directions:
- Instruction tuning for downstream QA/chat tasks
- Improved dataset curation with spelling normalization
- Incorporating audio-aligned transcriptions (spoken Darija)
- Scaling to multilingual Darija–French–Arabic models
At Typica.ai, we believe these foundational experiments are essential for building inclusive, locally relevant NLP systems.
Acknowledgments
This work was conducted as part of Typica.ai’s research stream focused on AI for underserved languages. Thanks to the Hugging Face community and all open-source contributors.
Author Bio
Hicham Assoudi is an AI researcher, Oracle expert, author and founder of Typica.ai, a startup committed to building NLP tools for low-resource languages. He holds a Ph.D. in Artificial Intelligence and is an External Research Associate at UQAM's AI Lab (CRIA) in Montreal.
Contact
For questions, collaborations, or feedback, feel free to reach out:
📧 Email: assoudi@typica.ai
🌐 Website: https://typica.ai
🔗 LinkedIn: linkedin.com/in/assoudi