Training a GPT-2 Language Model from Scratch for Moroccan Darija: An Educational Experiment in Low-Resource NLP
Affiliation: Typica.ai – AI Research for Underserved Languages
Abstract
This paper reports on a research experiment conducted in October 2024 to train a GPT-2 language model entirely from scratch for Moroccan Darija, a spoken Arabic dialect lacking standardization, resources, and NLP tools. Using a single NVIDIA A100 GPU, we trained the model on a monolingual dataset of ~14.5M examples. While this approach is not optimal for producing state-of-the-art results compared to transfer learning, we present it as an educational and exploratory investigation into what training from scratch can reveal about data quality, linguistic representation, and LLM behavior in low-resource settings. We outline the methodology, decoding strategies, evaluation results, and lessons learned, with a focus on typological insights, prompt sensitivity, and cultural adaptation.
1. Introduction
The rise of large language models (LLMs) has opened exciting possibilities for multilingual and cross-cultural AI. However, most LLMs are trained on high-resource languages and standardized language varieties. Moroccan Darija, spoken by over 35 million people, remains almost entirely absent from major language model benchmarks.
This paper presents an educational experiment in which a GPT-2 model was trained from scratch on Darija using limited but real-world constraints. We intentionally avoided using pre-trained Arabic or multilingual models to investigate the fundamentals of language modeling when no base model exists and to understand how much a model can learn purely from native data, however limited or noisy.
Our research goal was not to compete with large-scale LLMs, but to:
- Explore the boundaries of representation with Darija-only pretraining
- Assess the feasibility of building dialect-specific LMs with minimal infrastructure
- Gain insight into how language models internalize informal, non-standardized languages
This effort is aligned with Typica.ai's mission to develop culturally grounded AI solutions for underserved linguistic communities.
2. Motivation for Training from Scratch
In modern NLP, pretraining from scratch is often discouraged because of its compute costs, data requirements, and performance disadvantages compared to fine-tuning large base models (e.g., LLaMA, Mistral, AraGPT2).
However, training from scratch offers several research benefits, especially in educational and exploratory contexts:
- It provides full control over tokenizer, vocabulary, and initialization, which is critical for dialects like Darija with unique orthographic and phonetic patterns.
- It avoids bias inherited from other dialects or formal Arabic (MSA) that might pollute downstream behavior.
- It surfaces issues in data quality, overfitting, and convergence more clearly, giving researchers a transparent view of the modeling dynamics.
Moreover, the lack of high-quality Darija base models meant that even fine-tuning was of limited value. Our goal was to produce a minimum viable pre-trained model capable of learning Darija grammar, lexical patterns, and cultural cues, even if imperfectly.
3. Dataset and Preprocessing
3.1 Data Collection
We curated a custom corpus of ~14.5 million Darija text samples, drawn from informal Moroccan forums, social platforms, and public websites. The data reflects a wide spectrum of domains (politics, religion, lifestyle), though with noticeable style redundancy.
3.2 Data Characteristics
- Highly informal and inconsistent orthography
- Frequent code-switching with French and Modern Standard Arabic (MSA)
- Repetitive structures in news and social content
3.3 Dataset Statistics
DatasetDict({
train: 14,557,876 examples,
validation: 574,653 examples,
test: 191,551 examples
})
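For concreteness, the split can be expressed with the Hugging Face datasets library. The following is a minimal sketch, not the exact pipeline used in the experiment: the file name is illustrative, and the split ratios are chosen only to approximate the proportions reported above.

```python
from datasets import load_dataset, DatasetDict

# Minimal sketch (file name is illustrative): load the cleaned Darija corpus and
# carve out held-out validation and test sets in roughly the proportions above.
raw = load_dataset("text", data_files={"train": "darija_corpus_clean.txt"})["train"]

splits = raw.train_test_split(test_size=0.05, seed=42)               # ~5% held out
heldout = splits["test"].train_test_split(test_size=0.25, seed=42)   # 25% of held-out -> test

dataset = DatasetDict({
    "train": splits["train"],
    "validation": heldout["train"],
    "test": heldout["test"],
})
```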
4. Model Architecture and Training Setup
4.1 Configuration
- Base Architecture: GPT-2 (117M parameter equivalent)
- Tokenizer: GPT2TokenizerFast, trained from scratch on the Darija dataset
- Block size: 1024
- Vocabulary size: 49,152
- Batch size: 32
- Precision: FP16 (mixed)
- Max training steps: 520,000
- Eval/Save frequency: every 5,000 steps
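The configuration above maps onto the Hugging Face API roughly as follows. This is an illustrative sketch, not the exact training script: the save path is an assumption, and `dataset` refers to the splits sketched in Section 3.3.

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

# Learn a new 49,152-entry byte-level BPE vocabulary on the Darija corpus,
# reusing the GPT-2 tokenizer class but none of its original merges.
base_tok = AutoTokenizer.from_pretrained("gpt2")

def corpus_iterator(batch_size=1000):
    for i in range(0, len(dataset["train"]), batch_size):
        yield dataset["train"][i : i + batch_size]["text"]

tokenizer = base_tok.train_new_from_iterator(corpus_iterator(), vocab_size=49_152)
tokenizer.save_pretrained("gpt2-darija-tokenizer")  # illustrative path

# GPT-2 small configuration (the ~117M-parameter class mentioned above),
# randomly initialized, with a 1024-token context window.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=1024,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
```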
4.2 Hardware and Environment
- GPU: NVIDIA A100 (40GB VRAM)
- Compute: Single-GPU setup (no model/data parallelism)
- Framework: Hugging Face Transformers (Trainer API)
Training duration: ~3 weeks (non-continuous sessions with checkpointing).
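A hedged sketch of the corresponding Trainer setup is shown below. The output directory is an assumption, the packing function is illustrative, and the exact arguments used in the experiment may differ.

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Tokenize and pack the corpus into contiguous 1024-token blocks for causal-LM training.
def tokenize_and_pack(examples):
    ids = tokenizer(examples["text"])["input_ids"]
    concat = [tok for seq in ids for tok in seq + [tokenizer.eos_token_id]]
    total = (len(concat) // 1024) * 1024
    return {"input_ids": [concat[i : i + 1024] for i in range(0, total, 1024)]}

tokenized = dataset.map(tokenize_and_pack, batched=True, remove_columns=["text"])

tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-darija",        # illustrative path
    per_device_train_batch_size=32,
    max_steps=520_000,
    evaluation_strategy="steps",
    eval_steps=5_000,
    save_steps=5_000,
    fp16=True,                       # mixed precision on the A100
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()  # pass resume_from_checkpoint=True to resume later sessions
```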
5. Evaluation Methodology
5.1 Inference Parameters
- Beam Search (num_beams=5)
- Top-k Sampling (top_k=50)
- Top-p Sampling (top_p=0.9)
- Temperature variation (0.6 – 1.0)
- Best-of-N generation with multiple return sequences
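These decoding settings translate into generate() calls roughly as follows; a minimal sketch, assuming a trained checkpoint exists at the illustrative paths below, with the prompt left as a placeholder.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the trained checkpoint (paths are illustrative).
tokenizer = AutoTokenizer.from_pretrained("gpt2-darija-tokenizer")
model = GPT2LMHeadModel.from_pretrained("gpt2-darija/checkpoint-520000").eval()

prompt = "..."  # e.g., the Darija prompt shown in Section 5.2
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search: deterministic, favours globally high-probability continuations.
beam_out = model.generate(**inputs, max_new_tokens=120, num_beams=5, early_stopping=True)

# Top-k / nucleus sampling with moderate temperature; best-of-N via num_return_sequences.
sample_out = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.8,
    num_return_sequences=5,
)

for seq in sample_out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```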
5.2 Prompt Design
Prompts were manually designed to reflect culturally grounded narratives and common Darija phrasing.
Prompt:
الطقس اليوم فالمغرب كان مشمس ودافي، والناس قررو يخرجو للحديقة باش يدوزو وقت زوين مع العائلة. فاش وصلو،
(English translation: "The weather today in Morocco was sunny and warm, and people decided to go out to the park to spend a nice time with the family. When they arrived,")
Generated Output (Beam Search):
لقاو الجو بارد بزاف، خصوصا أن الشتا كانت خيط من السما، والنهار بدا كيطيح، وبداو الناس كيخرجو من ديورهم، حتى وصلو للشارع الرئيسي ديال المدينة، وبقاو غير الناس قلال.
"بالصور. الشتا بدات كتطيح.. الشتا مجهدة بزاف"
(English translation: "They found the weather very cold, especially since the rain was coming down from the sky in a steady stream, and the daylight was starting to fade; people began coming out of their houses until they reached the city's main street, and only a few people remained." The quoted line that follows, "In pictures: the rain has started falling.. the rain is very heavy," is a headline-style fragment produced by the model.)
5.3 Human Evaluation
Evaluated based on:
- Fluency (linguistic smoothness)
- Coherence (semantic continuity)
- Cultural relevance (local references, idiomatic accuracy)
6. Key Insights
✅ Successes
- Learned contextual narrative structure and local expressions
- Beam search produced surprisingly strong coherence
- Avoided contamination from MSA-style overfitting
- Reflected social and cultural cues from Moroccan daily life
❌ Limitations
This experiment, while encouraging in terms of cultural grounding and surface-level fluency, revealed several limitations that are characteristic of small, from-scratch LLMs trained in low-resource settings:
Narrative Coherence Breakdown
The model is able to follow short-term context, but often loses coherence beyond 60–80 tokens. Transitions between events can become abrupt or logically inconsistent, especially in open-ended prompts with multiple actors or locations.
Repetitive and Redundant Phrasing
Generated outputs frequently repeat ideas or phrases, suggesting overfitting to high-frequency patterns in the training data. This is a common issue when datasets lack stylistic and structural variety.
Template Leakage from Training Sources
The model shows signs of "template leakage," especially from news-style or headline-driven data. While this mimicking behavior may enhance fluency, it reduces contextual appropriateness in informal or narrative settings.
Shallow World Modeling
Without curated or diverse training examples, the model struggles to simulate plausible real-world events. It tends to generate surface-level sequences rather than imaginative or causally consistent progressions.
Orthographic Sensitivity
Due to the lack of standardized spelling in Darija, tokenization inefficiencies were observed. The model’s understanding of semantically equivalent variants is weak, leading to fragmented token representations and reduced generalization.
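One way to illustrate this effect is to compare how the from-scratch tokenizer splits common spelling variants of the same word. The variants and tokenizer path below are illustrative examples, not measurements from the experiment.

```python
from transformers import AutoTokenizer

# Illustrative sketch: load the from-scratch Darija tokenizer (hypothetical path).
tokenizer = AutoTokenizer.from_pretrained("gpt2-darija-tokenizer")

# The same word ("a lot") written in different common ways can be split into
# unrelated subword sequences, so the model never learns they are equivalent.
for variant in ["بزاف", "بزااف", "bzaf"]:
    print(variant, "->", tokenizer.tokenize(variant))
```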
Evaluation Scope
Current evaluation is primarily qualitative and based on human judgment. While this provides insights into fluency and cultural relevance, further quantitative evaluation (e.g., perplexity, coverage metrics) would be needed for benchmark comparison.
7. Conclusion and Future Work
This experiment demonstrates that training LLMs from scratch for dialectal, low-resource languages is feasible with a single A100 GPU and a clean training pipeline. While the model is not ready for deployment, it serves as a valuable research artifact.
Future directions:
- Instruction tuning for downstream QA/chat tasks
- Improved dataset curation with spelling normalization
- Incorporating audio-aligned transcriptions (spoken Darija)
- Scaling to multilingual Darija–French–Arabic models
At Typica.ai, we believe these foundational experiments are essential for building inclusive, locally relevant NLP systems.
Acknowledgments
This work was conducted as part of Typica.ai’s research stream focused on AI for underserved languages. Thanks to the Hugging Face community and all open-source contributors.
Author Bio
Hicham Assoudi is an AI researcher, Oracle expert, author and founder of Typica.ai, a startup committed to building NLP tools for low-resource languages. He holds a Ph.D. in Artificial Intelligence and is an External Research Associate at UQAM's AI Lab (CRIA) in Montreal.
Contact
For questions, collaborations, or feedback, feel free to reach out:
📧 Email: assoudi@typica.ai
🌐 Website: https://typica.ai
🔗 LinkedIn: linkedin.com/in/assoudi