SlangGPT: Egyptian Arabic → Modern Standard Arabic (MSA)

SlangGPT is a fine-tuned AraGPT-2-medium model that translates Egyptian Arabic slang/dialect into Modern Standard Arabic (MSA).

It is part of the broader SlangGPT project, an end-to-end Arabic NLP system for dialect translation and translation verification.


📄 Project Resources


🧠 Model Description

SlangGPT is a decoder-only causal language model built on top of:

  • Base model: aubmindlab/aragpt2-medium

The model was fine-tuned on Egyptian Arabic ↔ MSA parallel text using conditional autoregressive training.

Prompt Format

dialect: {input} ↔ msa:

The model generates the Modern Standard Arabic translation autoregressively.
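One common way to set up conditional autoregressive training is to concatenate the prompt and the target translation, then compute the loss only on the target tokens. The sketch below assumes that setup; the card does not say whether prompt tokens were masked during fine-tuning. `toy_tokenize`, `build_example`, and the EOS id are hypothetical stand-ins for the real 64k BPE tokenizer.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_prompt(dialect_text: str) -> str:
    """Format the source sentence with the prompt template from this card."""
    return f"dialect: {dialect_text.strip()} ↔ msa:"


def toy_tokenize(text: str, vocab: dict) -> list:
    """Hypothetical whitespace tokenizer; new words get the next free id."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def build_example(dialect_text: str, msa_text: str, vocab: dict, eos_id: int = 0) -> dict:
    """Concatenate prompt and target, masking prompt positions out of the loss."""
    prompt_ids = toy_tokenize(build_prompt(dialect_text), vocab)
    target_ids = toy_tokenize(msa_text, vocab) + [eos_id]
    return {
        "input_ids": prompt_ids + target_ids,
        # Assumption: only the MSA side (plus EOS) contributes to the loss.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + target_ids,
    }


vocab = {"<eos>": 0}
example = build_example("عايز اكل", "أريد الطعام", vocab)
print(example["input_ids"])
print(example["labels"])
```

Ending the target with EOS is what lets generation stop after a single translation at inference time.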


✨ Key Features

  • Input: Egyptian Arabic slang/dialect
  • Output: Modern Standard Arabic (MSA)
  • Architecture: GPT-2 style decoder-only transformer
  • Tokenizer: BPE tokenizer with 64k vocabulary
  • Context length: 1024 tokens
  • Language: Arabic

⚙️ Training Configuration

| Parameter | Value |
| --- | --- |
| Batch size | 8 (effective 32) |
| Learning rate | 5e-5 |
| Scheduler | Cosine |
| Warmup | 10% |
| Gradient clipping | 1.0 |
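To make the schedule concrete, here is a minimal sketch of a cosine schedule with 10% linear warmup, using the peak learning rate from the table. It assumes the effective batch size of 32 comes from 4 gradient-accumulation steps (the card does not say whether accumulation or multiple devices were used), and `TOTAL_STEPS` is a hypothetical value because the number of epochs is not stated.

```python
import math

PEAK_LR = 5e-5          # "Learning rate" from the table
WARMUP_FRACTION = 0.10  # "Warmup 10%"
TOTAL_STEPS = 1000      # hypothetical; the card does not give the step count


def effective_batch_size(per_step: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return per_step * accumulation_steps * num_devices


def lr_at_step(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup from 0 to PEAK_LR, then cosine decay back to 0."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size(8, accumulation_steps=4))  # assumed accumulation factor
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step))
```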

🎛️ Inference Configuration

| Parameter | Value |
| --- | --- |
| Temperature | 0.7 |
| Top-k | 50 |
| Top-p | 0.92 |
| Repetition penalty | 1.3 |
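The sketch below illustrates how temperature, top-k, and top-p narrow the next-token distribution before sampling. It is a simplified pure-Python illustration, not the library's implementation; `filter_distribution` is a hypothetical helper, and the repetition penalty (which adjusts the logits of already generated tokens) is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.92):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    probs = softmax([scaled[i] for i in ranked])
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}


toy_logits = [4.0, 3.0, 1.0, 0.5, -2.0]  # made-up scores for a 5-token vocabulary
result = filter_distribution(toy_logits, top_k=3)
print(result)
```

With these toy logits only the two most likely tokens survive, which is why low temperature combined with top-p tends to give conservative translations.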

📊 Quantitative Performance

| Metric | Base AraGPT-2 | SlangGPT |
| --- | --- | --- |
| chrF | 10.62 | 29.08 |
| BLEU | 0.02 | 6.63 |
| chrF improvement | n/a | +18.46 (+173.8%) |
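The improvement row can be checked from the two chrF scores in the table; the relative gain is measured against the base model's score:

```latex
\Delta_{\text{chrF}} = 29.08 - 10.62 = 18.46,
\qquad
\frac{\Delta_{\text{chrF}}}{\text{chrF}_{\text{base}}} = \frac{18.46}{10.62} \approx 1.738
```

That is a relative gain of about 173.8%, or roughly 174% when rounded.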

Metric Notes

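  • chrF is an F-score over character n-grams, so it gives partial credit when a hypothesis shares stems or affixes with the reference.
  • BLEU measures modified word n-gram precision with a brevity penalty, so it rewards only exact word matches.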
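To show what character n-gram overlap means in practice, here is a simplified sentence-level chrF sketch. It assumes the usual definition (precision and recall averaged over n-gram orders 1 to 6, combined with beta = 2); `simple_chrf` is a hypothetical helper, not the evaluation script used for the table, so its scores will not match the reported numbers exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("أريد الطعام", "أريد الطعام"))
print(simple_chrf("أريد الطعام", "أريد طعاما"))
```

The second call still scores well above zero even though only one whole word matches, which is the behaviour that makes chrF more forgiving than BLEU on short Arabic sentences.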

🚀 Usage

1. Install Dependencies

pip install transformers torch accelerate

The accelerate package is needed because the loading step uses device_map="auto".

2. Load Model and Tokenizer

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AdhamAshraf/SlangGPT"

tokenizer = AutoTokenizer.from_pretrained(model_name)

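# GPT-2 style tokenizers often ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Decoder-only models should be padded on the left so that generation
# continues directly from the last real prompt token.
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision for GPU; torch.float32 is the safer choice on CPU
    device_map="auto"  # requires the accelerate package
)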

model.eval()

3. Translation Function

def translate(egyptian_text):
    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"

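    # Caution: truncation cuts from the right by default, so an input longer
    # than max_length tokens would lose the trailing "msa:" cue.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=64
    )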

    inputs = {
        k: v.to(model.device)
        for k, v in inputs.items()
    }

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    full = tokenizer.decode(
        outputs[0],
        skip_special_tokens=True
    )

    if "msa:" in full:
        return full.split("msa:")[-1].strip()

    return full
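Because sampling is enabled, the model may occasionally run past the translation and start a new `dialect: ... msa:` pair; in that case `split("msa:")[-1]` returns the text after the last marker rather than the first translation. A more defensive post-processing step could look like the sketch below. `extract_translation` is a hypothetical helper, not part of the released code.

```python
def extract_translation(full_text: str) -> str:
    """Return only the first MSA translation from a decoded generation."""
    # Keep what follows the first "msa:" marker, i.e. the one that ends the prompt.
    _, marker, after = full_text.partition("msa:")
    if not marker:
        return full_text.strip()
    # If generation ran on into a new example, cut it off at the next "dialect:".
    return after.split("dialect:", 1)[0].strip()


print(extract_translation("dialect: عايز اكل ↔ msa: أريد الطعام"))
```

Inside `translate`, this helper would replace the final `if "msa:" in full` block.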

4. Example Usage

print(translate("يلا فين؟"))
# هيا، أين أنت؟

print(translate("إنت رايح فين؟"))
# أين أنت ذاهب؟

print(translate("عايز اكل"))
# أريد الطعام

🌐 Interactive Web App

Try the live demo here:

https://huggingface.co/spaces/AdhamAshraf/SlangGPT

The Space allows users to:

  • Translate Egyptian Arabic to MSA
  • Submit feedback
  • Rate translation quality
  • Help improve future versions of SlangGPT

📊 Training Dataset

SlangGPT was fine-tuned using:

AdhamAshraf/egyptian-2-arabic

Dataset statistics:

| Property | Value |
| --- | --- |
| Total samples | 18,250 |
| Format | Parallel Egyptian ↔ MSA |
| Train split | 80% |
| Validation split | 10% |
| Test split | 10% |
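For reference, the percentages above work out to 14,600 / 1,825 / 1,825 samples. The sketch below shows one way to produce such a split reproducibly; the seed and the shuffling procedure are assumptions, and the actual split membership in the published dataset may differ.

```python
import random

TOTAL_SAMPLES = 18_250
TRAIN_PERCENT = 80
VALIDATION_PERCENT = 10  # the remaining 10% becomes the test split


def split_indices(n: int, seed: int = 42) -> dict:
    """Shuffle sample indices with a fixed (hypothetical) seed and cut 80/10/10."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = n * TRAIN_PERCENT // 100
    n_val = n * VALIDATION_PERCENT // 100
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


splits = split_indices(TOTAL_SAMPLES)
print({name: len(ids) for name, ids in splits.items()})
```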

Preprocessing Steps

  • Diacritic removal
  • Punctuation normalization
  • English text filtering

The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.


🧪 Evaluation & Feedback

The model was evaluated using:

  • chrF
  • BLEU

User feedback collected through the Gradio Space is publicly stored in:

https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

This feedback dataset supports:

  • RLHF research
  • Translation verification
  • Reward model training
  • Error analysis

📜 License

This project is released under the MIT License.

Free for academic and commercial use with attribution.


🙏 Acknowledgements

  • AraGPT-2 by Antoun et al. (2021)
  • Stanford CS224N framework and educational materials
  • The Arabic NLP open-source community

📚 Citation

@software{slanggpt2026,
  author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
  title = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
  year = {2026},
  url = {https://github.com/adhamashraf7788/SlangGPT}
}

@dataset{egyptian_2_arabic,
  author = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
  title = {Egyptian Arabic Slang to Formal Arabic Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
}

โ“ Questions & Issues

For bugs, issues, or feature requests:

https://github.com/adhamashraf7788/SlangGPT/issues
