SlangGPT: Egyptian Arabic → Modern Standard Arabic (MSA)

SlangGPT is a fine-tuned AraGPT-2-medium model that translates Egyptian Arabic slang/dialect into Modern Standard Arabic (MSA).

It is part of the broader SlangGPT project, an end-to-end Arabic NLP system for dialect translation and translation verification.

📄 Project Resources

Paper: https://github.com/adhamashraf7788/SlangGPT/blob/main/report/SlangGPT_report.pdf
Main Dataset: https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic
Feedback Dataset: https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset
GitHub Repository: https://github.com/adhamashraf7788/SlangGPT
Interactive Demo (Hugging Face Space): https://huggingface.co/spaces/AdhamAshraf/SlangGPT

🧠 Model Description

SlangGPT is a decoder-only causal language model built on top of the base model aubmindlab/aragpt2-medium.

The model was fine-tuned on Egyptian Arabic ↔ MSA parallel text using conditional autoregressive training.

Prompt Format

dialect: {input} ↔ msa:

The model generates the Modern Standard Arabic translation autoregressively.
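The prompt template above can be wrapped in two small helpers: one that builds the prompt and one that strips it from the decoded output. This is a minimal sketch; the names `build_prompt` and `extract_msa` are hypothetical and are not taken from the SlangGPT code base.

```python
PROMPT_TEMPLATE = "dialect: {input} ↔ msa:"  # template from the Prompt Format section
MSA_MARKER = "msa:"


def build_prompt(dialect_text: str) -> str:
    """Fill the prompt template with a whitespace-trimmed dialect sentence."""
    return PROMPT_TEMPLATE.format(input=dialect_text.strip())


def extract_msa(decoded: str) -> str:
    """Return the text after the last 'msa:' marker, or the whole text if absent."""
    if MSA_MARKER in decoded:
        return decoded.split(MSA_MARKER)[-1].strip()
    return decoded.strip()


# The decoded output of a causal LM contains the prompt followed by the continuation.
prompt = build_prompt("  عايز اكل ")
print(prompt)
print(extract_msa(prompt + " أريد الطعام"))
```

Splitting on the last marker keeps the helper robust if the input sentence itself happens to contain the string "msa:".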

✨ Key Features

Input: Egyptian Arabic slang/dialect
Output: Modern Standard Arabic (MSA)
Architecture: GPT-2 style decoder-only transformer
Tokenizer: BPE tokenizer with 64k vocabulary
Context length: 1024 tokens
Language: Arabic

⚙️ Training Configuration

Parameter	Value
Batch size	8 (effective 32)
Learning rate	5e-5
Scheduler	Cosine
Warmup	10%
Gradient clipping	1.0
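To make the scheduler settings concrete, the sketch below computes the learning rate at a given step for a linear warmup over the first 10% of training followed by cosine decay to zero. This is an assumption about the shape of the schedule: the table only names "Cosine" and "10%", and the helper `lr_at` and the step count used in the example are hypothetical.

```python
import math

BASE_LR = 5e-5          # learning rate from the table above
WARMUP_FRACTION = 0.10  # warmup share from the table above


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step` under linear warmup + cosine decay (assumed shape)."""
    warmup_steps = int(total_steps * WARMUP_FRACTION)
    if step < warmup_steps:
        # Linear ramp from 0 up to BASE_LR.
        return BASE_LR * step / max(1, warmup_steps)
    # Cosine decay from BASE_LR down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only for illustration.
TOTAL_STEPS = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, TOTAL_STEPS))
```

The effective batch size of 32 from a per-step batch of 8 implies a factor of 4, which could come from gradient accumulation, multiple devices, or both; the table does not say which.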

🎛️ Inference Configuration

Parameter	Value
Temperature	0.7
Top-k	50
Top-p	0.92
Repetition penalty	1.3
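The following toy sketch illustrates how temperature, top-k, and top-p interact when choosing the next token. It is a simplified pure-Python illustration, not the implementation used by the transformers library, and it leaves out the repetition penalty. The function name `sampling_distribution` and the toy logits are hypothetical.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=50, top_p=0.92):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalised cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalise over the surviving tokens so the result sums to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sampling_distribution(toy_logits))
```

Lower temperature and smaller top-k or top-p values all concentrate probability on fewer tokens, which trades diversity for more conservative translations.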

📊 Quantitative Performance

Metric	Base AraGPT-2	SlangGPT
chrF	10.62	29.08
BLEU	0.02	6.63
chrF Improvement	n/a	+18.46 (+174%)

Metric Notes

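chrF measures character n-gram overlap, computed as an F-score over character n-grams of the hypothesis and the reference.
BLEU measures word n-gram precision, combined with a brevity penalty for translations that are too short.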
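As an illustration of what a character n-gram F-score computes, here is a simplified sketch. It is not the scorer used to produce the numbers above, and established tools differ in details such as whitespace handling. The function name `simple_chrf` and the defaults (n-grams up to length 6, beta of 2) are assumptions chosen to mirror common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale: averaged n-gram precision/recall, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("أريد الطعام", "أريد طعاما"))
```

Because the score works on characters, a hypothesis that shares stems or affixes with the reference still earns partial credit, which is one reason chrF is often reported alongside BLEU for morphologically rich languages.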

🚀 Usage

1. Install Dependencies

pip install transformers torch accelerate

2. Load Model and Tokenizer

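from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AdhamAshraf/SlangGPT"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Fall back to the EOS token for padding if the tokenizer defines no pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Pad on the left so generation continues directly from the end of the prompt.
tokenizer.padding_side = "left"

# device_map="auto" relies on the accelerate package to place the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Switch to inference mode (disables dropout).
model.eval()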

3. Translation Function

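def translate(egyptian_text):
    """Translate one Egyptian Arabic sentence into MSA."""
    # Build the prompt in the documented format.
    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"

    # Note: truncation cuts from the right by default, so an input longer
    # than max_length tokens would lose the trailing "↔ msa:" marker.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=64
    )

    # Move the input tensors to the device the model was loaded on.
    inputs = {
        k: v.to(model.device)
        for k, v in inputs.items()
    }

    # Sample with the settings from the Inference Configuration table.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # The decoded text contains the prompt followed by the continuation.
    full = tokenizer.decode(
        outputs[0],
        skip_special_tokens=True
    )

    # Keep only the text after the last "msa:" marker.
    if "msa:" in full:
        return full.split("msa:")[-1].strip()

    # Fallback: the marker was not found, so return the decoded text as-is.
    return full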

4. Example Usage

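# Sampling is enabled (do_sample=True), so outputs can differ between runs.
# The comments below show example outputs.
print(translate("يلا فين؟"))
# هيا، أين أنت؟

print(translate("إنت رايح فين؟"))
# أين أنت ذاهب؟

print(translate("عايز اكل"))
# أريد الطعام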

🌐 Interactive Web App

Try the live demo here:

https://huggingface.co/spaces/AdhamAshraf/SlangGPT

The Space allows users to:

Translate Egyptian Arabic to MSA
Submit feedback
Rate translation quality
Help improve future versions of SlangGPT

📊 Training Dataset

SlangGPT was fine-tuned on the AdhamAshraf/egyptian-2-arabic dataset.

Dataset statistics:

Property	Value
Total samples	18,250
Format	Parallel Egyptian ↔ MSA
Train split	80%
Validation split	10%
Test split	10%
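The split percentages in the table translate into the following sample counts. The sketch below shows one way such a split could be produced with a seeded shuffle; the actual splitting procedure and seed are not stated here, so `split_indices` and its `seed` argument are hypothetical.

```python
import random

TOTAL_SAMPLES = 18_250  # from the statistics table


def split_indices(n: int, seed: int = 0):
    """Shuffle indices 0..n-1 and cut them into 80/10/10 train/validation/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 80 // 100  # integer arithmetic avoids float rounding
    n_val = n * 10 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


train, val, test = split_indices(TOTAL_SAMPLES)
print(len(train), len(val), len(test))  # 14600 1825 1825
```

Fixing the seed keeps the three splits disjoint and reproducible across runs.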

Preprocessing Steps

Diacritic removal
Punctuation normalization
English text filtering
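The three preprocessing steps listed above could look roughly like the sketch below. The exact rules used for the dataset are not documented here, so everything in it is an assumption: the diacritic range, the choice to collapse repeated punctuation and whitespace, and the choice to drop any pair containing Latin letters. All function names are hypothetical.

```python
import re

# Arabic tashkeel marks (U+064B to U+0652) plus the superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Runs of the same punctuation mark, including Arabic comma, semicolon, question mark.
REPEATED_PUNCT = re.compile(r"([!?.,\u060C\u061B\u061F])\1+")
WHITESPACE = re.compile(r"\s+")
LATIN = re.compile(r"[A-Za-z]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic short-vowel and related marks."""
    return DIACRITICS.sub("", text)


def normalize_punctuation(text: str) -> str:
    """Collapse repeated punctuation and runs of whitespace (assumed rule)."""
    text = REPEATED_PUNCT.sub(r"\1", text)
    return WHITESPACE.sub(" ", text).strip()


def contains_english(text: str) -> bool:
    """True if the text contains any Latin letter (assumed filter criterion)."""
    return LATIN.search(text) is not None


def preprocess(pairs):
    """Clean (dialect, msa) pairs and drop pairs that contain Latin letters."""
    cleaned = []
    for dialect, msa in pairs:
        if contains_english(dialect) or contains_english(msa):
            continue
        cleaned.append(tuple(
            normalize_punctuation(strip_diacritics(s)) for s in (dialect, msa)
        ))
    return cleaned


print(preprocess([("عايز اكل!!", "أُريدُ الطعام"), ("ok يلا", "هيا")]))
```

Applying the same cleaning to both sides of each pair keeps the source and target text consistent at training time.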

The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.

🧪 Evaluation & Feedback

The model was evaluated using:

chrF
BLEU

User feedback collected through the Gradio Space is publicly stored in:

https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

This feedback dataset supports:

RLHF research
Translation verification
Reward model training
Error analysis

📜 License

This project is released under the MIT License.

Free for academic and commercial use with attribution.

🙏 Acknowledgements

AraGPT-2 by Antoun et al. (2021)
Stanford CS224N framework and educational materials
The Arabic NLP open-source community

📚 Citation

@software{slanggpt2026,
  author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
  title = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
  year = {2026},
  url = {https://github.com/adhamashraf7788/SlangGPT}
}

@dataset{egyptian_2_arabic,
  author = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
  title = {Egyptian Arabic Slang to Formal Arabic Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
}

❓ Questions & Issues

For bugs, issues, or feature requests:

https://github.com/adhamashraf7788/SlangGPT/issues
