AraFusion-AR-381M
AraFusion-AR-381M is a 381M-parameter autoregressive causal language model for Arabic, trained on 14.1 billion tokens from FineWeb-2 Arabic subsets. It serves as the autoregressive baseline for the AraFusion project, using the same architecture size, tokenizer, and training data as AraFusion-MDLM-381M to enable direct comparison between autoregressive and masked diffusion language modelling for Arabic.
Training Curves
Figures: training loss, validation loss, and learning rate schedule (training_loss.png, validation_loss.png, learning_rate.png).
Model Details
| Attribute | Value |
|---|---|
| Architecture | Autoregressive Causal Transformer |
| Parameters | 381M |
| Layers | 24 |
| Hidden size | 1024 |
| Attention heads | 16 |
| FFN dimension | 4096 |
| Max sequence length | 512 |
| Vocabulary | 76,800 (MorphBPE) |
| Position encoding | Learned |
| Weight tying | Yes (embedding ↔ LM head) |
| Precision | BFloat16 |
| Framework | PyTorch 2.x + CUDA 13.0 |
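As a sanity check on the 381M figure, the table values can be plugged into a standard Transformer parameter count. This is a rough sketch, not the model's own code: it assumes biases on every linear layer, two LayerNorms per block, and one final LayerNorm, none of which the table states explicitly.

```python
# Values taken from the Model Details table.
VOCAB, D_MODEL, N_LAYERS, D_FFN, MAX_LEN = 76_800, 1024, 24, 4096, 512


def linear(n_in, n_out):
    """Parameters of a dense layer with bias (bias is an assumption)."""
    return n_in * n_out + n_out


attention = 4 * linear(D_MODEL, D_MODEL)                # Q, K, V and output projections
ffn = linear(D_MODEL, D_FFN) + linear(D_FFN, D_MODEL)   # two-layer feed-forward block
layer_norms = 2 * (2 * D_MODEL)                         # two LayerNorms (scale + shift) per block
per_layer = attention + ffn + layer_norms

token_embedding = VOCAB * D_MODEL       # counted once: shared with the LM head
position_embedding = MAX_LEN * D_MODEL  # learned positions
final_norm = 2 * D_MODEL                # assumption: one LayerNorm after the last block

total = N_LAYERS * per_layer + token_embedding + position_embedding + final_norm
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the total lands at roughly 381M, and the shared embedding matrix accounts for about 79M of it, which is the amount weight tying saves.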
Key Architecture Decisions
- Causal (upper-triangular) mask: Applied at every self-attention layer, enforcing strict left-to-right generation. Implemented via `nn.TransformerEncoderLayer` with an additive −∞ mask over future positions, i.e. a decoder without cross-attention.
- Pre-LayerNorm (`norm_first=True`): LayerNorm applied before the attention and FFN sub-layers, improving training stability at large scale.
- No padding mask at inference: Training data is pre-packed to fixed 512-token sequences; no padding exists in the training data, so SDPA can take its fastest kernel path.
- Weight-tied LM head: `lm_head.weight = token_embedding.weight`, reducing parameters by ~79M while improving generalisation.
- Top-p nucleus sampling: Generation uses nucleus sampling with configurable temperature and `top_p` for quality vs. diversity tradeoff.
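The effect of the additive causal mask can be illustrated without any framework. The sketch below uses only the standard library; the helper names are hypothetical and this is not the model's implementation. It builds the upper-triangular mask and shows that, after softmax, a query position places zero weight on future positions.

```python
import math


def causal_additive_mask(seq_len):
    """0.0 where a query may attend (key <= query), -inf for future keys."""
    return [
        [0.0 if key <= query else float("-inf") for key in range(seq_len)]
        for query in range(seq_len)
    ]


def masked_softmax(scores, mask_row):
    """Softmax over one query's attention scores after adding its mask row."""
    masked = [s + m for s, m in zip(scores, mask_row)]
    peak = max(masked)                          # finite: position 0 is always visible
    exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_additive_mask(4)
# Attention weights for the query at position 1: only keys 0 and 1 are visible.
weights = masked_softmax([0.5, 1.0, -0.3, 2.0], mask[1])
print(weights)
```

Because the mask is added before the softmax, no extra renormalisation is needed: the visible positions already sum to one.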
Relationship to AraFusion-MDLM-381M
This model is an architectural twin of AraFusion-MDLM-381M, sharing the same:
- Layer count, hidden size, attention heads, FFN dimension, and vocabulary (24L / 1024d / 16H / FFN=4096 / 76.8K vocab)
- Tokenizer (MorphBPE, `AraFus/AraFusion-MDLM-381M`)
- Training data source and preprocessing pipeline
- Optimizer, learning rate, and warmup schedule
The only difference is the training objective: causal next-token prediction (AR) vs. absorbing-state masked diffusion (MDLM). This makes the two models a controlled baseline comparison.
Training Data
Source
FineWeb-2 Arabic subsets, covering three dialect groups:
| Dialect | ISO Code | Natural Distribution | Training Distribution |
|---|---|---|---|
| Modern Standard Arabic (MSA) | `arb_Arab` | ~98% | 90% |
| Najdi Arabic | `ars_Arab` | ~0.8% | 5% |
| Egyptian Arabic | `arz_Arab` | ~1.2% | 5% |
Dialect balancing: Power=0.38 inverse-frequency resampling gently boosts minority dialects from ~1% to 5% while keeping MSA dominant, avoiding memorisation from extreme oversampling.
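The exact resampling formula is not given here. One plausible reading, sketched below as an assumption, is that every document receives a sampling weight of (1 / dialect frequency)^0.38, so a dialect's resulting share is proportional to frequency^(1 − 0.38). With the natural distribution from the table this lands close to, though not exactly on, the 90/5/5 split.

```python
# Natural distribution from the table above (fractions of the corpus).
natural = {"arb_Arab": 0.98, "ars_Arab": 0.008, "arz_Arab": 0.012}
POWER = 0.38


def resampled_shares(natural, power):
    """Share of each dialect when each document is weighted by (1/p)**power.

    This weighting scheme is an assumption, not the documented pipeline.
    """
    mass = {name: p * (1.0 / p) ** power for name, p in natural.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


shares = resampled_shares(natural, POWER)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

A power of 0 would leave the natural distribution untouched and a power of 1 would equalise all three dialects, so 0.38 sits toward the gentle end of that range.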
Data Processing Pipeline
- Download: FineWeb-2 Arabic subsets via HuggingFace Datasets
- Tokenize: MorphBPE tokenizer (76,800 vocab, morpheme-boundary-aware)
- Pack: Sequences packed to exactly 512 tokens (no padding)
- Shard: Split into 453 training shards + 10 validation shards
- Convert: Parquet → numpy memmap for fast random-access (one-time conversion)
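The packing step can be sketched as follows. This is an illustration under stated assumptions rather than the project's pipeline: documents are joined with the EOS token (ID 2) as a separator, and a trailing partial sequence is dropped. The function name is hypothetical.

```python
def pack_sequences(token_streams, seq_len=512, eos_id=2):
    """Concatenate tokenized documents and cut them into fixed-length sequences.

    Assumptions: documents are separated by `eos_id`, and any leftover tokens
    that do not fill a whole sequence are discarded.
    """
    buffer, packed = [], []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed


# Toy example with a sequence length of 4 instead of 512.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
packed = pack_sequences(docs, seq_len=4)
print(packed)
```

Every packed sequence has exactly `seq_len` tokens, which is why no padding mask is needed downstream.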
Data Statistics
| Split | Shards | Sequences | Tokens | Size on Disk |
|---|---|---|---|---|
| Train | 453 | 109,782,285 | 56,208,529,920 (56.2B) | 237 GB |
| Validation | 10 | 2,423,450 | 1,240,806,400 (1.24B) | 4.9 GB |
| Total | 463 | 112,205,735 | 57,449,336,320 (57.4B) | 242 GB |
Token & Segment Calculations
| Metric | Value |
|---|---|
| Tokens per sequence | 512 |
| Sequences per GPU per micro-step | 48 |
| GPUs | 3 |
| Gradient accumulation steps | 3 |
| Global batch (per micro-step) | 144 sequences / 73,728 tokens |
| Effective global batch (per optimizer step) | 432 sequences / 221,184 tokens |
| Total micro-batch steps | 191,000 |
| Total optimizer steps | 63,667 |
| Total tokens seen | 14,082,048,000 (~14.1B) |
| Epochs over training data | ~0.25 |
| Chinchilla-optimal tokens (381M params) | ~7.62B |
| Over-training factor | 1.85x Chinchilla |
Note on token count: The training dataset contains 56.2B tokens, but this run covered only ~0.25 epochs (~14.1B tokens). The 191,000 micro-batch step count matches AraFusion-MDLM-381M, enabling a step-for-step comparison, but each step sees fewer tokens because this run used 3 GPUs (vs. 4 for MDLM) with gradient accumulation of 3.
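The figures in the table follow from a few multiplications, reproduced below as a quick check. The 20-tokens-per-parameter Chinchilla rule of thumb is an assumption, chosen because it matches the ~7.62B figure in the table.

```python
import math

# Inputs from the tables above.
SEQ_LEN = 512
PER_GPU_BATCH = 48
NUM_GPUS = 3
GRAD_ACCUM = 3
MICRO_STEPS = 191_000
TRAIN_TOKENS = 56_208_529_920
PARAMS = 381e6

tokens_per_micro_step = SEQ_LEN * PER_GPU_BATCH * NUM_GPUS       # one forward/backward pass on all GPUs
tokens_per_optimizer_step = tokens_per_micro_step * GRAD_ACCUM   # after gradient accumulation
optimizer_steps = math.ceil(MICRO_STEPS / GRAD_ACCUM)
total_tokens = tokens_per_micro_step * MICRO_STEPS
epochs = total_tokens / TRAIN_TOKENS
chinchilla_tokens = 20 * PARAMS                                  # assumed rule of thumb
over_training = total_tokens / chinchilla_tokens

print(f"{total_tokens:,} tokens, {epochs:.2f} epochs, {over_training:.2f}x Chinchilla")
```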
Training Configuration
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 (peak) |
| LR schedule | Cosine decay to ~0 |
| Warmup steps | 2,000 |
| Total micro-batch steps | 191,000 |
| Total optimizer steps | 63,667 |
| Batch size (per GPU) | 48 |
| Global batch size (effective) | 432 sequences |
| Gradient accumulation | 3 |
| Gradient clipping | 1.0 |
| AdamW betas | (0.9, 0.95) |
| Weight decay | 0.1 |
| Precision | BFloat16 (FSDP MixedPrecision) |
| Multi-GPU strategy | FSDP NO_SHARD (DDP equivalent) |
| Dropout | 0.1 |
| Seed | 42 |
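The learning rate schedule in the table (linear warmup, then cosine decay to roughly zero) can be written as a small function. This is a sketch, not the training script: it assumes the scheduler is stepped once per optimizer step, which the table does not state. If it is stepped per micro-batch instead, `total_steps` would be 191,000.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=2_000, total_steps=63_667, min_lr=0.0):
    """Learning rate at a given step: linear warmup followed by cosine decay.

    `total_steps` assumes one scheduler step per optimizer step (an assumption).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the end of training
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 30_000, 63_667):
    print(step, f"{lr_at(step):.2e}")
```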
Hardware
| Resource | Specification |
|---|---|
| GPUs | 3× NVIDIA H200 (140 GB HBM3e each) |
| Total GPU memory | ~421 GB |
| CPU | 96 cores / 192 logical (2× Xeon) |
| RAM | ~2.0 TB |
| Storage | Local /scratch SSD (225 GB data copied from NFS at job start) |
| SLURM cluster | crirdchpxd004, partition gpu-H200 |
Performance
| Metric | Value |
|---|---|
| Micro-step time | 0.286 s |
| Optimizer step time | 0.859 s |
| Throughput (per GPU, rank 0) | 86,197 tokens/sec |
| Throughput (total, 3 GPUs) | ~258,600 tokens/sec |
| MFU (dense BF16) | ~20% |
| Total training time | ~15.2 hours |
Note on MFU: The ~20% MFU (vs. 67% for AraFusion-MDLM-381M) reflects that the AR training does not use `torch.compile` and computes logits for all token positions (MDLM only computes logits at masked positions, ~50% of the sequence). These are known optimisation opportunities for future runs.
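The ~20% figure is consistent with the common 6 × parameters × tokens/sec estimate of training FLOPs. The sketch below is a back-of-the-envelope check, not the project's measurement code. It assumes a dense BF16 peak of about 989 TFLOP/s per H200, and it ignores attention FLOPs, which are small at a sequence length of 512.

```python
PARAMS = 381e6
TOKENS_PER_SEC_PER_GPU = 86_197   # rank-0 throughput from the Performance table
PEAK_BF16_FLOPS = 989e12          # assumption: dense BF16 peak of one H200

# Roughly 6 FLOPs per parameter per token: 2 forward + 4 backward.
flops_per_token = 6 * PARAMS
achieved_flops = flops_per_token * TOKENS_PER_SEC_PER_GPU
mfu = achieved_flops / PEAK_BF16_FLOPS

print(f"achieved: {achieved_flops / 1e12:.0f} TFLOP/s  MFU: {mfu:.1%}")
```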
Training Loss
| Milestone | Step | Train Loss | Val Loss | Tokens Seen |
|---|---|---|---|---|
| Final | 191,000 | 2.747 | 2.610 | 14.1B |
The AR training loss is standard cross-entropy (next-token prediction), directly comparable to other autoregressive Arabic LMs. The MDLM training loss uses a noise-weighted diffusion objective and is not numerically comparable to this value.
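For comparison with models that report perplexity, the cross-entropy values convert directly. The sketch assumes the losses are mean per-token values in nats, the default for PyTorch's cross-entropy, which is not stated explicitly here. Per-token perplexity also depends on the tokenizer, so it is only comparable across models sharing the same vocabulary.

```python
import math

train_loss, val_loss = 2.747, 2.610   # final values from the table above

train_ppl = math.exp(train_loss)      # perplexity = exp(mean cross-entropy in nats)
val_ppl = math.exp(val_loss)
val_bits = val_loss / math.log(2)     # the same loss in bits per token

print(f"train ppl: {train_ppl:.1f}  val ppl: {val_ppl:.1f}  val bits/token: {val_bits:.2f}")
```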
Tokenizer
This model uses MorphBPE, a morphology-aware BPE tokenizer designed for Arabic. It applies morphological segmentation before sub-word tokenization, ensuring Arabic morpheme boundaries (prefixes like ال، بال، وال and suffixes like ون، ين، ات، ها) are respected.
| Property | Value |
|---|---|
| Vocabulary size | 76,800 |
| Pad token ID | 5 |
| BOS token ID | 1 |
| EOS token ID | 2 |
| Mask token ID | 62 (kept for MDLM interface parity, unused during AR) |
| Load with | AutoTokenizer.from_pretrained("AraFus/AraFusion-AR-381M", subfolder="base_tokenizer") |
Quick Start
```bash
pip install torch safetensors transformers huggingface_hub
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AraFus/AraFusion-AR-381M",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "AraFus/AraFusion-AR-381M",
    subfolder="base_tokenizer",
)
model.eval()
```
Autoregressive Generation (Top-p Nucleus Sampling)
The model generates left-to-right using nucleus sampling. Unlike MDLM, generation is sequential: each token is conditioned on all previously generated tokens.
```python
import torch

device = next(model.parameters()).device
prompt = "الذكاء الاصطناعي هو"
prompt_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)

# Call generate() on the wrapped model (model.model) with sampling parameters.
generated = model.model.generate(
    prompt_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0].tolist(), skip_special_tokens=True))
```
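To show what the `temperature` and `top_p` arguments control, here is a minimal standard-library sketch of nucleus sampling over a toy logit vector. The function names and the toy logits are illustrative assumptions; this is not the model's own sampling code.

```python
import math
import random


def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return {token_id: prob} for the smallest set of tokens whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]        # temperature < 1 sharpens the distribution
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]       # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:                                   # add tokens from most to least likely
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}         # renormalise inside the nucleus


def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Draw one token id from the nucleus."""
    nucleus = top_p_filter(logits, temperature, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
nucleus = top_p_filter(toy_logits)
token = sample_top_p(toy_logits, rng=random.Random(0))
print(sorted(nucleus), token)
```

Lowering `top_p` shrinks the candidate set and makes output more conservative; raising `temperature` flattens the distribution and lets more tokens into the nucleus.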
Using the HuggingFace generate Interface
```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=128,
    eos_token_id=2,
    pad_token_id=5,
)
inputs = tokenizer(
    "الذكاء الاصطناعي هو",
    return_tensors="pt",
    add_special_tokens=False,
).to(device)
output_ids = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
Batch Inference
```python
texts = [
    "اللغة العربية هي",
    "القاهرة مدينة",
    "التعلم الآلي يستخدم",
]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=64,
    add_special_tokens=False,
).to(device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )
for ids in output_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
    print("---")
```
Perplexity Evaluation
```python
import torch
import torch.nn.functional as F


def compute_perplexity(model, tokenizer, texts, device, max_length=512):
    """Token-level perplexity over a list of texts (summed loss / token count)."""
    model.eval()
    total_loss = 0.0
    total_tokens = 0
    for text in texts:
        ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
        ids = ids[:, :max_length].to(device)  # truncate to the model's context length
        with torch.no_grad():
            logits = model(ids).logits  # [1, L, V]
        # Position t predicts token t + 1, so drop the last logit and the first label.
        shift_logits = logits[:, :-1, :]
        shift_labels = ids[:, 1:]
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += shift_labels.numel()
    return torch.exp(torch.tensor(total_loss / total_tokens)).item()


ppl = compute_perplexity(model, tokenizer, ["النص العربي المراد تقييمه هنا"], device)
print(f"Perplexity: {ppl:.2f}")
```
Files
| File | Description |
|---|---|
| `model.safetensors` | Model weights (~1.84 GB, safetensors format) |
| `config.json` | HuggingFace model configuration (ARAr381MConfig parameters) |
| `configuration_ar.py` | HF PretrainedConfig subclass; required for trust_remote_code=True loading |
| `modeling_ar.py` | HF PreTrainedModel subclass (ARAr381MForCausalLM); required for trust_remote_code=True loading |
| `ar_baseline.py` | Raw PyTorch model class (ARBaseline, ARConfig); standalone, no HF dependency |
| `morphbpe.py` | MorphBPE tokenizer class (morpheme-boundary-aware Arabic BPE) |
| `morphbpe_config.json` | MorphBPE tokenizer configuration |
| `base_tokenizer/` | Underlying HuggingFace BPE tokenizer files |
| `training_metadata.json` | Training provenance: step, losses, hardware, upload timestamp |
| `training_loss.png` | Training loss curve (WandB export) |
| `validation_loss.png` | Validation loss curve (WandB export) |
| `training_validation_loss.png` | Combined training & validation loss |
| `learning_rate.png` | Learning rate schedule |
Citation
@misc{arafusion2026,
title={AraFusion: Arabic Masked Diffusion Language Model with Classifier-Free Guidance},
year={2026},
}
License
Apache 2.0


