AraFusion-AR-381M

AraFusion-AR-381M is a 381M-parameter autoregressive causal language model for Arabic, trained on 14.1 billion tokens from FineWeb-2 Arabic subsets. It serves as the autoregressive baseline for the AraFusion project, using the same architecture size, tokenizer, and training data as AraFusion-MDLM-381M to enable direct comparison between autoregressive and masked diffusion language modelling for Arabic.

Training Curves

Training Loss

[Figure: training loss curve]

Validation Loss

[Figure: validation loss curve]

Learning Rate Schedule

[Figure: learning rate schedule]

Model Details

| Attribute | Value |
| --- | --- |
| Architecture | Autoregressive Causal Transformer |
| Parameters | 381M |
| Layers | 24 |
| Hidden size | 1024 |
| Attention heads | 16 |
| FFN dimension | 4096 |
| Max sequence length | 512 |
| Vocabulary | 76,800 (MorphBPE) |
| Position encoding | Learned |
| Weight tying | Yes (embedding ↔ LM head) |
| Precision | BFloat16 |
| Framework | PyTorch 2.x + CUDA 13.0 |
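As a sanity check on the 381M figure, the parameter count can be reproduced from the table. The sketch below assumes the standard nn.TransformerEncoderLayer layout (biased linear layers, two LayerNorms per block), a single final LayerNorm, and a tied LM head that adds no parameters of its own. Those details are assumptions, not taken from the released code.

```python
# Rough parameter count from the Model Details table.
# Assumed layout: biased linear layers, two LayerNorms per block,
# one final LayerNorm, LM head tied to the token embedding.
LAYERS, D, FFN, VOCAB, MAX_LEN = 24, 1024, 4096, 76_800, 512


def count_parameters() -> dict:
    attention = (3 * D * D + 3 * D) + (D * D + D)   # QKV projection + output projection
    feed_forward = (D * FFN + FFN) + (FFN * D + D)  # up- and down-projection
    layer_norms = 2 * (2 * D)                       # weight + bias, twice per block
    per_layer = attention + feed_forward + layer_norms
    counts = {
        "transformer_blocks": LAYERS * per_layer,
        "token_embedding": VOCAB * D,       # shared with the LM head
        "position_embedding": MAX_LEN * D,  # learned positions
        "final_layer_norm": 2 * D,
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    for name, value in count_parameters().items():
        print(f"{name:>20}: {value / 1e6:8.2f}M")
```

Under these assumptions the total comes to roughly 381.5M, of which the shared embedding matrix accounts for about 78.6M.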

Key Architecture Decisions

  • Causal (upper-triangular) mask: Applied at every self-attention layer, enforcing strict left-to-right generation. Implemented via nn.TransformerEncoderLayer with an additive −∞ mask over future positions, which makes the stack a decoder without cross-attention.
  • Pre-LayerNorm (norm_first=True): LayerNorm applied before the attention and FFN sub-layers, improving training stability at large scale.
  • No padding mask at inference: Training data is pre-packed to fixed 512-token sequences, so it contains no padding and SDPA can take its fastest kernel path.
  • Weight-tied LM head: lm_head.weight = token_embedding.weight, reducing parameters by ~79M while improving generalisation.
  • Top-p nucleus sampling: Generation uses nucleus sampling with configurable temperature and top_p for quality vs. diversity tradeoff.
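The masking, Pre-LayerNorm and weight-tying decisions listed above can be sketched with a toy model. This is a minimal illustration built on nn.TransformerEncoderLayer, not the code in modeling_ar.py or ar_baseline.py: the class name TinyCausalLM, the small sizes, and the final LayerNorm are assumptions made for the example.

```python
import torch
import torch.nn as nn


class TinyCausalLM(nn.Module):
    """Toy stand-in for the design above; sizes are illustrative, not the 381M config."""

    def __init__(self, vocab=100, d_model=32, heads=4, ffn=64, layers=2, max_len=16):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)  # learned positions
        block = nn.TransformerEncoderLayer(
            d_model, heads, dim_feedforward=ffn, dropout=0.0,
            batch_first=True, norm_first=True,  # Pre-LayerNorm
        )
        self.blocks = nn.TransformerEncoder(
            block, num_layers=layers, enable_nested_tensor=False
        )
        self.final_norm = nn.LayerNorm(d_model)  # assumed; common with Pre-LN stacks
        self.lm_head = nn.Linear(d_model, vocab, bias=False)
        self.lm_head.weight = self.token_embedding.weight  # weight tying

    def forward(self, ids):
        seq_len = ids.size(1)
        positions = torch.arange(seq_len, device=ids.device)
        x = self.token_embedding(ids) + self.position_embedding(positions)
        # Additive mask: 0 on and below the diagonal, -inf strictly above it.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=ids.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=mask)  # no padding mask: inputs are assumed packed
        return self.lm_head(self.final_norm(x))
```

Because position t can only attend to positions up to t, changing the last input token leaves the logits at all earlier positions unchanged, which gives a quick way to test that the mask is wired correctly.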

Relationship to AraFusion-MDLM-381M

This model is an architectural twin of AraFusion-MDLM-381M, sharing the same:

  • Layer count, hidden size, attention heads, FFN dimension, and vocabulary (24L / 1024d / 16H / FFN=4096 / 76.8K vocab)
  • Tokenizer (MorphBPE, AraFus/AraFusion-MDLM-381M)
  • Training data source and preprocessing pipeline
  • Optimizer, learning rate, and warmup schedule

The only modelling difference is the training objective: causal next-token prediction (AR) vs. absorbing-state masked diffusion (MDLM). This makes the AR model a controlled baseline for the comparison.

Training Data

Source

FineWeb-2 Arabic subsets, covering three dialect groups:

| Dialect | ISO Code | Natural Distribution | Training Distribution |
| --- | --- | --- | --- |
| Modern Standard Arabic (MSA) | arb_Arab | ~98% | 90% |
| Najdi Arabic | ars_Arab | ~0.8% | 5% |
| Egyptian Arabic | arz_Arab | ~1.2% | 5% |

Dialect balancing: Inverse-frequency resampling with power 0.38 boosts each minority dialect from ~1% to 5% while keeping MSA dominant, which avoids the memorisation risk that comes with extreme oversampling.
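The exact resampling formula is not given here. One reading that roughly reproduces the table is to weight every document by (1 / p) ** 0.38, where p is its dialect's natural share, so that a dialect's resulting share is proportional to p ** 0.62. The sketch below implements that assumed reading; the function name and the formula itself are illustrative, not confirmed by the project.

```python
# Natural shares from the dialect table (fractions of the source data).
NATURAL = {"arb_Arab": 0.98, "ars_Arab": 0.008, "arz_Arab": 0.012}


def resample_shares(natural: dict, power: float = 0.38) -> dict:
    """Weight every document by (1 / p_dialect) ** power, then renormalise.

    A dialect's resulting share is proportional to p * p ** -power = p ** (1 - power).
    This is an assumed reading of "power 0.38 inverse-frequency resampling".
    """
    unnormalised = {name: p ** (1.0 - power) for name, p in natural.items()}
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}


if __name__ == "__main__":
    for name, share in resample_shares(NATURAL).items():
        print(f"{name}: {share:.1%}")
```

With these inputs MSA lands near 90% and each minority dialect within about a percentage point of 5%, which is close to, but not exactly, the published training mix.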

Data Processing Pipeline

  1. Download: FineWeb-2 Arabic subsets via HuggingFace Datasets
  2. Tokenize: MorphBPE tokenizer (76,800 vocab, morpheme-boundary-aware)
  3. Pack: Sequences packed to exactly 512 tokens (no padding)
  4. Shard: Split into 453 training shards + 10 validation shards
  5. Convert: Parquet → numpy memmap for fast random-access (one-time conversion)
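The packing step in the pipeline above can be sketched as follows. This is a minimal illustration, assuming documents are concatenated with an EOS separator (ID 2, as listed in the tokenizer table) and that a trailing remainder shorter than 512 tokens is dropped; the actual pipeline may handle document boundaries and leftovers differently.

```python
from typing import Iterable, Iterator, List

SEQ_LEN = 512
EOS_ID = 2  # EOS token ID from the tokenizer table


def pack_sequences(docs: Iterable[List[int]], seq_len: int = SEQ_LEN,
                   eos_id: int = EOS_ID) -> Iterator[List[int]]:
    """Concatenate tokenised documents and emit fixed-length chunks, no padding."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        buffer.append(eos_id)  # assumed document separator
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover shorter than seq_len is dropped rather than padded.


if __name__ == "__main__":
    toy_docs = [[10] * 700, [11] * 300, [12] * 100]
    packed = list(pack_sequences(toy_docs))
    print(len(packed), [len(seq) for seq in packed])
```

Every emitted sequence has exactly 512 tokens, so no padding mask is needed downstream.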

Data Statistics

| Split | Shards | Sequences | Tokens | Size on Disk |
| --- | --- | --- | --- | --- |
| Train | 453 | 109,782,285 | 56,208,529,920 (56.2B) | 237 GB |
| Validation | 10 | 2,423,450 | 1,240,806,400 (1.24B) | 4.9 GB |
| Total | 463 | 112,205,735 | 57,449,336,320 (57.4B) | 242 GB |

Token & Segment Calculations

| Metric | Value |
| --- | --- |
| Tokens per sequence | 512 |
| Sequences per GPU per micro-step | 48 |
| GPUs | 3 |
| Gradient accumulation steps | 3 |
| Global batch (per micro-step) | 144 sequences / 73,728 tokens |
| Effective global batch (per optimizer step) | 432 sequences / 221,184 tokens |
| Total micro-batch steps | 191,000 |
| Total optimizer steps | 63,667 |
| Total tokens seen | 14,082,048,000 (~14.1B) |
| Epochs over training data | ~0.25 |
| Chinchilla-optimal tokens (381M params) | ~7.62B |
| Over-training factor | 1.85x Chinchilla |

Note on token count: The training dataset contains 56.2B tokens, but this run covered only about 0.25 epochs of it. The 191,000 step count (micro-batch steps here) matches AraFusion-MDLM-381M, enabling a fair step-for-step comparison, but fewer tokens are seen per step because this run used 3 GPUs (vs. 4 for MDLM) with gradient accumulation of 3.
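The derived rows of the table follow directly from the batch and step settings. The short calculation below recomputes them. It assumes every micro-step processes a full global batch and uses the common rule of thumb of about 20 training tokens per parameter for the Chinchilla reference, which is inferred from the 7.62B figure rather than stated by the authors.

```python
SEQ_LEN = 512
SEQS_PER_GPU = 48
GPUS = 3
GRAD_ACCUM = 3
MICRO_STEPS = 191_000
TRAIN_TOKENS = 56_208_529_920
PARAMS = 381e6

tokens_per_micro_step = SEQ_LEN * SEQS_PER_GPU * GPUS           # 73,728
tokens_per_optimizer_step = tokens_per_micro_step * GRAD_ACCUM  # 221,184
optimizer_steps = -(-MICRO_STEPS // GRAD_ACCUM)                 # ceiling division
tokens_seen = MICRO_STEPS * tokens_per_micro_step
epochs = tokens_seen / TRAIN_TOKENS
chinchilla_tokens = 20 * PARAMS                                 # assumed 20 tokens/param
over_training = tokens_seen / chinchilla_tokens

if __name__ == "__main__":
    print(f"optimizer steps : {optimizer_steps:,}")
    print(f"tokens seen     : {tokens_seen:,}")
    print(f"epochs          : {epochs:.3f}")
    print(f"over-training   : {over_training:.2f}x")
```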

Training Configuration

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 3e-4 (peak) |
| LR schedule | Cosine decay to ~0 |
| Warmup steps | 2,000 |
| Total micro-batch steps | 191,000 |
| Total optimizer steps | 63,667 |
| Batch size (per GPU) | 48 |
| Global batch size (effective) | 432 sequences |
| Gradient accumulation | 3 |
| Gradient clipping | 1.0 |
| AdamW betas | (0.9, 0.95) |
| Weight decay | 0.1 |
| Precision | BFloat16 (FSDP MixedPrecision) |
| Multi-GPU strategy | FSDP NO_SHARD (DDP equivalent) |
| Dropout | 0.1 |
| Seed | 42 |
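The warmup and cosine-decay rows can be turned into a schedule function. The sketch below is one plausible form, assuming a linear warmup and a schedule indexed by micro-batch step; neither the warmup shape nor the step unit is stated in the table, and the training script may differ.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 191_000  # assumes the schedule is indexed by micro-batch step


def learning_rate(step: int, peak: float = PEAK_LR, warmup: int = WARMUP_STEPS,
                  total: int = TOTAL_STEPS, floor: float = 0.0) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 1_999, 2_000, 96_500, 191_000):
        print(f"step {step:>7,}: lr = {learning_rate(step):.3e}")
```

The rate peaks at 3e-4 at the end of warmup, passes through half the peak midway through the decay, and reaches the floor at the final step.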

Hardware

| Resource | Specification |
| --- | --- |
| GPUs | 3× NVIDIA H200 (140 GB HBM3e each) |
| Total GPU memory | ~421 GB |
| CPU | 96 cores / 192 logical (2× Xeon) |
| RAM | ~2.0 TB |
| Storage | Local /scratch SSD (225 GB data copied from NFS at job start) |
| SLURM cluster | crirdchpxd004, partition gpu-H200 |

Performance

| Metric | Value |
| --- | --- |
| Micro-step time | 0.286 s |
| Optimizer step time | 0.859 s |
| Throughput (per GPU, rank 0) | 86,197 tokens/sec |
| Throughput (total, 3 GPUs) | ~258,600 tokens/sec |
| MFU (dense BF16) | ~20% |
| Total training time | ~15.2 hours |

Note on MFU: The ~20% MFU (vs. 67% for AraFusion-MDLM-381M) reflects that the AR training does not use torch.compile and computes logits for all token positions (MDLM only computes logits at masked positions, ~50% of the sequence). These are known optimisation opportunities for future runs.
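The ~20% figure can be approximately reproduced from the throughput table. The sketch below uses the common estimate of 6 × N training FLOPs per token and assumes a dense BF16 peak of about 989 TFLOP/s per H200; both are assumptions for illustration, and the project's own MFU accounting may include other terms such as attention FLOPs.

```python
PARAMS = 381e6
TOKENS_PER_SEC = 258_600        # total across 3 GPUs, from the table
GPUS = 3
PEAK_FLOPS_PER_GPU = 989e12     # assumed H200 dense BF16 peak


def mfu(params: float, tokens_per_sec: float, gpus: int, peak: float) -> float:
    """Model FLOPs utilisation using the 6 * N FLOPs-per-token training estimate."""
    achieved = 6.0 * params * tokens_per_sec
    return achieved / (gpus * peak)


if __name__ == "__main__":
    print(f"MFU ~ {mfu(PARAMS, TOKENS_PER_SEC, GPUS, PEAK_FLOPS_PER_GPU):.1%}")
```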

Training Loss

| Milestone | Step | Train Loss | Val Loss | Tokens Seen |
| --- | --- | --- | --- | --- |
| Final | 191,000 | 2.747 | 2.610 | 14.1B |

The AR training loss is standard cross-entropy (next-token prediction), directly comparable to other autoregressive Arabic LMs. The MDLM training loss uses a noise-weighted diffusion objective and is not numerically comparable to this value.
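For readers who prefer perplexity or bits per token, the reported losses convert as shown below. This assumes the losses are mean cross-entropy in nats per token, which is what PyTorch's cross-entropy returns by default; the resulting figures are per MorphBPE token.

```python
import math

TRAIN_LOSS = 2.747  # from the table, assumed to be nats per token
VAL_LOSS = 2.610


def perplexity(loss_nats: float) -> float:
    """Per-token perplexity for a mean cross-entropy given in nats."""
    return math.exp(loss_nats)


def bits_per_token(loss_nats: float) -> float:
    """Convert a loss in nats per token to bits per token."""
    return loss_nats / math.log(2)


if __name__ == "__main__":
    for name, loss in (("train", TRAIN_LOSS), ("val", VAL_LOSS)):
        print(f"{name}: ppl = {perplexity(loss):.2f}, bits/token = {bits_per_token(loss):.3f}")
```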

Tokenizer

This model uses MorphBPE, a morphology-aware BPE tokenizer designed for Arabic. It applies morphological segmentation before sub-word tokenization, so that Arabic morpheme boundaries (prefixes like ال، بال، وال and suffixes like ون، ين، ات، ها) are respected.

| Property | Value |
| --- | --- |
| Vocabulary size | 76,800 |
| Pad token ID | 5 |
| BOS token ID | 1 |
| EOS token ID | 2 |
| Mask token ID | 62 (kept for MDLM interface parity, unused during AR) |
| Load with | AutoTokenizer.from_pretrained("AraFus/AraFusion-AR-381M", subfolder="base_tokenizer") |
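To make the idea of segmenting before BPE concrete, here is a toy pre-segmentation pass that splits off the prefixes and suffixes named above. It is not the MorphBPE algorithm from morphbpe.py: the function name, the longest-match rule, and the minimum stem length are all hypothetical choices for illustration.

```python
PREFIXES = ["وال", "بال", "ال"]       # tried longest first
SUFFIXES = ["ون", "ين", "ات", "ها"]
MIN_STEM = 2  # hypothetical guard against over-splitting short words


def presegment(word: str) -> list:
    """Split off at most one listed prefix and one listed suffix."""
    parts = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            parts.append(prefix)
            word = word[len(prefix):]
            break
    suffix_part = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            suffix_part = suffix
            word = word[:-len(suffix)]
            break
    parts.append(word)  # the remaining stem
    if suffix_part:
        parts.append(suffix_part)
    return parts


if __name__ == "__main__":
    for example in ("والكتاب", "المعلمون", "كتاب"):
        print(presegment(example))
```

In a morphology-aware pipeline, BPE merges would then be learned and applied within each segment, so that no sub-word token straddles a morpheme boundary.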

Quick Start

pip install torch safetensors transformers huggingface_hub accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AraFus/AraFusion-AR-381M",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "AraFus/AraFusion-AR-381M",
    subfolder="base_tokenizer",
)
model.eval()

Autoregressive Generation (Top-p Nucleus Sampling)

The model generates left-to-right using nucleus sampling. Unlike MDLM, generation is sequential: each token is conditioned on all previously generated tokens.

import torch

device = next(model.parameters()).device

prompt = "الذكاء الاصطناعي هو"  # "Artificial intelligence is"
prompt_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)

generated = model.model.generate(
    prompt_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0].tolist(), skip_special_tokens=True))
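For reference, the temperature and top_p arguments used above correspond to the following sampling rule. This is a dependency-free sketch of standard nucleus sampling over a single logit vector, not the implementation behind model.model.generate; the function name sample_top_p is hypothetical.

```python
import math
import random
from typing import List, Optional


def sample_top_p(logits: List[float], temperature: float = 0.8, top_p: float = 0.9,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token ID from the smallest set of tokens whose mass reaches top_p."""
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, mass = [], 0.0
    for token_id in ranked:
        nucleus.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break

    # random.choices renormalises the kept probabilities implicitly.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Lower temperature sharpens the distribution before truncation, and lower top_p shrinks the candidate set; together they trade diversity for more conservative output.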

Using the HuggingFace generate Interface

from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=128,
    eos_token_id=2,
    pad_token_id=5,
)

inputs = tokenizer(
    "الذكاء الاصطناعي هو",  # "Artificial intelligence is"
    return_tensors="pt",
    add_special_tokens=False,
).to(device)

output_ids = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Batch Inference

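# Decoder-only generation expects left padding, so that every prompt ends
# right before its first generated token.
tokenizer.padding_side = "left"

texts = [
    "اللغة العربية هي",     # "The Arabic language is"
    "القاهرة مدينة",        # "Cairo is a city"
    "التعلم الآلي يستخدم",  # "Machine learning uses"
]

inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=64,
    add_special_tokens=False,
).to(device)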

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )

for ids in output_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
    print("---")

Perplexity Evaluation

import torch
import torch.nn.functional as F

def compute_perplexity(model, tokenizer, texts, device, max_length=512):
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    for text in texts:
        ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
        ids = ids[:, :max_length].to(device)

        with torch.no_grad():
            logits = model(ids).logits          # [1, L, V]
            shift_logits = logits[:, :-1, :]
            shift_labels = ids[:, 1:]
            loss = F.cross_entropy(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1),
                reduction="sum",
            )
            total_loss += loss.item()
            total_tokens += shift_labels.numel()

    return torch.exp(torch.tensor(total_loss / total_tokens)).item()

# Placeholder input: "The Arabic text to be evaluated goes here"
ppl = compute_perplexity(model, tokenizer, ["النص العربي المراد تقييمه هنا"], device)
print(f"Perplexity: {ppl:.2f}")

Files

| File | Description |
| --- | --- |
| model.safetensors | Model weights (~1.84 GB, safetensors format) |
| config.json | HuggingFace model configuration (ARAr381MConfig parameters) |
| configuration_ar.py | HF PretrainedConfig subclass; required for trust_remote_code=True loading |
| modeling_ar.py | HF PreTrainedModel subclass (ARAr381MForCausalLM); required for trust_remote_code=True loading |
| ar_baseline.py | Raw PyTorch model class (ARBaseline, ARConfig); standalone, no HF dependency |
| morphbpe.py | MorphBPE tokenizer class (morpheme-boundary-aware Arabic BPE) |
| morphbpe_config.json | MorphBPE tokenizer configuration |
| base_tokenizer/ | Underlying HuggingFace BPE tokenizer files |
| training_metadata.json | Training provenance: step, losses, hardware, upload timestamp |
| training_loss.png | Training loss curve (WandB export) |
| validation_loss.png | Validation loss curve (WandB export) |
| training_validation_loss.png | Combined training & validation loss |
| learning_rate.png | Learning rate schedule |

Citation

@misc{arafusion2026,
  title={AraFusion: Arabic Masked Diffusion Language Model with Classifier-Free Guidance},
  year={2026},
}

License

Apache 2.0
