Add README
README.md (ADDED)

---
tags:
- text-diffusion
- machine-translation
- en-de
- masked-diffusion
- from-scratch
language:
- en
- de
datasets:
- wmt/wmt14
license: apache-2.0
---

# Text Diffusion Model for EN→DE Translation

A **masked discrete diffusion** model for English-to-German machine translation, trained from scratch on WMT14 EN-DE.

## Architecture

| Component | Detail |
|---|---|
| **Type** | Masked Discrete Diffusion |
| **Backbone** | DiT (Diffusion Transformer) with adaLN |
| **Parameters** | ~72M |
| **Blocks** | 12 DiT blocks |
| **Hidden dim** | 512 |
| **Attention** | Bidirectional (no causal mask), 8 heads, with RoPE |
| **Conditioning** | Timestep via sinusoidal embeddings + adaLN; segment embeddings for src/tgt |
| **Weight tying** | Input embeddings tied to output projection |
| **Tokenizer** | [Helsinki-NLP/opus-mt-en-de](https://huggingface.co/Helsinki-NLP/opus-mt-en-de) (~58K vocab) |
| **Max sequence** | 128 src + 128 tgt tokens |

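For orientation, the table above maps onto a configuration roughly like the sketch below. The field names are illustrative; the actual attributes of `DiffusionTranslatorConfig` in `train.py` may differ.

```python
# Illustrative hyperparameter summary only; not the exact DiffusionTranslatorConfig API.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

model_hparams = dict(
    vocab_size=len(tokenizer),  # ~58K entries from the opus-mt-en-de tokenizer
    hidden_dim=512,
    num_heads=8,
    num_blocks=12,              # DiT blocks with adaLN timestep conditioning
    max_src_len=128,
    max_tgt_len=128,
    tie_embeddings=True,        # input embeddings tied to the output projection
)
```
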
### Inspired by
- **[MDLM](https://arxiv.org/abs/2406.07524)** — DiT backbone architecture, masked diffusion objective
- **[LLaDA](https://arxiv.org/abs/2502.09992)** — Conditional generation via SFT (keep prompt unmasked, mask only target), 1/t ELBO weighting
- **[DiNoiSer](https://arxiv.org/abs/2302.10025)** — Noise manipulation for conditional seq2seq diffusion

## How It Works

### Training (Forward Diffusion)
1. Source (EN) and target (DE) tokens are concatenated: `[source | target]`
2. A random masking rate `t ~ Uniform(0, 1)` is sampled per example
3. Each target token is independently masked with probability `t`
4. The bidirectional DiT predicts all masked tokens simultaneously
5. Loss = cross-entropy on masked positions only, weighted by `1/t` (continuous-time ELBO); see the sketch below

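A minimal sketch of this objective, assuming a `mask_id` token and a model called as `model(tokens, t)` that returns per-position logits. The names here are illustrative; the real objective lives in `train.py`.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, src_ids, tgt_ids, mask_id, eps=1e-3):
    """Forward diffusion + 1/t-weighted cross-entropy on masked target positions (sketch)."""
    B = tgt_ids.size(0)
    # Step 2: sample a masking rate t ~ Uniform(eps, 1) per example.
    t = torch.rand(B, device=tgt_ids.device) * (1 - eps) + eps            # (B,)

    # Step 3: mask each target token independently with probability t.
    mask = torch.rand(tgt_ids.shape, device=tgt_ids.device) < t[:, None]  # (B, T_tgt)
    noisy_tgt = torch.where(mask, torch.full_like(tgt_ids, mask_id), tgt_ids)

    # Steps 1 and 4: concatenate [source | noisy target], predict every position at once.
    x = torch.cat([src_ids, noisy_tgt], dim=1)
    logits = model(x, t)                                # (B, T_src + T_tgt, vocab)
    tgt_logits = logits[:, src_ids.size(1):]

    # Step 5: cross-entropy on masked positions only, weighted by 1/t.
    ce = F.cross_entropy(tgt_logits.transpose(1, 2), tgt_ids, reduction="none")  # (B, T_tgt)
    loss = (ce * mask / t[:, None]).sum() / mask.sum().clamp(min=1)
    return loss
```
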
### Inference (Reverse Diffusion)
1. Start with source tokens + fully masked target: `[source | MASK MASK ... MASK]`
2. Over 50 denoising steps, iteratively predict and unmask tokens
3. At each step `t → s`: predict all masked tokens, then randomly re-mask a fraction `s/t` of them, so the expected mask rate falls from `t` to `s`
4. Final step: all remaining masks are filled with predictions (see the sketch below)

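A corresponding sampling sketch, again with illustrative names; the actual `generate` function is defined in `train.py`.

```python
import torch

@torch.no_grad()
def sample_translation(model, src_ids, tgt_len, mask_id, num_steps=50):
    """Iterative unmasking sketch of the reverse process described above."""
    B = src_ids.size(0)
    tgt = torch.full((B, tgt_len), mask_id, dtype=torch.long, device=src_ids.device)

    for i in range(num_steps):
        # Walk the mask level down from t to s in equal intervals.
        t = 1.0 - i / num_steps
        s = 1.0 - (i + 1) / num_steps

        x = torch.cat([src_ids, tgt], dim=1)
        logits = model(x, torch.full((B,), t, device=src_ids.device))
        pred = logits[:, src_ids.size(1):].argmax(dim=-1)

        still_masked = tgt == mask_id
        if s <= 0:
            # Final step: fill every remaining mask with its prediction.
            tgt = torch.where(still_masked, pred, tgt)
        else:
            # Each masked position stays masked with probability s/t,
            # so the expected mask fraction shrinks from t to s.
            stay_masked = torch.rand(tgt.shape, device=tgt.device) < (s / t)
            tgt = torch.where(still_masked & ~stay_masked, pred, tgt)
    return tgt
```
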
## Training Details

| Setting | Value |
|---|---|
| **Dataset** | WMT14 EN-DE (~4.5M parallel sentence pairs) |
| **Optimizer** | AdamW (lr=3e-4, β₁=0.9, β₂=0.98, wd=0.01) |
| **Schedule** | Cosine with 4K linear warmup |
| **Effective batch size** | 256 (64 × 4 gradient accumulation) |
| **Max steps** | 200,000 |
| **Mixed precision** | FP16 |
| **Gradient clipping** | max_norm=1.0 |
| **Evaluation** | SacreBLEU on WMT14 test set every 20K steps |

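These settings correspond to a standard PyTorch setup along the lines of the sketch below (illustrative only; `train.py` is the source of truth, and the model here is a placeholder).

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(512, 512)  # stand-in for the real DiffusionTranslator

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=4_000, num_training_steps=200_000
)
scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

# Per optimizer step (after 4 accumulated micro-batches of 64):
#   scaler.scale(loss / 4).backward()   on each micro-batch, then
#   scaler.unscale_(optimizer)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   scaler.step(optimizer); scaler.update()
#   scheduler.step(); optimizer.zero_grad()
```
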
## Quick Start

### Install dependencies

```bash
pip install torch transformers datasets trackio sacrebleu sacremoses sentencepiece protobuf
```

### Train

```bash
git clone https://huggingface.co/vedkdev/text-diffusion-en-de
cd text-diffusion-en-de
python train.py
```

The script will:
- Download WMT14 EN-DE automatically
- Train for 200K steps with logging via [Trackio](https://huggingface.co/docs/trackio)
- Evaluate SacreBLEU periodically
- Push checkpoints to this repo

### Adjusting for your hardware

Edit the `TRAIN_CONFIG` dict in `train.py`; see the example after the table:

| GPU VRAM | Recommended `batch_size` | `gradient_accumulation_steps` |
|---|---|---|
| 24GB (A10G/3090/4090) | 64 | 4 |
| 16GB (T4/V100) | 32 | 8 |
| 12GB (3060) | 16 | 16 |
| 8GB (3070) | 8 | 32 |

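For example, on a 16GB card the change would look roughly like this. The key names follow the table headers; check `train.py` for the exact keys `TRAIN_CONFIG` actually uses.

```python
# In train.py -- illustrative keys; keeps the effective batch size at 256.
TRAIN_CONFIG = {
    # ... other settings unchanged ...
    "batch_size": 32,                   # micro-batch that fits in 16GB
    "gradient_accumulation_steps": 8,   # 32 x 8 = 256 effective
}
```
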
### Inference (after training)

```python
import torch, json
from train import DiffusionTranslator, DiffusionTranslatorConfig, generate
from transformers import AutoTokenizer

# Load checkpoint
config = DiffusionTranslatorConfig(**json.load(open("checkpoints/best/config.json")))
model = DiffusionTranslator(config)
model.load_state_dict(torch.load("checkpoints/best/model.pt", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained("checkpoints/best/")

# Translate
text = "The weather is nice today."
src = tokenizer(f"translate English to German: {text}",
                max_length=128, truncation=True, padding="max_length",
                return_tensors="pt")

gen_ids = generate(model, src["input_ids"], torch.zeros_like(src["input_ids"]),
                   config, num_steps=50, device="cpu")
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```

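To score generated translations with SacreBLEU (the metric used for evaluation above), the `sacrebleu` package from the dependency list can be used directly; a minimal example:

```python
import sacrebleu

# One hypothesis per source sentence; references is a list of reference streams.
hypotheses = ["Das Wetter ist heute schön."]
references = [["Das Wetter ist heute schön."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```
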
## Expected Results

Based on published literature for similar architectures on WMT14 EN→DE:

| Model | BLEU | Reference |
|---|---|---|
| Autoregressive Transformer | ~27 | Vaswani et al. |
| DiNoiSer (continuous diffusion) | 24.6 | Ye et al. 2023 |
| SeqDiffuSeq | 19.8 | Yuan et al. 2022 |
| E2D2 (discrete diffusion) | 24.8 | Kuleshov et al. 2024 |
| **This model (target)** | **15-20** | ~72M params, no KD |

> Note: Text diffusion models typically score 2-5 BLEU below autoregressive transformers of similar size. Knowledge distillation (KD) from an AR teacher can close the gap by ~1-2 BLEU.

## Citation

If you use this model, please cite the foundational papers:

```bibtex
@article{sahoo2024mdlm,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Kuleshov, Volodymyr},
  journal={NeurIPS},
  year={2024}
}

@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and Zhu, Fengqi and You, Chao and Zhang, Xiaojun and Ou, Zhenguo and Zhu, Jun},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}

@article{ye2023dinoiser,
  title={DiNoiSer: Diffused Conditional Sequence Learning by Manipulating Noises},
  author={Ye, Jiasheng and Zheng, Zaixiang and Bao, Yu and Qian, Lihua and Gu, Quanquan},
  journal={ACL},
  year={2023}
}
```