Update README.md
---
license: apache-2.0
language:
- en
base_model:
- GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
tags:
- diffusion-language-model
- quantization
library_name: transformers
---
# LLaDA-8B-Quantized

**World's first INT8 and INT4 weight-only quantization for [LLaDA](https://github.com/ML-GSAI/LLaDA), a masked diffusion large language model trained from scratch at the 8B scale.**

> Code & full documentation: [github.com/qubitronlabsdev/llada-quantization](https://github.com/qubitronlabsdev/llada-quantization)

---

## Model Description

LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens **in parallel** through iterative masked denoising, unlike autoregressive models (GPT, LLaMA), which generate one token at a time.
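
For intuition, the sketch below shows a heavily simplified version of such a denoising loop. It is illustrative only, not this repository's implementation: the confidence-based remasking schedule is reduced to its simplest form, `mask_id` is LLaDA's mask token id, and the model is assumed to return HF-style outputs with `.logits`.

```python
import torch

def denoise_sketch(model, prompt_ids, gen_length=64, steps=8, mask_id=126336):
    # Start with the whole response masked (batch size 1 for brevity).
    x = torch.cat([prompt_ids, torch.full((1, gen_length), mask_id)], dim=1)
    for step in range(steps):
        logits = model(x).logits                   # predict all positions at once
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = x == mask_id
        conf = conf.masked_fill(~masked, -1.0)     # rank only masked positions
        # Commit a growing fraction of the remaining masks each round;
        # everything else stays masked for the next iteration.
        keep = int(masked.sum().item() * (step + 1) / steps)
        idx = conf.flatten().topk(keep).indices
        x.view(-1)[idx] = pred.view(-1)[idx]
    return x
```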
This repository provides two post-training quantized variants of `GSAI-ML/LLaDA-8B-Instruct`:

| File | Quantization | Size | Memory Saved | Speed (A100) |
|---|---|---|---|---|
| `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | **47%** | **9.64 tok/s** |
| `llada_int4_quantized.pt` | INT4 packed | 5.82 GB | **64%** | 3.39 tok/s |

Original model (bfloat16): 16.13 GB
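
As a rough cross-check of the table, the per-weight byte counts predict sizes in the right ballpark (a back-of-envelope sketch, assuming ~8B quantizable weights; the published files differ because per-row scales are stored in float32 and some tensors may be left in higher precision):

```python
# Back-of-envelope checkpoint sizes, assuming ~8.0e9 quantized weights.
# Real files differ: float32 per-row scales add overhead, and some tensors
# (e.g. embeddings) may be left unquantized.
params = 8.0e9
for name, bytes_per_weight in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_weight / 1e9:.1f} GB")
```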
---

## How It Works

All `nn.Linear` layers are replaced with custom quantized layers (sketched below):

- **INT8**: weights scaled per-row to `[-127, 127]` integers. Scale factors stored in float32. ~1 byte per weight.
- **INT4**: weights scaled per-row to `[-8, 7]` integers. Two values packed per byte (uint8). ~0.5 bytes per weight.

Both variants dequantize weights on the fly during the forward pass; no changes to the model architecture or generation logic are needed.
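
A minimal sketch of the two schemes (illustrative, following the per-row symmetric scaling described above; not the repository's exact implementation). A quantized layer's forward pass would then dequantize and call `F.linear(x, w_dequant, bias)`.

```python
import torch

def quantize_int8_per_row(w: torch.Tensor):
    """Symmetric per-row INT8: one float32 scale per output row."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale                               # dequant: q.float() * scale

def quantize_int4_per_row(w: torch.Tensor):
    """Symmetric per-row INT4, two values packed into each uint8 byte."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = (torch.round(w / scale).clamp(-8, 7) + 8).to(torch.uint8)  # shift to [0, 15]
    packed = (q[:, 0::2] << 4) | q[:, 1::2]       # assumes an even number of columns
    return packed, scale

def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Unpack two nibbles per byte, undo the shift, and rescale."""
    hi = (packed >> 4).to(torch.int8) - 8
    lo = (packed & 0x0F).to(torch.int8) - 8
    q = torch.stack([hi, lo], dim=-1).flatten(1)  # restore original column order
    return q.float() * scale
```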
---

## Usage

### Installation

```bash
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
```
### Load and Generate

```python
from inference import load_quantized, generate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True
)

# Download the weights from this repo first, then load one of the two variants:

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda"
)

# INT4 (alternatively)
model = load_quantized(
    "llada_int4_quantized.pt",
    mode="int4",
    device="cuda"
)

output = generate(model, tokenizer, "What is machine learning?")
print(output)
```
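
If you are pulling the checkpoints from the Hub rather than downloading them by hand, `huggingface_hub` can fetch them; a sketch continuing the snippet above (the `repo_id` is a placeholder for this model repository's actual id):

```python
from huggingface_hub import hf_hub_download

# Placeholder repo_id: substitute this model repository's actual id.
ckpt_path = hf_hub_download(
    repo_id="<this-repo-id>/LLaDA-8B-Quantized",
    filename="llada_int8_quantized.pt",
)
model = load_quantized(ckpt_path, mode="int8", device="cuda")
```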
### Quantize from Scratch

```python
from quantize import run_and_save

run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
```
---

## Hardware Requirements

| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |

Tested on: NVIDIA A100 80 GB and NVIDIA H100.
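
A quick way to compare your GPU against the table (standard PyTorch API):

```python
import torch

# Report total GPU memory to compare against the Min VRAM column above.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB")
```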
---

## Limitations

- INT4 introduces slightly more quantization error than INT8
- Generation speed depends on sequence length and the number of diffusion steps
- English only (inherited from the base model)
---

## Citation

If you use this work, please cite:

```bibtex
@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}
```

Original LLaDA paper:

```bibtex
@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}
```

---

## License

Apache 2.0, the same license as the original LLaDA model.