qubitron committed on
Commit
cf308b2
·
verified ·
1 Parent(s): 34792eb

Update README.md

Files changed (1)
  1. README.md +146 -3
README.md CHANGED
@@ -1,3 +1,146 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - GSAI-ML/LLaDA-8B-Instruct
7
+ pipeline_tag: text-generation
8
+ tags:
9
+ - diffusion-language-model
10
+ - quantization
11
+ library_name: transformers
12
+ ---
13
+ # LLaDA-8B-Quantized
14
+
15
+ **World's first INT8 and INT4 weight-only quantization for [LLaDA](https://github.com/ML-GSAI/LLaDA) — a masked diffusion large language model trained from scratch at 8B scale.**
16
+
17
+ > Code & full documentation: [github.com/qubitronlabsdev/llada-quantization](https://github.com/qubitronlabsdev/llada-quantization)
18
+
19
+ ---
20
+
21
+ ## Model Description
22
+
23
+ LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens **in parallel** via iterative masked denoising — unlike autoregressive models (GPT, LLaMA) that generate one token at a time.
24
+
25
+ This repository provides two post-training quantized variants of `GSAI-ML/LLaDA-8B-Instruct`:
26
+
27
+ | File | Quantization | Size | Memory Saved | Speed (A100) |
28
+ |---|---|---|---|---|
29
+ | `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | **47%** | **9.64 tok/s** |
30
+ | `llada_int4_quantized.pt` | INT4 packed | 5.82 GB | **64%** | 3.39 tok/s |
31
+
32
+ Original model (bfloat16): 16.13 GB
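
The "Memory Saved" column follows from the file sizes in the table and the 16.13 GB bfloat16 baseline. The snippet below is a small sketch that reproduces the percentages; the helper name `memory_saved_percent` is hypothetical and not part of the repository.

```python
# Sizes in GB, taken from the table above.
ORIGINAL_GB = 16.13  # bfloat16 checkpoint


def memory_saved_percent(quantized_gb: float, original_gb: float = ORIGINAL_GB) -> float:
    """Return the percentage reduction in checkpoint size."""
    return (1.0 - quantized_gb / original_gb) * 100.0


int8_saved = memory_saved_percent(8.54)
int4_saved = memory_saved_percent(5.82)
print(f"INT8: {int8_saved:.0f}% saved, INT4: {int4_saved:.0f}% saved")
# prints "INT8: 47% saved, INT4: 64% saved"
```

The INT8 file is larger than 1 byte × 8B parameters would suggest, which is consistent with only the `nn.Linear` weights being quantized and per-row scale factors being stored alongside them.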
33
+
34
+ ---
35
+
36
+ ## How It Works
37
+
38
+ All `nn.Linear` layers are replaced with custom quantized layers:
39
+
40
+ - **INT8** — weights scaled per-row to `[-127, 127]` integers. Scale factors stored in float32. ~1 byte per weight.
41
+ - **INT4** — weights scaled per-row to `[-8, 7]` integers. Two values packed per byte (uint8). ~0.5 bytes per weight.
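
The two schemes in the list above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: it uses plain Python lists instead of tensors so it runs without dependencies, and it assumes symmetric scaling with `scale = max(|w|) / qmax` and low-nibble-first packing. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_row(row: List[float], qmin: int, qmax: int) -> Tuple[List[int], float]:
    """Symmetric per-row quantization: one scale per row, values clamped to [qmin, qmax]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # assumption: scale chosen from qmax
    q = [max(qmin, min(qmax, round(w / scale))) for w in row]
    return q, scale


def pack_int4(values: List[int]) -> bytes:
    """Pack pairs of signed 4-bit values into single bytes (low nibble first)."""
    if len(values) % 2:
        values = values + [0]  # pad odd-length rows without mutating the input
    return bytes(((hi & 0xF) << 4) | (lo & 0xF)
                 for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes, count: int) -> List[int]:
    """Recover signed 4-bit values from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # two's complement
    return out[:count]


row = [0.5, -1.0, 0.25, 0.0]
q8, s8 = quantize_row(row, -127, 127)   # INT8: one integer per weight
q4, s4 = quantize_row(row, -8, 7)       # INT4: two integers per byte once packed
packed = pack_int4(q4)
restored = unpack_int4(packed, len(q4))
```

Here four INT4 values occupy two bytes, which is where the ~0.5 bytes per weight figure comes from.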
42
+
43
+ Both variants dequantize weights on-the-fly during the forward pass. No changes to model architecture or generation logic.
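
To make "dequantize on-the-fly" concrete, here is a toy sketch of a linear layer that stores only INT8 rows plus one float scale per row and rebuilds each weight inside the forward pass. The class name `Int8Linear` is hypothetical, there is no bias term, and plain Python lists stand in for tensors; the repository's custom layers may be structured differently.

```python
from typing import List


class Int8Linear:
    """Toy stand-in for a quantized linear layer (hypothetical, no bias)."""

    def __init__(self, weight: List[List[float]]):
        self.q_rows: List[List[int]] = []
        self.scales: List[float] = []
        for row in weight:
            max_abs = max(abs(w) for w in row)
            scale = max_abs / 127 if max_abs > 0 else 1.0
            self.scales.append(scale)
            self.q_rows.append([max(-127, min(127, round(w / scale))) for w in row])

    def forward(self, x: List[float]) -> List[float]:
        out = []
        for q_row, scale in zip(self.q_rows, self.scales):
            # Each weight is dequantized (q * scale) only while computing this dot product.
            out.append(sum((q * scale) * xi for q, xi in zip(q_row, x)))
        return out


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
x = [1.0, 2.0, 3.0]

layer = Int8Linear(weight)
y_quant = layer.forward(x)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
```

Comparing `y_quant` with `y_ref` shows the quantization error directly: each weight is off by at most half a scale step, so the outputs agree closely while the stored weights take roughly one byte each.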
44
+
45
+ ---
46
+
47
+ ## Usage
48
+
49
+ ### Installation
50
+
51
+ ```bash
52
+ git clone https://github.com/qubitronlabsdev/llada-quantization
53
+ cd llada-quantization
54
+ pip install -r requirements.txt
55
+ ```
56
+
57
+ ### Load and Generate
58
+
59
+ ```python
60
+ from inference import load_quantized, generate
61
+ from transformers import AutoTokenizer
62
+
63
+ tokenizer = AutoTokenizer.from_pretrained(
64
+ "GSAI-ML/LLaDA-8B-Instruct",
65
+ trust_remote_code=True
66
+ )
67
+
68
+ # Download weights from this repo first, then:
69
+
70
+ # INT8 (load either this variant or INT4 below; a second call rebinds `model`)
71
+ model = load_quantized(
72
+ "llada_int8_quantized.pt",
73
+ mode="int8",
74
+ device="cuda"
75
+ )
76
+
77
+ # INT4 (alternative to the INT8 call above; use one or the other)
78
+ model = load_quantized(
79
+ "llada_int4_quantized.pt",
80
+ mode="int4",
81
+ device="cuda"
82
+ )
83
+
84
+ output = generate(model, tokenizer, "What is machine learning?")
85
+ print(output)
86
+ ```
87
+
88
+ ### Quantize from Scratch
89
+
90
+ ```python
91
+ from quantize import run_and_save
92
+
93
+ run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
94
+ run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
95
+ ```
96
+
97
+ ---
98
+
99
+ ## Hardware Requirements
100
+
101
+ | Variant | Min VRAM | Recommended |
102
+ |---|---|---|
103
+ | INT8 | 12 GB | A100 / H100 |
104
+ | INT4 | 8 GB | RTX 3090 / A100 |
105
+
106
+ Tested on: NVIDIA A100 80GB, NVIDIA H100
107
+
108
+ ---
109
+
110
+ ## Limitations
111
+
112
+ - INT4 introduces slightly more quantization error than INT8
113
+ - Generation speed depends on sequence length and number of diffusion steps
114
+ - English only (inherited from base model)
115
+
116
+ ---
117
+
118
+ ## Citation
119
+
120
+ If you use this work, please cite:
121
+
122
+ ```bibtex
123
+ @misc{llada-quantization-2026,
124
+ title = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
125
+ author = {Dhiraj Choudhary},
126
+ year = {2026},
127
+ url = {https://github.com/qubitronlabsdev/llada-quantization}
128
+ }
129
+ ```
130
+
131
+ Original LLaDA paper:
132
+
133
+ ```bibtex
134
+ @article{nie2025large,
135
+ title = {Large Language Diffusion Models},
136
+ author = {Nie, Shen and others},
137
+ year = {2025},
138
+ url = {https://arxiv.org/abs/2502.09992}
139
+ }
140
+ ```
141
+
142
+ ---
143
+
144
+ ## License
145
+
146
+ Apache 2.0 — same as the original LLaDA model.