Ideogram 4.0: fused INT8 (Transformer Lab)
A fused INT8 GEMM kernel for the Ideogram 4.0 diffusion transformer that makes the INT8 W8A8 build run on the GPU's native INT8 tensor cores. On a single RTX 3090 it generates a 1024px image in 156.5 s, faster than the published FP8 (172.9 s, needs two GPUs) and NF4 (164.5 s) variants, while holding quality at the FP8 ceiling.
What's in this repo: the fused INT8 kernel (triton_int8_gemm.py) + the loader that installs it (fused_int8.py), plus usage.py / download_deps.py. It does not carry the weights: it loads the INT8 W8A8 weights from transformerlab/ideogram-4-int8-w8a8 and the text encoder + VAE + inference code from the gated base repo ideogram-ai/ideogram-4-fp8.
Why this one
The INT8 W8A8 build holds the FP8 quality ceiling and beats NF4, but the standard quantized
forward path quantizes then dequantizes activations and weights back to bf16 and runs a
bf16 matmul. The INT8 tensor cores are never used, so "INT8" ends up slower than FP8/NF4
on consumer Ampere (which has no FP8 tensor cores). This kernel fixes that: it runs the
matmul as int8 × int8 → int32 on Ampere mma.s8 units and folds the per-token (activation)
× per-channel (weight) dequantization and bias into the GEMM epilogue, so a quantized linear
is a single fused kernel launch. The result is that INT8 becomes the fastest variant on a
3090, and 1024px generation fits on a single 24 GB card (FP8/BF16 need two).
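The arithmetic described above can be illustrated with a dependency-free Python reference. This is a minimal sketch, not the Triton kernel: the function names (`quantize_rows`, `w8a8_linear`) and the symmetric absmax/127 scaling are assumptions made for illustration, and in the real build the weights arrive pre-quantized rather than being quantized at call time.

```python
def quantize_rows(mat):
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in mat:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric int8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w, bias):
    """Reference W8A8 linear: x is [tokens][in], w is [out][in], bias is [out]."""
    xq, x_scale = quantize_rows(x)  # one scale per token (activation row)
    wq, w_scale = quantize_rows(w)  # one scale per output channel (weight row)
    out = []
    for t, x_row in enumerate(xq):
        out_row = []
        for c, w_row in enumerate(wq):
            # int8 x int8 products summed exactly as integers (the int32 role).
            acc = sum(a * b for a, b in zip(x_row, w_row))
            # "Epilogue": per-token x per-channel dequantization plus bias.
            out_row.append(acc * x_scale[t] * w_scale[c] + bias[c])
        out.append(out_row)
    return out
```

In the fused kernel the last line of the inner loop is what runs in the GEMM epilogue, so no bf16 copy of the activations or weights is materialized.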
Numbers (RTX 3090, 1024px, 48 steps)
| Variant | s/image | GPUs |
|---|---|---|
| Fused INT8 (this kernel) | 156.5 | 1 |
| NF4 (published) | 164.5 | 1 |
| FP8 (published) | 172.9 | 2 |
| INT8 W8A8 without the fused kernel | 184–185 | 2 |
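To put the table in relative terms, the ratios can be computed directly. This is a small sketch that only reuses the latencies listed above; the dictionary keys and the `speedup_over` helper are names chosen here, and the unfused build uses the lower end of its 184-185 s range.

```python
# Latencies in seconds per image, copied from the table above.
latency_s = {
    "Fused INT8": 156.5,
    "NF4": 164.5,
    "FP8": 172.9,
    "INT8 W8A8 (unfused)": 184.0,  # lower end of the listed range
}


def speedup_over(baseline, candidate="Fused INT8"):
    """Return how many times faster `candidate` is than `baseline`."""
    return latency_s[baseline] / latency_s[candidate]


for name in ("NF4", "FP8", "INT8 W8A8 (unfused)"):
    print(f"{name}: {speedup_over(name):.3f}x")
```

The margin over NF4 comes out at roughly 5%, which is small enough that a single run per variant cannot separate it from noise.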
Per-GEMM, the fused kernel is ~2.8–4.2× faster than bf16 and bit-exact in its integer
accumulation against torch._int_mm. Quality matches the unfused INT8 build on PickScore /
CLIPScore (point estimates). Latencies are single-run measurements; small margins (e.g. vs
NF4) are within unquantified run-to-run variance.
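Since the latencies above are single runs, anyone reproducing them may want to quantify the variance. The following is a minimal sketch of a repeat-timing helper using only the standard library; `time_runs` is a name chosen here, and `fn` stands in for whatever callable wraps one full image generation on your setup (it is not part of this repo).

```python
import statistics
import time


def time_runs(fn, repeats=5, warmup=1):
    """Call fn() `warmup` times untimed, then `repeats` times timed.

    Returns (mean_seconds, stdev_seconds) over the timed runs. For GPU work,
    fn must block until the device has finished (for example by calling
    torch.cuda.synchronize() before returning), or the timings are meaningless.
    """
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as kernel autotuning
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.mean(samples), stdev
```

A difference between two variants is only meaningful if it is clearly larger than the standard deviations this reports for each of them.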
Scope: the speedup is specific to consumer Ampere (RTX 3090: fast native INT8, no
fast FP8/bf16 alternative). On datacenter cards with fast native bf16/FP8 (A100, B200) a
plain bf16 matmul is faster, so use the fused kernel where there is no fast native
low-precision matmul. The kernel is autotuned for sm_86; retuning is needed elsewhere.
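The scope note can be turned into a simple dispatch rule. This is a hedged sketch, not code from this repo: `pick_matmul_path` and its return labels are hypothetical, and the mapping only encodes the cards named above (RTX 3090 as sm_86, A100 as sm_80, B200 as sm_100). Anything else is treated as unknown.

```python
def pick_matmul_path(capability):
    """Map a CUDA compute capability (major, minor) to a suggested matmul path.

    With PyTorch installed, the tuple can be obtained from
    torch.cuda.get_device_capability().
    """
    if capability == (8, 6):
        # Consumer Ampere such as the RTX 3090: the kernel is autotuned here.
        return "fused-int8"
    if capability in {(8, 0), (10, 0)}:
        # A100 and B200: the text above reports plain bf16 as faster.
        return "bf16"
    # Not covered by the text above: retune the kernel and measure both paths.
    return "benchmark-first"
```

For example, `pick_matmul_path((8, 6))` returns `"fused-int8"`, while an architecture not listed above falls through to `"benchmark-first"`.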
How to run
# 1) one-time: install ideogram4 + triton, fetch base components + the INT8 weights
# (needs your own access to the gated repos ideogram-ai/ideogram-4-fp8 and
# transformerlab/ideogram-4-int8-w8a8)
python download_deps.py
# 2) generate (single 24 GB Ampere card, e.g. RTX 3090). Ideogram 4 expects a
# structured JSON caption, NOT a raw string (see "Prompt format" below):
python usage.py '{"high_level_description":"A graphic-design poster with the word \"HELLO\" in large bold lettering, centered on a solid background.","compositional_deconstruction":{"background":"A flat solid-color poster background with even, neutral studio lighting.","elements":[{"type":"text","bbox":[380,250,620,780],"text":"HELLO","desc":"the word HELLO in large bold sans-serif uppercase, centered, high contrast against the background"}]}}'
Prompt format
Ideogram 4 is trained on structured JSON captions and validates each prompt against a
schema before generation; raw natural-language strings render in-image text poorly. Pass a
schema-valid caption: a high_level_description plus a compositional_deconstruction
(a scene-shell background and an elements list). Put any in-image text in a text
element carrying the verbatim string:
{
  "high_level_description": "A graphic-design poster with the word 'HELLO' in large bold lettering, centered on a solid background.",
  "compositional_deconstruction": {
    "background": "A flat solid-color poster background with even, neutral studio lighting.",
    "elements": [
      {
        "type": "text",
        "bbox": [380, 250, 620, 780],
        "text": "HELLO",
        "desc": "the word HELLO in large bold sans-serif uppercase, centered, high contrast against the background"
      }
    ]
  }
}
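Building the caption in code avoids hand-escaping quotes on the command line. The sketch below is an illustration under stated assumptions: `make_text_caption` and `check_caption` are hypothetical helpers, and the check covers only the fields shown in the example above, not the full schema that Ideogram 4 validates against.

```python
import json


def make_text_caption(description, background, text, bbox, desc):
    """Build a caption with a single text element, serialized as a JSON string."""
    caption = {
        "high_level_description": description,
        "compositional_deconstruction": {
            "background": background,
            "elements": [
                {"type": "text", "bbox": list(bbox), "text": text, "desc": desc}
            ],
        },
    }
    return json.dumps(caption)


def check_caption(caption_json):
    """Minimal structural check (an assumption, not the real schema)."""
    data = json.loads(caption_json)  # raises ValueError on non-JSON input
    if not isinstance(data, dict):
        raise ValueError("caption must be a JSON object, not a raw string")
    if not isinstance(data.get("high_level_description"), str):
        raise ValueError("missing high_level_description")
    decon = data.get("compositional_deconstruction")
    if not isinstance(decon, dict) or not isinstance(decon.get("background"), str):
        raise ValueError("missing compositional_deconstruction.background")
    if not isinstance(decon.get("elements"), list):
        raise ValueError("missing compositional_deconstruction.elements")
    for element in decon["elements"]:
        if element.get("type") == "text" and not element.get("text"):
            raise ValueError("text element needs the verbatim string in 'text'")
    return data
```

The resulting string can be handed to usage.py as its single argument without shell quoting, for example with `subprocess.run([sys.executable, "usage.py", caption], check=True)`.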
Files here:
- triton_int8_gemm.py: the fused INT8 GEMM (autotuned per shape).
- fused_int8.py: loads the INT8 W8A8 weights and installs the fused kernel on the DiT linears.
- usage.py, download_deps.py: a minimal end-to-end example + setup.
Reference implementation: the kernel math is validated (bit-exact integer accumulation), but verify end to end on your stack before production use. Requires an Ampere GPU with INT8 tensor cores and triton.
License
Derived from Ideogram 4.0 under its non-commercial, research-only license. The INT8 weights inherit the same terms; this repo is distributed for research use only.