Ideogram 4.0 — fused INT8 (Transformer Lab)

A fused INT8 GEMM kernel for the Ideogram 4.0 diffusion transformer that runs the INT8 W8A8 build on the GPU's native INT8 tensor cores. On a single RTX 3090 it generates a 1024px image in 156.5 s, faster than the published FP8 (172.9 s, needs two GPUs) and NF4 (164.5 s) variants, while matching the quality of the unfused INT8 build, which sits at the FP8 quality ceiling.

What's in this repo: the fused INT8 kernel (triton_int8_gemm.py) + the loader that installs it (fused_int8.py), plus usage.py / download_deps.py. It does not carry the weights: it loads the INT8 W8A8 weights from transformerlab/ideogram-4-int8-w8a8 and the text encoder + VAE + inference code from the gated base repo ideogram-ai/ideogram-4-fp8.

Why this one

The INT8 W8A8 build holds the FP8 quality ceiling and beats NF4, but the standard quantized forward path quantizes, then dequantizes activations and weights back to bf16, and runs a bf16 matmul. The INT8 tensor cores are never used, so "INT8" ends up slower than FP8/NF4 on consumer Ampere (which has no FP8 tensor cores). This kernel fixes that: it runs the matmul as int8 × int8 → int32 on Ampere mma.s8 units and folds the per-token (activation) × per-channel (weight) dequantization and the bias add into the GEMM epilogue, so a quantized linear is a single fused kernel launch. As a result, INT8 becomes the fastest variant on a 3090, and 1024px generation fits on a single 24 GB card (FP8 and BF16 need two).
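The arithmetic that the fused kernel performs can be sketched in plain Python. This is a slow reference for the math only, not the Triton kernel in triton_int8_gemm.py. The function names, the symmetric absmax/127 scaling, and the choice to quantize the weights on the fly are assumptions made for illustration; the real build loads pre-quantized weights and may use a different scale convention.

```python
def quantize_rows(matrix):
    """Symmetric int8 quantization with one scale per row (assumed: absmax / 127)."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w, bias):
    """x: [tokens][in], w: [out][in], bias: [out] -> y: [tokens][out]."""
    xq, x_scale = quantize_rows(x)  # one scale per token (activation row)
    wq, w_scale = quantize_rows(w)  # one scale per output channel (weight row)
    y = []
    for t, x_row in enumerate(xq):
        out_row = []
        for c, w_row in enumerate(wq):
            # int8 x int8 products accumulated as an exact integer
            acc = sum(a * b for a, b in zip(x_row, w_row))
            # "epilogue": per-token x per-channel dequantization, then bias
            out_row.append(acc * x_scale[t] * w_scale[c] + bias[c])
        y.append(out_row)
    return y


if __name__ == "__main__":
    x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
    w = [[0.2, -0.4, 0.1], [1.0, 0.5, -0.5]]
    bias = [0.1, -0.2]
    print(w8a8_linear(x, w, bias))
```

In the sketch the dequantization and bias are applied right where the integer accumulator is produced, which is the property the fused kernel relies on: no intermediate bf16 copy of the activations or weights is ever materialized.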

Numbers (RTX 3090, 1024px, 48 steps)

Variant	s/image	GPUs
Fused INT8 (this kernel)	156.5	1
NF4 (published)	164.5	1
FP8 (published)	172.9	2
INT8 W8A8 without the fused kernel	184–185	2

Per-GEMM, the fused kernel is ~2.8–4.2× faster than bf16 and bit-exact in its integer accumulation against torch._int_mm. Quality matches the unfused INT8 build on PickScore / CLIPScore (point estimates). Latencies are single-run measurements; small margins (e.g. vs NF4) are within unquantified run-to-run variance.
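A bit-exact check is a meaningful target for the integer part because integer accumulation does not depend on how the reduction dimension is tiled or ordered. The sketch below illustrates that property in plain Python; the helper names and the reduction length K = 4096 are illustrative assumptions, not the model's actual layer sizes or the repo's test code.

```python
import random


def int_dot(a, b):
    """Exact integer dot product of two int8-valued vectors."""
    return sum(x * y for x, y in zip(a, b))


def int_dot_tiled(a, b, tile):
    """Same dot product, accumulated tile by tile as a blocked GEMM would."""
    total = 0
    for start in range(0, len(a), tile):
        total += int_dot(a[start:start + tile], b[start:start + tile])
    return total


rng = random.Random(0)
K = 4096  # assumed reduction length, for illustration only
a = [rng.randint(-127, 127) for _ in range(K)]
b = [rng.randint(-127, 127) for _ in range(K)]

reference = int_dot(a, b)
# Any tile size gives the identical integer, so exact equality is a fair test.
assert all(int_dot_tiled(a, b, t) == reference for t in (16, 32, 64, 128))
# The worst-case magnitude for this K still fits a signed 32-bit accumulator.
assert K * 127 * 127 < 2**31
```

Floating-point sums, by contrast, can change with accumulation order, so only the int32 accumulator is compared exactly; the dequantized output is better compared with a tolerance.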

Scope: the speedup is specific to consumer Ampere (RTX 3090: fast native INT8, no fast FP8/bf16 alternative). On datacenter cards with fast native bf16 or FP8 (e.g. A100, B200) a plain bf16 matmul is faster, so use the fused kernel only where there is no fast native bf16 or FP8 matmul. The kernel is autotuned for sm_86; retuning is needed elsewhere.

How to run

# 1) one-time: install ideogram4 + triton, fetch base components + the INT8 weights
#    (needs your own access to the gated repos ideogram-ai/ideogram-4-fp8 and
#     transformerlab/ideogram-4-int8-w8a8)
python download_deps.py

# 2) generate (single 24 GB Ampere card, e.g. RTX 3090). Ideogram 4 expects a
#    structured JSON caption, NOT a raw string (see "Prompt format" below):
python usage.py '{"high_level_description":"A graphic-design poster with the word \"HELLO\" in large bold lettering, centered on a solid background.","compositional_deconstruction":{"background":"A flat solid-color poster background with even, neutral studio lighting.","elements":[{"type":"text","bbox":[380,250,620,780],"text":"HELLO","desc":"the word HELLO in large bold sans-serif uppercase, centered, high contrast against the background"}]}}'

Prompt format

Ideogram 4 is trained on structured JSON captions and validates each prompt against a schema before generation; raw natural-language strings render in-image text poorly. Pass a schema-valid caption — a high_level_description plus a compositional_deconstruction (a scene-shell background and an elements list). Put any in-image text in a text element carrying the verbatim string:

{
  "high_level_description": "A graphic-design poster with the word 'HELLO' in large bold lettering, centered on a solid background.",
  "compositional_deconstruction": {
    "background": "A flat solid-color poster background with even, neutral studio lighting.",
    "elements": [
      {
        "type": "text",
        "bbox": [380, 250, 620, 780],
        "text": "HELLO",
        "desc": "the word HELLO in large bold sans-serif uppercase, centered, high contrast against the background"
      }
    ]
  }
}
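Hand-escaping the JSON for the shell is error-prone, so the caption can be built as a Python dict and serialized. The sketch below uses only the fields shown in the example above; the helper name make_text_poster_caption is hypothetical, and the full schema may accept or require fields that this README does not list.

```python
import json
import shlex


def make_text_poster_caption(text, bbox):
    """Build a caption dict with the fields shown in the README example."""
    return {
        "high_level_description": (
            f'A graphic-design poster with the word "{text}" in large bold '
            "lettering, centered on a solid background."
        ),
        "compositional_deconstruction": {
            "background": (
                "A flat solid-color poster background with even, "
                "neutral studio lighting."
            ),
            "elements": [
                {
                    "type": "text",
                    # bbox values are copied from the README example; the
                    # coordinate convention is not documented here.
                    "bbox": list(bbox),
                    "text": text,  # verbatim in-image string
                    "desc": (
                        f"the word {text} in large bold sans-serif uppercase, "
                        "centered, high contrast against the background"
                    ),
                }
            ],
        },
    }


caption = make_text_poster_caption("HELLO", [380, 250, 620, 780])
prompt = json.dumps(caption)  # handles the inner quote escaping
command = f"python usage.py {shlex.quote(prompt)}"  # safe single-argument quoting
print(command)
```

json.dumps guarantees syntactically valid JSON, but it does not check the caption against the model's schema; that validation still happens before generation.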

Files here:

triton_int8_gemm.py — the fused INT8 GEMM (autotuned per shape).
fused_int8.py — loads the INT8 W8A8 weights and installs the fused kernel on the DiT linears.
usage.py, download_deps.py — a minimal end-to-end example + setup.

Reference implementation: the kernel math is validated (bit-exact integer accumulation), but verify end to end on your stack before production use. Requires triton and an Ampere GPU with INT8 tensor cores.

License

Derived from Ideogram 4.0 under its non-commercial, research-only license. The INT8 weights inherit the same terms; this repo is distributed for research use only.

