FLUX.2 klein-4B — 1-step text-to-image (RDM distilled)

A single-step text-to-image generator distilled from the 4-step FLUX.2 klein-4B teacher via Representation Distribution Matching (RDM) — a multi-encoder Nyström-MMD distribution-matching objective over a curated teacher reference. One forward pass at 512² (≈0.15–0.3 s/image), no iterative sampling.

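This 1-step student matches or exceeds its 4-step teacher on all three evaluation axes: it is ahead on GenEval composition (standard mmdet-based evaluation) and on PickScore, a human-preference proxy, for COCO-val prompts, and at parity on PickScore for Pick-a-Pic prompts.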

💻 Code · 📖 Paper · 🌐 Project Page

Results (checkpoint = step 180)

Standard-mmdet GenEval (553 prompts, averaged over 6 tasks) and PickScore (raw logit_scale·cos), compared with the teacher:

| model | GenEval ↑ | PickScore COCO-val ↑ | PickScore Pick-a-Pic ↑ |
|---|---|---|---|
| naive klein @ 1 step (no distillation) | ~0.42 | 19.95 | 20.11 |
| 4-step FLUX.2 klein teacher | 0.7944 | 22.576 | 21.848 |
| this: 1-step RDM (s180) | 0.8258 | 22.755 | 21.817 |

Distillation lifts the 1-step model from a broken floor (GenEval ~0.42, PickScore COCO-val 19.95) to above the 4-step teacher on GenEval (+3.1 pp) and PickScore COCO-val (+0.18), and to parity on Pick-a-Pic (−0.03). In other words, it closes the entire 4-step→1-step quality gap and then surpasses the teacher on composition and COCO-val preference.

GenEval per-task @ s180 (%): single_object 99.4 · two_object 92.4 · colors 92.3 · counting 75.6 · color_attr 70.8 · position 65.0.
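The headline GenEval figure and the deltas quoted above can be re-derived from the reported numbers. The following is a small sanity-check sketch, assuming the headline score is the unweighted mean of the six per-task scores (the per-task values are rounded to one decimal, so agreement is only expected to about four decimal places):

```python
# Per-task GenEval scores at step 180 (percent), copied from the results above.
per_task = {
    "single_object": 99.4,
    "two_object": 92.4,
    "colors": 92.3,
    "counting": 75.6,
    "color_attr": 70.8,
    "position": 65.0,
}

# Assumption: headline GenEval = unweighted mean over the 6 tasks, as a fraction.
geneval = sum(per_task.values()) / len(per_task) / 100.0

# Reported table values as (teacher, student) pairs.
table = {
    "GenEval": (0.7944, 0.8258),
    "PickScore COCO-val": (22.576, 22.755),
    "PickScore Pick-a-Pic": (21.848, 21.817),
}
deltas = {name: student - teacher for name, (teacher, student) in table.items()}

print(f"GenEval mean over tasks: {geneval:.4f}")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

The Pick-a-Pic delta comes out slightly negative, which is why that axis is described as parity rather than a win.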

Files

| file | what |
|---|---|
| model.safetensors | The generator weights in bfloat16 (~8 GB; the model's native inference dtype). Keys are the FLUX.2 klein DiT tensors, prefixed with "model." (the adapter's DiT submodule). |
| flux2_klein_1step_rdm_geallcoco_s180.pth | The raw training checkpoint (fp32; a dict with key "model"; model_ema and optimizer are None, as EMA was disabled). For exact reproduction. |
| config.json | Minimal metadata (arch, params, resolution, dtype). |

Usage

The generator replaces the DiT weights of the FLUX.2 klein pipeline; it reuses klein's VAE and the Qwen3 text encoder. Inference = encode prompt → one DiT forward on Gaussian latent noise → VAE decode.

import torch
from safetensors.torch import load_file

# 1) weights (bf16); keys are prefixed "model." (the DiT submodule of the training adapter)
sd = load_file("model.safetensors")
sd = {k[len("model."):]: v for k, v in sd.items() if k.startswith("model.")}  # -> bare klein DiT keys

# 2) load into a FLUX.2 klein DiT instance (from the klein pipeline), then run ONE step:
#    dit.load_state_dict(sd)                          # 3.876B params, bf16
#    ctx  = qwen3_text_encoder(prompt)                # ctx_len 48
#    z    = torch.randn(B, C, 64, 64)                 # 512^2 latent
#    v    = dit(z, ctx, t=1.0)                         # single velocity prediction
#    x0   = z - v                                      # one Euler step (t: 1 -> 0)
#    img  = klein_vae.decode(x0)
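The single Euler step in the comments above can be checked on a toy problem. The sketch below assumes the rectified-flow convention x_t = (1 − t)·x0 + t·noise with velocity v = noise − x0. That convention is consistent with x0 = z − v at t = 1 but is not stated in this card. The function `oracle_velocity` is a hypothetical stand-in for the DiT and is not part of any FLUX.2 API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean latent" standing in for what the generator should produce.
x0_true = rng.standard_normal((1, 4, 8, 8))


def oracle_velocity(z, t):
    """Hypothetical stand-in for the DiT: exact velocity of the straight path.

    Under x_t = (1 - t) * x0 + t * noise the velocity is noise - x0.
    At t = 1 the input z is pure noise, so v = z - x0.
    """
    assert t == 1.0, "this toy oracle is only defined at the noise end"
    return z - x0_true


def one_step_sample(z, velocity_fn):
    """Single Euler step from t = 1 to t = 0 (step size 1): x0 = z - v."""
    v = velocity_fn(z, 1.0)
    return z - v


z = rng.standard_normal(x0_true.shape)  # Gaussian latent noise
x0_hat = one_step_sample(z, oracle_velocity)
max_err = float(np.abs(x0_hat - x0_true).max())
print(f"max abs reconstruction error: {max_err:.2e}")
```

With an exact velocity the one-step estimate recovers the clean latent up to floating-point round-off. A learned 1-step generator is trained so that its single prediction plays the same role.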

This is research code; the reference training/inference stack (the Flux2AdapterModel wrapper, the 1-step sampler, and prompt→Qwen3-ctx preprocessing) is the FD-Loss / EPFL-VITA pipeline. The .pth and model.safetensors hold the same weights; the latter is a bf16 cast of the fp32 checkpoint.

Method (brief)

Full fine-tune of the 4B DiT, started directly from the klein initialization. The loss is a per-encoder self-normalized ∇log(MMD²) across 10 frozen vision encoders (inception, convnext, mae, clip, dinov3, pe-core, siglip2, aimv2, webssl-dino, dreamsim). It is computed with Nyström landmarks (M=8192) against a curated teacher reference made of GenEval-correctness-filtered teacher samples plus PickScore-top-3 COCO teacher samples. Other settings: on-policy rollout buffer R=10240, cold kernel bandwidth σ = median·0.25, and GradCache for exact full-batch MMD gradients. Training to step 180 took ~90 GPU-hours (8× H200).
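For readers unfamiliar with the objective, here is a minimal sketch of a single-encoder MMD² with a Gaussian kernel whose bandwidth is 0.25 × the median pairwise distance. It is an illustration under assumptions, not the training code: the exact kernel and median heuristic used by RDM are not specified in this card, and the Nyström landmarks, rollout buffer, GradCache and multi-encoder combination are all omitted. The features are random stand-ins for encoder outputs. Since ∇log(MMD²) = ∇MMD²/MMD², taking the log rescales each encoder's gradient by its own current MMD² value, which is presumably what "self-normalized" refers to.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and b (m, d)."""
    d2 = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off


def median_bandwidth(x, y, scale=0.25):
    """Assumed heuristic: sigma = scale * median pairwise distance (pooled set)."""
    pooled = np.concatenate([x, y], axis=0)
    d2 = pairwise_sq_dists(pooled, pooled)
    upper = np.triu_indices(len(pooled), k=1)  # distinct pairs only
    return scale * np.sqrt(np.median(d2[upper]))


def gaussian_kernel(a, b, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs."""
    return np.exp(-pairwise_sq_dists(a, b) / (2.0 * sigma**2))


def mmd2(x, y, sigma):
    """Biased (V-statistic) estimate of MMD^2 between samples x and y."""
    return (
        gaussian_kernel(x, x, sigma).mean()
        + gaussian_kernel(y, y, sigma).mean()
        - 2.0 * gaussian_kernel(x, y, sigma).mean()
    )


rng = np.random.default_rng(0)
reference = rng.standard_normal((256, 16))      # stand-in for teacher features
matched = rng.standard_normal((256, 16))        # same distribution as reference
shifted = rng.standard_normal((256, 16)) + 1.0  # mean-shifted distribution

mmd_matched = mmd2(matched, reference, median_bandwidth(matched, reference))
mmd_shifted = mmd2(shifted, reference, median_bandwidth(shifted, reference))
print(f"matched: {mmd_matched:.4f}  shifted: {mmd_shifted:.4f}")
```

The mean-shifted sample yields a clearly larger MMD² than the matched one, which is the signal a distribution-matching loss pushes toward zero.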

Caveats

  • GenEval is partly in-distribution. The 553 GenEval prompts appear in the training generator pool (≈17.6%) and the reference (≈17.8%). The GenEval number is strong but partly reflects in-distribution fit; held-out compositional generalization (e.g. T2I-CompBench) is not yet measured.
  • Trained/evaluated at 512².
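One way to quantify the kind of prompt overlap described in the first caveat is sketched below. The card does not say how its percentages were computed, so this assumes exact matching after case and whitespace normalization and reports the share of pool entries that coincide with an evaluation prompt. The prompts are hypothetical toy data, not the actual training pool or GenEval set.

```python
def normalize(prompt: str) -> str:
    """Case- and whitespace-insensitive key for exact-match comparison."""
    return " ".join(prompt.lower().split())


def overlap_share(pool, eval_prompts):
    """Fraction of pool entries whose normalized text is an eval prompt."""
    if not pool:
        return 0.0
    eval_keys = {normalize(p) for p in eval_prompts}
    hits = sum(normalize(p) in eval_keys for p in pool)
    return hits / len(pool)


# Hypothetical toy data for illustration only.
eval_prompts = ["a photo of a red apple", "a photo of two dogs"]
pool = [
    "A photo of a red  apple",   # matches after normalization
    "a cat sleeping on a sofa",
    "a photo of two dogs",       # exact match
    "a mountain lake at dawn",
]

share = overlap_share(pool, eval_prompts)
print(f"share of pool entries that are eval prompts: {share:.1%}")
```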

Citation

This model is from the paper "Representation Distribution Matching for One-Step Visual Generation" (arXiv:2607.02375).

@article{feng2026rdm,
  title={Representation Distribution Matching for One-Step Visual Generation},
  author={Feng, Lan and Li, Wuyang and Zablocki, {\'E}loi and Cord, Matthieu and Alahi, Alexandre},
  journal={arXiv preprint arXiv:2607.02375},
  year={2026}
}

License

Derived from FLUX.2 klein (Apache-2.0); released under Apache-2.0.
