multimodalart posted an update Mar 5
The Stable Diffusion 3 research paper broken down, including some overlooked details! 📝

Model
📏 2 base model variants mentioned: 2B and 8B sizes

📐 New architecture at all abstraction levels:
- 🔽 UNet; ⬆️ Multimodal Diffusion Transformer, bye cross-attention 👋
- 🆕 Rectified flows for the diffusion process (see the sketch after this list)
- 🧩 Still a Latent Diffusion Model
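
For reference, here's a minimal PyTorch sketch of a rectified-flow training step, assuming the velocity-prediction objective and the mid-weighted (logit-normal) timestep sampling the paper describes; `model`, the conditioning arguments, and the latent shapes are placeholders:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, latents, text_emb, pooled_emb):
    """One training step of a rectified-flow objective (illustrative sketch)."""
    b = latents.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(n), n ~ N(0, 1),
    # which concentrates training signal around the middle of the trajectory.
    t = torch.sigmoid(torch.randn(b, device=latents.device))
    t_ = t.view(b, 1, 1, 1)

    noise = torch.randn_like(latents)
    # Straight-line (rectified flow) interpolation between data and noise
    x_t = (1.0 - t_) * latents + t_ * noise

    # The network regresses the constant velocity of that straight path
    target = noise - latents
    pred = model(x_t, t, text_emb, pooled_emb)
    return F.mse_loss(pred, target)
```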

📄 3 text encoders: 2 CLIPs, one T5-XXL; plug-and-play: removing the larger one maintains competitiveness
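
A rough sketch of how the three encoder outputs get combined, based on the paper's description (the two CLIP hidden states concatenated channel-wise, zero-padded to the T5 width, then joined with the T5 tokens along the sequence axis); dimensions and names are illustrative:

```python
import torch

def combine_text_embeddings(clip_l_seq, clip_g_seq, t5_seq=None):
    # clip_l_seq: (B, 77, 768), clip_g_seq: (B, 77, 1280), t5_seq: (B, 77, 4096)
    clip_seq = torch.cat([clip_l_seq, clip_g_seq], dim=-1)       # (B, 77, 2048)
    pad = torch.zeros(clip_seq.shape[0], clip_seq.shape[1],
                      4096 - clip_seq.shape[-1],
                      dtype=clip_seq.dtype, device=clip_seq.device)
    clip_seq = torch.cat([clip_seq, pad], dim=-1)                # (B, 77, 4096)
    if t5_seq is None:
        # "Plug-and-play": dropping T5-XXL at inference keeps the layout,
        # just with zeroed tokens in its slot
        t5_seq = torch.zeros_like(clip_seq)
    return torch.cat([clip_seq, t5_seq], dim=1)                  # (B, 154, 4096)
```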

🗃️ Dataset was deduplicated with SSCD, which helped with memorization (no further details about the dataset though)
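
Conceptually, SSCD-based deduplication looks something like the sketch below; `embeddings` stands in for the actual SSCD descriptors, and the similarity threshold is made up for illustration (the paper doesn't give one):

```python
import torch

def deduplicate(embeddings: torch.Tensor, threshold: float = 0.5) -> list[int]:
    # embeddings: (N, D) L2-normalized copy-detection descriptors (e.g. SSCD)
    kept: list[int] = []
    for i in range(embeddings.shape[0]):
        if kept:
            # cosine similarity against everything already kept
            sims = embeddings[kept] @ embeddings[i]
            if sims.max() >= threshold:
                continue  # near-duplicate of a kept image -> drop it
        kept.append(i)
    return kept  # indices of images to retain (real pipelines use ANN search)
```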

Variants
🔁 A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics (a generic loss sketch follows below)
✏️ An Instruct Edit 2B model was trained and learned how to do text replacement
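
For intuition, a sketch of a Diffusion-DPO-style preference loss; this is an assumption about the recipe, following the published Diffusion-DPO formulation, and `beta` and the error terms are placeholders rather than values from the paper:

```python
import torch.nn.functional as F

def diffusion_dpo_loss(err_w, err_ref_w, err_l, err_ref_l, beta: float = 2000.0):
    # err_*: per-sample noise-prediction MSE on the human-preferred ("w") and
    # rejected ("l") images, for the fine-tuned model and a frozen reference.
    # The loss rewards denoising the preferred image better (and the rejected
    # one worse) than the reference model does.
    margin = (err_w - err_ref_w) - (err_l - err_ref_l)
    return -F.logsigmoid(-beta * margin).mean()
```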

Results
✅ State of the art in automated evals for composition and prompt understanding
✅ Best win rate in human preference evaluation for prompt understanding, aesthetics and typography (missing some details on how many participants and the design of the experiment)

Paper: https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

Thanks for the breakdown.

...Agreed, thank you!!!

Interesting, it seems like the novel things they added to SD 3 are really just:

  1. Changing the scheduling (~ linear / rectified flow)
  2. Sampling timesteps more in the middle of the time range
  3. (novel) MM-DiT, which splits the post-attention activations into 2 MLPs (one for text, one for image), though the two token streams still self-attend jointly since they're concatenated for attention (see the sketch after this list)
  4. Combining 3 text embeddings together depending on the complexity of the prompt (could a decision-router model be used to determine how many embeddings to use?)
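
Roughly, point 3 could look like this toy block (my sketch, ignoring the norms and the AdaLN-style timestep modulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlock(nn.Module):
    """Toy MM-DiT-style block: per-modality weights, one joint attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens, txt_tokens):
        n_img = img_tokens.shape[1]
        # Separate QKV projections per modality...
        iq, ik, iv = self.img_qkv(img_tokens).chunk(3, dim=-1)
        tq, tk, tv = self.txt_qkv(txt_tokens).chunk(3, dim=-1)
        # ...but a single joint self-attention over the concatenated sequence
        q = torch.cat([iq, tq], dim=1)
        k = torch.cat([ik, tk], dim=1)
        v = torch.cat([iv, tv], dim=1)
        joint = F.scaled_dot_product_attention(q, k, v)  # single-head for brevity
        # Post-attention activations split back into two modality-specific MLPs
        img_out, txt_out = joint[:, :n_img], joint[:, n_img:]
        return img_tokens + self.img_mlp(img_out), txt_tokens + self.txt_mlp(txt_out)
```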

I honestly thought there'd be more, though the move to a DiT is a big one