multimodalart posted an update Mar 5
The Stable Diffusion 3 research paper broken down, including some overlooked details! 📝

Model
📏 2 base model variants mentioned: 2B and 8B sizes

📐 New architecture at all abstraction levels:
- 🔽 UNet; ⬆️ Multimodal Diffusion Transformer, bye cross-attention 👋
- 🆕 Rectified flows for the diffusion process (see the sketch after this list)
- 🧩 Still a Latent Diffusion Model
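
For reference, here's a minimal PyTorch sketch of a rectified-flow training step, assuming the velocity-prediction objective and the mid-weighted (logit-normal) timestep sampling the paper describes; `model`, the conditioning arguments, and the latent shapes are placeholders:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, latents, text_emb, pooled_emb):
    """One training step of a rectified-flow objective (illustrative sketch)."""
    b = latents.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(n), n ~ N(0, 1),
    # which concentrates training signal around the middle of the trajectory.
    t = torch.sigmoid(torch.randn(b, device=latents.device))
    t_ = t.view(b, 1, 1, 1)

    noise = torch.randn_like(latents)
    # Straight-line (rectified flow) interpolation between data and noise
    x_t = (1.0 - t_) * latents + t_ * noise

    # The network regresses the constant velocity of that straight path
    target = noise - latents
    pred = model(x_t, t, text_emb, pooled_emb)
    return F.mse_loss(pred, target)
```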

📄 3 text encoders: 2 CLIPs, one T5-XXL; plug-and-play: removing the larger one maintains competitiveness
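
A rough sketch of how the three encoder outputs get combined, based on the paper's description (the two CLIP hidden states concatenated channel-wise, zero-padded to the T5 width, then joined with the T5 tokens along the sequence axis); dimensions and names are illustrative:

```python
import torch

def combine_text_embeddings(clip_l_seq, clip_g_seq, t5_seq=None):
    # clip_l_seq: (B, 77, 768), clip_g_seq: (B, 77, 1280), t5_seq: (B, 77, 4096)
    clip_seq = torch.cat([clip_l_seq, clip_g_seq], dim=-1)       # (B, 77, 2048)
    pad = torch.zeros(clip_seq.shape[0], clip_seq.shape[1],
                      4096 - clip_seq.shape[-1],
                      dtype=clip_seq.dtype, device=clip_seq.device)
    clip_seq = torch.cat([clip_seq, pad], dim=-1)                # (B, 77, 4096)
    if t5_seq is None:
        # "Plug-and-play": dropping T5-XXL at inference keeps the layout,
        # just with zeroed tokens in its slot
        t5_seq = torch.zeros_like(clip_seq)
    return torch.cat([clip_seq, t5_seq], dim=1)                  # (B, 154, 4096)
```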

🗃️ Dataset was deduplicated with SSCD, which helped with memorization (no further details about the dataset though)
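
Conceptually, SSCD-based deduplication looks something like the sketch below; `embeddings` stands in for the actual SSCD descriptors, and the similarity threshold is made up for illustration (the paper doesn't give one):

```python
import torch

def deduplicate(embeddings: torch.Tensor, threshold: float = 0.5) -> list[int]:
    # embeddings: (N, D) L2-normalized copy-detection descriptors (e.g. SSCD)
    kept: list[int] = []
    for i in range(embeddings.shape[0]):
        if kept:
            # cosine similarity against everything already kept
            sims = embeddings[kept] @ embeddings[i]
            if sims.max() >= threshold:
                continue  # near-duplicate of a kept image -> drop it
        kept.append(i)
    return kept  # indices of images to retain (real pipelines use ANN search)
```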

Variants
🔁 A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics (a generic loss sketch follows below)
✏️ An Instruct Edit 2B model was trained and learned how to do text replacement
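
For intuition, a sketch of a Diffusion-DPO-style preference loss; this is an assumption about the recipe, following the published Diffusion-DPO formulation, and `beta` and the error terms are placeholders rather than values from the paper:

```python
import torch.nn.functional as F

def diffusion_dpo_loss(err_w, err_ref_w, err_l, err_ref_l, beta: float = 2000.0):
    # err_*: per-sample noise-prediction MSE on the human-preferred ("w") and
    # rejected ("l") images, for the fine-tuned model and a frozen reference.
    # The loss rewards denoising the preferred image better (and the rejected
    # one worse) than the reference model does.
    margin = (err_w - err_ref_w) - (err_l - err_ref_l)
    return -F.logsigmoid(-beta * margin).mean()
```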

Results
✅ State of the art in automated evals for composition and prompt understanding
✅ Best win rate in human preference evaluation for prompt understanding, aesthetics and typography (missing some details on how many participants and the design of the experiment)

Paper: https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

Thanks for the breakdown.

...Agreed, thank you!!!

Interesting, it seems like the novel things they added to SD 3 are really just:

  1. Changing the scheduling (~ linear / rectified flow)
  2. Sampling timesteps more in the middle of the time range
  3. (novel) MM-DiT, which splits the post-attention activations into 2 MLPs (one for text, one for image), though the two token streams still self-attend jointly since they're concatenated for attention (see the sketch after this list)
  4. Combining 3 text embeddings together depending on the complexity of the prompt (could a decision-router model be used to determine how many embeddings to use?)
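
Roughly, point 3 could look like this toy block (my sketch, ignoring the norms and the AdaLN-style timestep modulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlock(nn.Module):
    """Toy MM-DiT-style block: per-modality weights, one joint attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens, txt_tokens):
        n_img = img_tokens.shape[1]
        # Separate QKV projections per modality...
        iq, ik, iv = self.img_qkv(img_tokens).chunk(3, dim=-1)
        tq, tk, tv = self.txt_qkv(txt_tokens).chunk(3, dim=-1)
        # ...but a single joint self-attention over the concatenated sequence
        q = torch.cat([iq, tq], dim=1)
        k = torch.cat([ik, tk], dim=1)
        v = torch.cat([iv, tv], dim=1)
        joint = F.scaled_dot_product_attention(q, k, v)  # single-head for brevity
        # Post-attention activations split back into two modality-specific MLPs
        img_out, txt_out = joint[:, :n_img], joint[:, n_img:]
        return img_tokens + self.img_mlp(img_out), txt_tokens + self.txt_mlp(txt_out)
```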

I honestly thought there'd be more, though the move to a DiT is a big one