Modality Forcing for Scalable Spatial Generation

Joint text → RGB + depth generation with a single diffusion transformer, built on FLUX.2. Modality Forcing assigns separate noise levels per modality during post-training, so one model supports joint generation (text → RGB-D), image-to-depth, and depth-to-image at inference.
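The per-modality noise idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the released training code: the linear (rectified-flow style) interpolation, the uniform sampling of noise levels, the function names, and the use of short float lists in place of real RGB and depth latents. The point is only that each modality gets its own noise level, and that pinning one modality at level 0 (clean) turns the same model into a conditional generator for the other.

```python
import random

def add_noise(latent, t, rng):
    """Blend a clean latent (t = 0) toward Gaussian noise (t = 1).

    Assumed linear interpolation; the actual schedule is not stated in this card.
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]

def sample_training_levels(rng):
    """Draw independent noise levels for RGB and depth (assumed uniform)."""
    return rng.random(), rng.random()

def noise_pair(rgb, depth, t_rgb, t_depth, rng):
    """Noise the two modalities separately, each at its own level."""
    return add_noise(rgb, t_rgb, rng), add_noise(depth, t_depth, rng)

# Hypothetical starting noise levels (t_rgb, t_depth) for each inference task.
TASK_START_LEVELS = {
    "joint": (1.0, 1.0),           # text -> RGB-D: both start from noise
    "image_to_depth": (0.0, 1.0),  # RGB stays clean, depth is generated
    "depth_to_image": (1.0, 0.0),  # depth stays clean, RGB is generated
}

rng = random.Random(0)
rgb = [0.5, -0.2, 0.1]    # toy stand-in for an RGB latent
depth = [1.0, 0.8, 0.3]   # toy stand-in for a depth latent

t_rgb, t_depth = TASK_START_LEVELS["image_to_depth"]
rgb_t, depth_t = noise_pair(rgb, depth, t_rgb, t_depth, rng)
```

With the `image_to_depth` levels, `rgb_t` is identical to `rgb` while `depth_t` is pure noise, which is the conditioning pattern the description implies for that task.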

📄 Paper: arXiv:2606.13676
💻 Code: github.com/Duisterhof/modality-forcing
🚀 Demo: Hugging Face Space
🌐 Project page: modality-forcing.github.io

Files

| File | Description |
| --- | --- |
| `model.safetensors` | FluxRGBD DiT (12B parameters total: 9B-class FLUX.2 backbone plus depth streams, bf16) |
| `config.json` | Model variant config (`flux_rgbd_9b_v2`) |
| `ae_encoder.safetensors` / `ae_decoder.safetensors` | FLUX.2 autoencoder |

The Qwen3-8B text encoder is not included in this repository; it is pulled separately from `Qwen/Qwen3-8B`.

Usage

git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"

The scripts download these weights automatically; `bartduis/modality_forcing` is the default value of `--model`.
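For anyone who wants the files in the local cache before running the scripts, a minimal sketch using `hf_hub_download` from the `huggingface_hub` package might look like the following. The file names come from the Files section; the helper name `fetch_weights` is hypothetical, and whether the repository's scripts use this same mechanism internally is an assumption.

```python
# File names as listed in the Files section of this card.
FILES = (
    "model.safetensors",
    "config.json",
    "ae_encoder.safetensors",
    "ae_decoder.safetensors",
)

def fetch_weights(repo_id="bartduis/modality_forcing", files=FILES):
    """Download each file into the Hugging Face cache and return local paths.

    Hypothetical helper, not part of the modality-forcing repository.
    """
    # Imported lazily so this module loads even without the package installed;
    # requires `pip install huggingface_hub` before the function is called.
    from huggingface_hub import hf_hub_download

    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in files}
```

Calling `fetch_weights()` returns a dictionary mapping each file name to its cached path. The Qwen3-8B text encoder lives in a separate repository and is not covered by this helper.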

License

The model weights are released under CC BY-NC 4.0 (non-commercial). The inference code is Apache-2.0; see the GitHub repository.

Citation

@article{duisterhof2026mofo,
  title   = {Modality Forcing for Scalable Spatial Generation},
  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
  journal = {arXiv preprint arXiv:2606.13676},
  year    = {2026}
}
