Modality Forcing for Scalable Spatial Generation
Paper: arXiv:2606.13676
Joint text → RGB + depth generation with a single diffusion transformer, built on FLUX.2. Modality Forcing assigns separate noise levels per modality during post-training, so one model supports joint generation (text → RGB-D), image-to-depth, and depth-to-image at inference.
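The idea of per-modality noise levels can be illustrated with a small, dependency-free sketch. This is not the released training or sampling code: the linear interpolation toward Gaussian noise, the uniform sampling of noise levels, and the names `add_noise`, `training_inputs`, and `inference_schedule` are all assumptions made for illustration. It only shows how independent noise levels during training could map onto the three inference modes, with a conditioning modality held clean at noise level 0.

```python
import random


def add_noise(latent, t, rng):
    """Blend a latent with Gaussian noise: (1 - t) * x + t * eps.

    t = 0 keeps the latent clean, t = 1 replaces it with pure noise.
    (Assumed interpolation; the paper's exact formulation may differ.)
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]


def training_inputs(rgb, depth, rng):
    """Draw an independent noise level for each modality, then noise it."""
    t_rgb, t_depth = rng.random(), rng.random()
    noisy_rgb = add_noise(rgb, t_rgb, rng)
    noisy_depth = add_noise(depth, t_depth, rng)
    return noisy_rgb, noisy_depth, (t_rgb, t_depth)


def inference_schedule(task, num_steps):
    """Return per-step (t_rgb, t_depth) pairs for a given task.

    The generated modality ramps from 1.0 down to 1 / num_steps;
    a modality used as conditioning stays at 0.0 (clean) throughout.
    """
    ramp = [1.0 - i / num_steps for i in range(num_steps)]
    if task == "joint":
        return [(t, t) for t in ramp]
    if task == "image_to_depth":
        return [(0.0, t) for t in ramp]
    if task == "depth_to_image":
        return [(t, 0.0) for t in ramp]
    raise ValueError(f"unknown task: {task}")
```

Because the model sees mismatched noise levels during post-training, fixing one modality at level 0 at inference is already within the training distribution, which is why a single set of weights can cover all three tasks in this sketch.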
| File | Description |
|---|---|
| `model.safetensors` | FluxRGBD DiT (12B total: 9B-class FLUX.2 backbone + depth streams, bf16) |
| `config.json` | Model variant config (`flux_rgbd_9b_v2`) |
| `ae_encoder.safetensors` / `ae_decoder.safetensors` | FLUX.2 autoencoder |
The Qwen3-8B text encoder is pulled separately from `Qwen/Qwen3-8B`.
git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"
The scripts download these weights automatically (`bartduis/modality_forcing` is the default `--model`).
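For offline setups, the weights can also be fetched ahead of time. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the files listed in the table sit at the root of the `bartduis/modality_forcing` repository. The helper name `fetch_weights` and the `weights` directory are hypothetical, and this is not necessarily how the repository's own scripts perform the download.

```python
REPO_ID = "bartduis/modality_forcing"

# File names taken from the table above; their location at the repo root is assumed.
FILES = [
    "model.safetensors",
    "config.json",
    "ae_encoder.safetensors",
    "ae_decoder.safetensors",
]


def fetch_weights(local_dir="weights", repo_id=REPO_ID):
    """Download each listed file and return a {filename: local path} dict."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        for name in FILES
    }


# Example usage (requires network access):
#   paths = fetch_weights()
#   print(paths["config.json"])
```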
The model weights are released under CC BY-NC 4.0 (non-commercial). The inference code is Apache-2.0; see the GitHub repository.
@article{duisterhof2026mofo,
title = {Modality Forcing for Scalable Spatial Generation},
author = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
journal = {arXiv preprint arXiv:2606.13676},
year = {2026}
}