Modality Forcing for Scalable Spatial Generation

Joint text → RGB + depth generation with a single diffusion transformer, built on FLUX.2. Modality Forcing assigns separate noise levels per modality during post-training, so one model supports joint generation (text → RGB-D), image-to-depth, and depth-to-image at inference.
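The idea of per-modality noise levels can be illustrated with a small sketch. This is not the repository's code: the function names, the uniform sampling of training noise levels, and the rectified-flow convention (t = 1.0 is pure noise, t = 0.0 is clean) are all assumptions made for illustration. It shows how, once RGB and depth each carry their own noise level, the three inference tasks reduce to different choices of which modality is held clean.

```python
import random

# Hypothetical sketch; names and conventions are assumptions, not the authors' API.
# Convention assumed here: t = 1.0 is pure noise, t = 0.0 is a clean latent.


def training_noise_levels(rng):
    """Draw an independent noise level for each modality.

    Uniform sampling on [0, 1) is an assumption; the actual training
    distribution is not stated in this model card.
    """
    return {"rgb": rng.random(), "depth": rng.random()}


def inference_noise_levels(mode, t):
    """Return per-modality noise levels at sampler time t for a given task."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if mode == "joint":
        # Text -> RGB-D: both modalities are denoised together.
        return {"rgb": t, "depth": t}
    if mode == "image_to_depth":
        # RGB is given, so it stays clean while depth is denoised.
        return {"rgb": 0.0, "depth": t}
    if mode == "depth_to_image":
        # Depth is given, so it stays clean while RGB is denoised.
        return {"rgb": t, "depth": 0.0}
    raise ValueError(f"unknown mode: {mode}")


def noise_latent(clean, noise, t):
    """Linear interpolation between a clean latent and noise at level t."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(training_noise_levels(rng))
    for mode in ("joint", "image_to_depth", "depth_to_image"):
        print(mode, inference_noise_levels(mode, 0.7))
```

In this sketch a conditioning modality is simply one whose noise level is pinned to zero, which is why a single set of weights can serve all three tasks.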

Files

| File | Description |
| --- | --- |
| model.safetensors | FluxRGBD DiT (12B total: 9B-class FLUX.2 backbone + depth streams, bf16) |
| config.json | Model variant config (flux_rgbd_9b_v2) |
| ae_encoder.safetensors / ae_decoder.safetensors | FLUX.2 autoencoder |

The Qwen3-8B text encoder is pulled separately from Qwen/Qwen3-8B.

Usage

git clone https://github.com/Duisterhof/modality-forcing.git
cd modality-forcing
bash install.sh
python scripts/joint.py --prompt "a cozy sunlit kitchen with wooden cabinets"

The scripts download these weights automatically (bartduis/modality_forcing is the default --model).
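If you want the weights on disk before running the scripts, a minimal sketch along these lines may work. It assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper name `fetch_weights` and the `local_dir` argument are illustrative and are not part of the repository. The file list mirrors the Files section above.

```python
# Hypothetical helper; not part of the modality-forcing repository.
REPO_ID = "bartduis/modality_forcing"

# File names taken from the Files section of this model card.
WEIGHT_FILES = [
    "model.safetensors",
    "config.json",
    "ae_encoder.safetensors",
    "ae_decoder.safetensors",
]


def fetch_weights(local_dir=None):
    """Download the listed files and return the local snapshot path."""
    # Imported lazily so the module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=WEIGHT_FILES,
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(fetch_weights())
```

The Qwen3-8B text encoder is not covered by this sketch, since it is pulled separately from Qwen/Qwen3-8B.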

License

The model weights are released under CC BY-NC 4.0 (non-commercial). The inference code is Apache-2.0; see the GitHub repository.

Citation

@article{duisterhof2026mofo,
  title   = {Modality Forcing for Scalable Spatial Generation},
  author  = {Duisterhof, Bardienus Pieter and Ramanan, Deva and Ichnowski, Jeffrey and Johnson, Justin and Park, Keunhong},
  journal = {arXiv preprint arXiv:2606.13676},
  year    = {2026}
}