MM-FM: Flow Matching for Multimodal Distributions
Flow Matching for Multimodal Distributions
Gaoxiang Luo*, Frank Cole*, Sihang Zhang, Yuxiang Wan, Yulong Lu, Ju Sun
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026
MM-FM replaces flow matching's Gaussian source with a Gaussian Mixture Model (GMM) trained on encoder latents. Each image is coupled to source noise drawn from its own GMM mode, and the model can optionally be conditioned on that mode. This repository hosts all artifacts for the code release: checkpoints, trained GMMs, pretrained RAE decoders, normalization statistics, and the ImageNet-256 FID reference batch.
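The coupling described above can be illustrated with a toy example. The following is a minimal sketch, not the released implementation: it assumes a hard assignment of each latent to its most responsible component of a diagonal GMM, and the standard linear flow-matching interpolant with the source at t = 0 and the data at t = 1. The helper names (`assign_mode`, `sample_from_mode`) and the 2-component toy GMM are hypothetical; the released GMMs have 8192 diagonal components. Plain Python is used for portability, whereas a real pipeline would operate on batched tensors.

```python
import math
import random


def log_diag_gaussian(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def assign_mode(x, weights, means, variances):
    """Index of the component with the highest posterior responsibility for x.

    Assumption: hard (argmax) assignment; the source does not state how
    an image's mode is chosen.
    """
    scores = [
        math.log(w) + log_diag_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def sample_from_mode(k, means, variances, rng):
    """Draw source noise from component k only."""
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[k], variances[k])]


# Toy 2-component GMM in 2-D (illustrative values only).
weights = [0.5, 0.5]
means = [[-3.0, 0.0], [3.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

rng = random.Random(0)
latent = [2.5, 0.4]  # stand-in for an encoder latent

k = assign_mode(latent, weights, means, variances)  # mode of this latent
x0 = sample_from_mode(k, means, variances, rng)     # source point from the same mode

# Standard linear interpolant and its velocity target (assumed, not confirmed).
t = 0.3
x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, latent)]
target_velocity = [b - a for a, b in zip(x0, latent)]
```

In the mode-conditional setting, the index `k` would additionally be passed to the network as a conditioning signal; how the released model consumes it is not shown here.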
Usage
git clone --recursive https://github.com/GaoxiangLuo/MM-FM.git && cd MM-FM
uv sync
uv run hf download luo00042/mm-fm --local-dir artifacts # this repository (~160 GB)
# generate 50K images with the mode-conditional + GMM model (8 GPUs)
uv run torchrun --standalone --nnodes=1 --nproc_per_node=8 \
src/sample_ddp.py \
--config configs/stage2/sampling/ImageNet256/DiTDH-XL_DINOv2-B-MODE-CONDITIONAL-GMM-8192-DIAG.yaml \
--sample-dir results/samples/mode-cond-gmm-8192-diag \
--num-fid-samples 50000 --precision bf16 --per-proc-batch-size 4
See the code repository's README for GMM training, FM training, and FID evaluation instructions.
Repository Layout
| Path | Contents |
|---|---|
| checkpoints/<encoder>/ | flow-matching DiT checkpoints (uncond-gmm / mode-gmm; 25k = 20 epochs, 100k = 80 epochs) |
| checkpoints/autoguidance/ | small DiT-S models used as the autoguidance guide |
| gmm/<encoder>/ | trained CLS + spatial GMMs (8192 components, diagonal) |
| decoders/<encoder>/ | pretrained RAE decoders (ViT-XL), from RAE |
| normalization_stats/<encoder>/ | latent normalization statistics |
| fid_reference/ | VIRTUAL_imagenet256_labeled.npz, the official ImageNet-256 FID reference batch (mirrored from guided-diffusion) |
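After downloading, it can be useful to confirm that the local copy matches the layout above. The following is a small sketch, assuming the artifacts were downloaded to `artifacts/` as in the Usage section. The function name `missing_artifacts` is hypothetical, and only the top-level paths named in the table are checked; the per-encoder subdirectory names are not enumerated here.

```python
from pathlib import Path

# Directories and files named in the Repository Layout table.
EXPECTED_DIRS = [
    "checkpoints",
    "checkpoints/autoguidance",
    "gmm",
    "decoders",
    "normalization_stats",
    "fid_reference",
]
EXPECTED_FILES = [
    "fid_reference/VIRTUAL_imagenet256_labeled.npz",
]


def missing_artifacts(root):
    """Return the expected relative paths that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Report anything missing from the default download location.
for path in missing_artifacts("artifacts"):
    print(f"missing: {path}")
```

An empty result only indicates that the top-level structure is present; it does not verify file integrity or that every checkpoint finished downloading.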
Results
FID-50K on ImageNet-256, reproduced end-to-end with the released code and these artifacts (50-step Euler ODE solver, bf16 sampling). AG = autoguidance with the DiT-S guides at scale 1.5.
| Encoder | Setting | Epochs | AG | FID | Checkpoint |
|---|---|---|---|---|---|
| DINOv2-B | GMM (uncond) | 20 | -- | 4.84 | checkpoints/dinov2-b/uncond-gmm-25k.pt |
| DINOv2-B | GMM (uncond) | 20 | ✓ | 4.03 | same + autoguidance/dit-s-uncond-gmm-25k.pt |
| DINOv2-B | GMM (uncond) | 80 | -- | 3.84 | checkpoints/dinov2-b/uncond-gmm-100k.pt |
| DINOv2-B | GMM (uncond) | 80 | ✓ | 3.17 | same + autoguidance/dit-s-uncond-gmm-25k.pt |
| DINOv2-B | GMM + Mode | 20 | -- | 4.77 | checkpoints/dinov2-b/mode-gmm-25k.pt |
| DINOv2-B | GMM + Mode | 20 | ✓ | 4.07 | same + autoguidance/dit-s-mode-gmm-25k.pt |
| DINOv2-B | GMM + Mode | 80 | -- | 3.20 | checkpoints/dinov2-b/mode-gmm-100k.pt |
| DINOv2-B | GMM + Mode | 80 | ✓ | 2.78 | same + autoguidance/dit-s-mode-gmm-25k.pt |
| SigLIP2-B | GMM (uncond) | 20 | -- | 8.23 | checkpoints/siglip2-b/uncond-gmm-25k.pt |
| SigLIP2-B | GMM + Mode | 20 | -- | 7.25 | checkpoints/siglip2-b/mode-gmm-25k.pt |
| MAE-B | GMM (uncond) | 20 | -- | 17.03 | checkpoints/mae-b/uncond-gmm-25k.pt |
| MAE-B | GMM + Mode | 20 | -- | 16.24 | checkpoints/mae-b/mode-gmm-25k.pt |
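The sampling configuration behind this table (50 Euler steps, autoguidance at scale 1.5) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in `src/sample_ddp.py`: it assumes time runs from the source at t = 0 to the data at t = 1, and it uses the commonly cited autoguidance extrapolation `v_guide + w * (v_main - v_guide)`. The constant velocity fields `main_model` and `guide_model` are hypothetical stand-ins for the DiT networks.

```python
def guided_velocity(v_main, v_guide, scale):
    """Extrapolate from the weaker guide toward and past the main model.

    scale = 1.0 recovers the main model's velocity unchanged.
    """
    return [g + scale * (m - g) for m, g in zip(v_main, v_guide)]


def euler_sample(x0, main_model, guide_model=None, scale=1.5, num_steps=50):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = main_model(x, t)
        if guide_model is not None:
            v = guided_velocity(v, guide_model(x, t), scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical constant velocity fields, used only to exercise the loop.
def main_model(x, t):
    return [2.0, -1.0]


def guide_model(x, t):
    return [1.0, -0.5]


source = [0.0, 0.0]  # in MM-FM this would be a draw from a GMM mode
unguided = euler_sample(source, main_model)
guided = euler_sample(source, main_model, guide_model, scale=1.5)
print(unguided, guided)
```

With a scale above 1, the guided velocity moves further along the direction in which the main model differs from the guide, which is the mechanism the AG column toggles.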
Citing MM-FM
@InProceedings{Luo_2026_CVPR,
author = {Luo, Gaoxiang and Cole, Frank and Zhang, Sihang and Wan, Yuxiang and Lu, Yulong and Sun, Ju},
title = {Flow Matching for Multimodal Distributions},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
pages = {23260-23271}
}
Acknowledgments
MM-FM builds directly on RAE: the pretrained encoders and decoders (including decoders/ here) are RAE's, used as-is; the flow-matching models are trained in their latent space. The FID reference batch comes from guided-diffusion.