MM-FM: Flow Matching for Multimodal Distributions

Flow Matching for Multimodal Distributions
Gaoxiang Luo*, Frank Cole*, Sihang Zhang, Yuxiang Wan, Yulong Lu, Ju Sun
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026


MM-FM replaces flow matching's Gaussian source with a Gaussian mixture model (GMM) fitted to encoder latents. Each image is coupled to source noise drawn from its own GMM mode, and the model can optionally be conditioned on that mode. This repository hosts all artifacts for the code release: checkpoints, trained GMMs, pretrained RAE decoders, normalization statistics, and the ImageNet-256 FID reference batch.
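
To make the coupling concrete, here is a minimal NumPy sketch of how a latent could be paired with noise from its own mode of a diagonal GMM and turned into a flow-matching training pair. It is an illustration under stated assumptions, not the released training code: the function names are hypothetical, the hard argmax assignment is one possible way to pick "its own mode", the linear interpolant with t = 0 at the source and t = 1 at the data is an assumed convention, and the separate CLS and spatial GMMs are collapsed into a single toy mixture.

```python
import numpy as np

def assign_modes(x, weights, means, variances):
    """Most likely diagonal-GMM component for each row of x (shape (N, D))."""
    # log N(x | mu_k, diag(var_k)), dropping the constant shared by all components
    diff = x[:, None, :] - means[None, :, :]                    # (N, K, D)
    log_prob = -0.5 * np.sum(diff**2 / variances + np.log(variances), axis=-1)
    return np.argmax(log_prob + np.log(weights), axis=1)       # (N,)

def sample_source(modes, means, variances, rng):
    """Draw one source sample per item from its assigned component."""
    eps = rng.standard_normal(means[modes].shape)
    return means[modes] + np.sqrt(variances[modes]) * eps

def fm_training_pair(x1, x0, rng):
    """Assumed linear interpolant: returns t, x_t and the velocity target x1 - x0."""
    t = rng.uniform(size=(x1.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    return t, x_t, x1 - x0

# Toy example: two well-separated components in 2-D (hypothetical values).
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[-5.0, 0.0], [5.0, 0.0]])
variances = np.ones((2, 2))
latents = np.array([[-4.8, 0.3], [5.1, -0.2]])

modes = assign_modes(latents, weights, means, variances)
x0 = sample_source(modes, means, variances, rng)
t, x_t, v_target = fm_training_pair(latents, x0, rng)
```

In the mode-conditional setting, `modes` would additionally be passed to the network as a conditioning label; in the unconditional setting it only determines which component the source noise comes from.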

Usage

git clone --recursive https://github.com/GaoxiangLuo/MM-FM.git && cd MM-FM
uv sync
uv run hf download luo00042/mm-fm --local-dir artifacts   # this repository (~160 GB)

# generate 50K images with the mode-conditional + GMM model (8 GPUs)
uv run torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  src/sample_ddp.py \
  --config configs/stage2/sampling/ImageNet256/DiTDH-XL_DINOv2-B-MODE-CONDITIONAL-GMM-8192-DIAG.yaml \
  --sample-dir results/samples/mode-cond-gmm-8192-diag \
  --num-fid-samples 50000 --precision bf16 --per-proc-batch-size 4

See the README in the cloned GitHub repository for instructions on GMM training, flow-matching (FM) training, and FID evaluation.
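
If the full ~160 GB download is more than you need, the files for a single encoder can be fetched selectively. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`; the helper names are hypothetical, and it assumes that every top-level directory in the Repository Layout section uses the same encoder folder name as the checkpoint paths (for example `dinov2-b`), which this card does not state explicitly.

```python
def artifact_patterns(encoder, include_autoguidance=False):
    """Glob patterns for one encoder's files, following the repository layout."""
    patterns = [
        f"checkpoints/{encoder}/*",
        f"gmm/{encoder}/*",
        f"decoders/{encoder}/*",
        f"normalization_stats/{encoder}/*",
        "fid_reference/*",
    ]
    if include_autoguidance:
        patterns.append("checkpoints/autoguidance/*")
    return patterns

def download_subset(encoder, local_dir="artifacts", include_autoguidance=False):
    """Download only the matching files into local_dir and return its path."""
    # Imported lazily so artifact_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="luo00042/mm-fm",
        local_dir=local_dir,
        allow_patterns=artifact_patterns(encoder, include_autoguidance),
    )
```

For example, `download_subset("dinov2-b", include_autoguidance=True)` would fetch the DINOv2-B artifacts plus the DiT-S guides, provided the folder-name assumption holds.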

Repository Layout

Path	Contents
`checkpoints/<encoder>/`	flow-matching DiT checkpoints (`uncond-gmm` / `mode-gmm`; `25k` = 20 epochs, `100k` = 80 epochs)
`checkpoints/autoguidance/`	small DiT-S models used as the autoguidance guide
`gmm/<encoder>/`	trained CLS + spatial GMMs (8192 components, diagonal)
`decoders/<encoder>/`	pretrained RAE decoders (ViT-XL), from RAE
`normalization_stats/<encoder>/`	latent normalization statistics
`fid_reference/`	`VIRTUAL_imagenet256_labeled.npz`, the official ImageNet-256 FID reference batch (mirrored from guided-diffusion)
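
For context on what the reference batch is used for: FID is the Fréchet distance between two Gaussians fitted to Inception features, one from generated samples and one from the reference set. The sketch below shows only that distance in NumPy, with hypothetical function names; it is not the official evaluator, and the scores reported in this card come from the released evaluation pipeline rather than from this snippet.

```python
import numpy as np

def feature_stats(features):
    """Mean and covariance of an (N, D) feature matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}) for two Gaussians."""
    diff = mu1 - mu2
    # For positive semi-definite inputs the eigenvalues of S1 @ S2 are real and
    # non-negative, so Tr((S1 S2)^{1/2}) is the sum of their square roots.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)
```

Production evaluators typically use a matrix square root routine with extra numerical safeguards; the eigenvalue form is used here only to keep the example dependency-free beyond NumPy.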

Results

FID-50K on ImageNet-256, reproduced end-to-end with the released code and these artifacts (50-step Euler ODE, bf16 sampling). AG = autoguidance with the DiT-S guides, scale 1.5.

Encoder	Setting	Epochs	AG	FID	Checkpoint
DINOv2-B	GMM (uncond)	20	--	4.84	`checkpoints/dinov2-b/uncond-gmm-25k.pt`
DINOv2-B	GMM (uncond)	20	✓	4.03	same as above + `checkpoints/autoguidance/dit-s-uncond-gmm-25k.pt`
DINOv2-B	GMM (uncond)	80	--	3.84	`checkpoints/dinov2-b/uncond-gmm-100k.pt`
DINOv2-B	GMM (uncond)	80	✓	3.17	same as above + `checkpoints/autoguidance/dit-s-uncond-gmm-25k.pt`
DINOv2-B	GMM + Mode	20	--	4.77	`checkpoints/dinov2-b/mode-gmm-25k.pt`
DINOv2-B	GMM + Mode	20	✓	4.07	same as above + `checkpoints/autoguidance/dit-s-mode-gmm-25k.pt`
DINOv2-B	GMM + Mode	80	--	3.20	`checkpoints/dinov2-b/mode-gmm-100k.pt`
DINOv2-B	GMM + Mode	80	✓	2.78	same as above + `checkpoints/autoguidance/dit-s-mode-gmm-25k.pt`
SigLIP2-B	GMM (uncond)	20	--	8.23	`checkpoints/siglip2-b/uncond-gmm-25k.pt`
SigLIP2-B	GMM + Mode	20	--	7.25	`checkpoints/siglip2-b/mode-gmm-25k.pt`
MAE-B	GMM (uncond)	20	--	17.03	`checkpoints/mae-b/uncond-gmm-25k.pt`
MAE-B	GMM + Mode	20	--	16.24	`checkpoints/mae-b/mode-gmm-25k.pt`
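
The two sampling ingredients behind these numbers, a 50-step Euler ODE solve and autoguidance at scale 1.5, can be sketched in a few lines. This is an illustration rather than the code in `src/sample_ddp.py`: the function names are hypothetical, the integration direction (t = 0 at the source, t = 1 at the data) is an assumption, and the guidance rule is the standard autoguidance extrapolation between a main model and a weaker guide applied to velocities, which may differ in detail from the released implementation.

```python
def autoguided_velocity(v_main, v_guide, scale):
    """Extrapolate from the guide's prediction toward, and past, the main model's."""
    def v(x, t):
        vg = v_guide(x, t)
        return vg + scale * (v_main(x, t) - vg)
    return v

def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

# Toy check with constant scalar velocity fields standing in for the networks.
v_main = lambda x, t: 2.0
v_guide = lambda x, t: 1.0
x_guided = euler_sample(autoguided_velocity(v_main, v_guide, scale=1.5), x0=0.0)
x_plain = euler_sample(autoguided_velocity(v_main, v_guide, scale=1.0), x0=0.0)
```

With `scale=1.0` the guide cancels out and the main model's velocity is recovered; values above 1 push samples away from the guide's predictions. The same functions work unchanged on NumPy arrays or tensors, since they rely only on addition and scalar multiplication.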

Citing MM-FM

@InProceedings{Luo_2026_CVPR,
  author    = {Luo, Gaoxiang and Cole, Frank and Zhang, Sihang and Wan, Yuxiang and Lu, Yulong and Sun, Ju},
  title     = {Flow Matching for Multimodal Distributions},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {23260-23271}
}

Acknowledgments

MM-FM builds directly on RAE: the pretrained encoders and decoders (including decoders/ here) are RAE's, used as-is; the flow-matching models are trained in their latent space. The FID reference batch comes from guided-diffusion.
