MM-FM: Flow Matching for Multimodal Distributions

License: MIT

Flow Matching for Multimodal Distributions
Gaoxiang Luo*, Frank Cole*, Sihang Zhang, Yuxiang Wan, Yulong Lu, Ju Sun
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

[Project Page] [Paper] [Code] [BibTeX]

MM-FM replaces flow matching's Gaussian source with a Gaussian Mixture Model trained on encoder latents, coupling each image to source noise from its own GMM mode and optionally conditioning the model on the mode. This repository hosts all artifacts for the code release: checkpoints, trained GMMs, pretrained RAE decoders, normalization statistics, and the ImageNet-256 FID reference batch.
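The coupling described above can be illustrated with a minimal NumPy sketch. It is not the repository's implementation: the function names (`assign_mode`, `sample_source`, `fm_pair`), the hard argmax mode assignment, the linear interpolant, and the time convention (t = 0 is the source, t = 1 is the data) are all assumptions made for illustration. Only the diagonal-covariance GMM and the idea of drawing source noise from each latent's own mode come from the description above.

```python
import numpy as np

def assign_mode(x, weights, means, variances):
    """Index of the most responsible diagonal-Gaussian component for each row of x."""
    # Log-density of every sample under every component: shape (N, K).
    diff = x[:, None, :] - means[None, :, :]
    log_prob = -0.5 * (
        np.sum(diff**2 / variances[None], axis=-1)
        + np.sum(np.log(2.0 * np.pi * variances), axis=-1)[None, :]
    )
    return np.argmax(np.log(weights)[None, :] + log_prob, axis=1)

def sample_source(modes, means, variances, rng):
    """Draw source noise from the GMM component assigned to each sample."""
    eps = rng.standard_normal(means[modes].shape)
    return means[modes] + np.sqrt(variances[modes]) * eps

def fm_pair(x1, x0, t):
    """Linear interpolant x_t and its target velocity (assumed convention)."""
    t = t[:, None]
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

# Toy usage: two well-separated modes in 2-D stand in for encoder latents.
rng = np.random.default_rng(0)
means = np.array([[-5.0, -5.0], [5.0, 5.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
latents = np.array([[-4.9, -5.1], [5.2, 4.8]])

modes = assign_mode(latents, weights, means, variances)
noise = sample_source(modes, means, variances, rng)
x_t, target_v = fm_pair(latents, noise, rng.uniform(size=len(latents)))
```

In the mode-conditional setting, `modes` would additionally be passed to the model as a conditioning input; how the released models encode it is not shown here.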

Usage

git clone --recursive https://github.com/GaoxiangLuo/MM-FM.git && cd MM-FM
uv sync
uv run hf download luo00042/mm-fm --local-dir artifacts   # this repository (~160 GB)

# generate 50K images with the mode-conditional + GMM model (8 GPUs)
uv run torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  src/sample_ddp.py \
  --config configs/stage2/sampling/ImageNet256/DiTDH-XL_DINOv2-B-MODE-CONDITIONAL-GMM-8192-DIAG.yaml \
  --sample-dir results/samples/mode-cond-gmm-8192-diag \
  --num-fid-samples 50000 --precision bf16 --per-proc-batch-size 4

See the README for GMM training, FM training, and FID evaluation instructions.

Repository Layout

Path | Contents
checkpoints/<encoder>/ | flow-matching DiT checkpoints (uncond-gmm / mode-gmm; 25k = 20 epochs, 100k = 80 epochs)
checkpoints/autoguidance/ | small DiT-S models used as the autoguidance guide
gmm/<encoder>/ | trained CLS + spatial GMMs (8192 components, diagonal)
decoders/<encoder>/ | pretrained RAE decoders (ViT-XL), from RAE
normalization_stats/<encoder>/ | latent normalization statistics
fid_reference/ | VIRTUAL_imagenet256_labeled.npz, the official ImageNet-256 FID reference batch (mirrored from guided-diffusion)
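After a download of this size, a quick check that the top-level directories arrived can save a failed sampling run. The helper below is a hypothetical convenience and not part of the repository; it assumes the download target was `artifacts` (as in the Usage command) and that the directories listed above sit directly under it.

```python
from pathlib import Path

# Top-level directories taken from the layout listing above.
EXPECTED_DIRS = (
    "checkpoints",
    "gmm",
    "decoders",
    "normalization_stats",
    "fid_reference",
)

def missing_artifacts(root):
    """Return the expected top-level directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]

if __name__ == "__main__":
    missing = missing_artifacts("artifacts")
    print("missing:", missing if missing else "none")
```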

Results

FID-50K on ImageNet-256, reproduced end-to-end with the released code and these artifacts (50-step Euler ODE, bf16 sampling). AG = autoguidance with the DiT-S guides, scale 1.5.

Encoder | Setting | Epochs | AG | FID | Checkpoint
DINOv2-B | GMM (uncond) | 20 | -- | 4.84 | checkpoints/dinov2-b/uncond-gmm-25k.pt
DINOv2-B | GMM (uncond) | 20 | ✓ | 4.03 | same + autoguidance/dit-s-uncond-gmm-25k.pt
DINOv2-B | GMM (uncond) | 80 | -- | 3.84 | checkpoints/dinov2-b/uncond-gmm-100k.pt
DINOv2-B | GMM (uncond) | 80 | ✓ | 3.17 | same + autoguidance/dit-s-uncond-gmm-25k.pt
DINOv2-B | GMM + Mode | 20 | -- | 4.77 | checkpoints/dinov2-b/mode-gmm-25k.pt
DINOv2-B | GMM + Mode | 20 | ✓ | 4.07 | same + autoguidance/dit-s-mode-gmm-25k.pt
DINOv2-B | GMM + Mode | 80 | -- | 3.20 | checkpoints/dinov2-b/mode-gmm-100k.pt
DINOv2-B | GMM + Mode | 80 | ✓ | 2.78 | same + autoguidance/dit-s-mode-gmm-25k.pt
SigLIP2-B | GMM (uncond) | 20 | -- | 8.23 | checkpoints/siglip2-b/uncond-gmm-25k.pt
SigLIP2-B | GMM + Mode | 20 | -- | 7.25 | checkpoints/siglip2-b/mode-gmm-25k.pt
MAE-B | GMM (uncond) | 20 | -- | 17.03 | checkpoints/mae-b/uncond-gmm-25k.pt
MAE-B | GMM + Mode | 20 | -- | 16.24 | checkpoints/mae-b/mode-gmm-25k.pt
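
The sampling setup behind these numbers (50-step Euler ODE, autoguidance at scale 1.5) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in `src/sample_ddp.py`: it assumes a uniform time grid from t = 0 (source) to t = 1 (data), and it assumes the guidance is applied to the velocity by extrapolating from the guide model toward the main model. The callables `v_main` and `v_guide` are hypothetical stand-ins for the DiT and DiT-S networks.

```python
import numpy as np

def guided_velocity(v_main, v_guide, x, t, scale=1.5):
    """Extrapolate from the guide's prediction toward the main model's."""
    vg = v_guide(x, t)
    # scale = 1 recovers the main model; scale > 1 pushes away from the guide.
    return vg + scale * (v_main(x, t) - vg)

def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy usage with constant velocity fields in place of trained networks.
v_main = lambda x, t: np.full_like(x, 2.0)
v_guide = lambda x, t: np.full_like(x, 1.0)
x_final = euler_sample(
    lambda x, t: guided_velocity(v_main, v_guide, x, t, scale=1.5),
    x0=np.zeros(4),
)
```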

Citing MM-FM

@InProceedings{Luo_2026_CVPR,
  author    = {Luo, Gaoxiang and Cole, Frank and Zhang, Sihang and Wan, Yuxiang and Lu, Yulong and Sun, Ju},
  title     = {Flow Matching for Multimodal Distributions},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {23260-23271}
}

Acknowledgments

MM-FM builds directly on RAE: the pretrained encoders and decoders (including decoders/ here) are RAE's, used as-is; the flow-matching models are trained in their latent space. The FID reference batch comes from guided-diffusion.
