MIRO (main)

Qualitative samples from the released MIRO checkpoint (same gallery as the teaser of the project page).

Main MIRO checkpoint. Trained jointly on all seven reward signals (CLIP, aesthetic, ImageReward, PickScore, HPSv2, VQAScore, SciScore) with a 50/50 mix of original and synthetic captions.

This checkpoint accompanies the paper MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency (Dufour, Degeorge, Ghosh, Kalogeiton, Picard — ICML 2026).


Paper	https://arxiv.org/abs/2510.25897
Project page	https://nicolas-dufour.github.io/miro/
Code	https://github.com/nicolas-dufour/miro
Parameters	360.4M
Resolution	256×256 (SDXL VAE latent space)
Architecture	RIN flow-matching backbone, FLAN-T5-XL text conditioning
Training data	CC12M + LAION Aesthetics v2 4.5 (6.0+ aesthetic subset)
Reward signals	`clip_score`, `aesthetic_score`, `image_reward_score`, `pick_a_score_score`, `hpsv2_score`, `vqa_score`, `sciscore_score`
Weights	`model.safetensors`, fp32 (EMA master weights — ready for finetuning)

Install

pip install miro-t2i

The public PyPI package is named miro-t2i, but the module is imported as miro (import miro). The first call to MiroPipeline.from_pretrained(...) will additionally fetch google/flan-t5-xl (text encoder) and stabilityai/sdxl-vae (latent decoder) from the Hub.

Usage

import torch
from miro import MiroPipeline

pipe = MiroPipeline.from_pretrained("nicolas-dufour/miro")
pipe = pipe.to("cuda", torch.float16)

prompt = (
    "Photography closeup portrait of an adorable rusty brokendown steampunk "
    "robot covered in budding vegetation, surrounded by tall grass, misty "
    "futuristic scifi forest environment."
)
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.0)[0]
image.save("out.png")

Reward conditioning

MIRO conditions the flow model on a vector of reward targets in addition to the text prompt. By default every reward is requested at its maximum (1.0); you can override individual axes to bias generation toward a particular trade-off:

image = pipe(
    prompt,                       # the rusty-robot prompt from above
    reward_targets={
        "clip_score": 1.0,        # strict prompt alignment
        "aesthetic_score": 0.3,   # de-prioritise prettiness
        "image_reward_score": 1.0,  # prioritise general human preference
        # any reward not listed defaults to 1.0
    },
    negative_reward_targets={
        # zeros by default; what to push the unconditional branch toward
    },
    guidance_scale=7.0,
)[0]
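
The two dictionaries can be read through the lens of classifier-free guidance: one prediction is conditioned on reward_targets, another on negative_reward_targets, and guidance_scale controls how far the result is pushed away from the latter and toward the former. The sketch below shows that standard combination on toy values. That MIRO's sampler uses exactly this formula is an assumption, and guided_velocity is an illustrative name, not a function from the miro package.

```python
# Sketch of the standard classifier-free guidance combination.
# Assumption: MIRO's pipeline combines its two branches this way; the
# function name and the list-based representation are illustrative only.


def guided_velocity(v_positive, v_negative, guidance_scale):
    """Extrapolate from the negative branch toward the positive branch."""
    return [
        neg + guidance_scale * (pos - neg)
        for pos, neg in zip(v_positive, v_negative)
    ]


# Toy two-element "predictions" standing in for model outputs.
v_pos = [0.5, -1.0]   # branch conditioned on reward_targets
v_neg = [0.25, -0.5]  # branch conditioned on negative_reward_targets

print(guided_velocity(v_pos, v_neg, 1.0))  # scale 1 recovers the positive branch
print(guided_velocity(v_pos, v_neg, 7.0))  # larger scales extrapolate past it
```

Under this reading, lowering a reward in negative_reward_targets widens the gap between the two branches along that axis, so the same guidance_scale pushes harder on it.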

The seven reward dimensions are:

Reward	Normalised range	What it measures
`clip_score`	~[0, 1]	CLIP text–image alignment
`aesthetic_score`	~[0, 1]	LAION aesthetic-quality predictor
`image_reward_score`	~[0, 1]	ImageReward (general preference model)
`pick_a_score_score`	~[0, 1]	PickScore (human preference)
`hpsv2_score`	~[0, 1]	HPSv2 (human preference v2)
`vqa_score`	~[0, 1]	VQAScore (compositional faithfulness)
`sciscore_score`	~[0, 1]	SciScore (scientific-image plausibility)
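
Because unlisted rewards default to 1.0, it can help to build the full seven-entry dictionary explicitly so that a misspelled key fails loudly. The helper below is a hypothetical sketch, not part of the miro package. Whether the pipeline itself rejects unknown keys is not stated here, and the range check assumes the normalised range in the table is meant as a hard bound.

```python
# Hypothetical helper (not part of the miro package): expand a partial set of
# reward overrides into the full seven-entry dictionary.

REWARD_NAMES = (
    "clip_score",
    "aesthetic_score",
    "image_reward_score",
    "pick_a_score_score",
    "hpsv2_score",
    "vqa_score",
    "sciscore_score",
)


def build_reward_targets(overrides=None, default=1.0):
    """Return one target per reward, filling unlisted rewards with `default`."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - set(REWARD_NAMES))
    if unknown:
        raise KeyError(f"unknown reward name(s): {unknown}")
    targets = {name: float(overrides.get(name, default)) for name in REWARD_NAMES}
    for name, value in targets.items():
        # Assumption: treat the normalised range [0, 1] as a hard bound.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
    return targets


targets = build_reward_targets({"aesthetic_score": 0.3})
print(targets)
```

The resulting dictionary can be passed as reward_targets in the call shown earlier; listing all seven keys also makes the requested trade-off explicit when logging experiments.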

Reported benchmarks

The paper reports the following headline numbers for the main MIRO model (this repo's nicolas-dufour/miro):

Metric	MIRO (360M)	FLUX-dev (12B)
GenEval (overall)	75 (with inference-time reward tuning) / 68 (default)	67
Inference compute	1×	~370×
Aesthetic-metric convergence vs. baseline pretraining	19× faster	—

Per-variant scores (GenEval, FID, individual reward scores) for the eight ablations are reported in the paper's ablation tables. Please refer to arXiv:2510.25897 for the full breakdown.

Training compute and data

Default hardware: 2 nodes × 8 H100 GPUs (16× H100, 16-mixed precision)
Optimiser: LAMB, lr 1e-3 (5k warmup → cosine decay), weight decay 1e-2
Batch size: 1024 globally (64 per GPU on 16× H100), gradient-clip 2.0
Steps: 500k (≈ 29 epochs over the enriched training set)
Wall-clock on 16× H100: ~52 hours (≈ 2.65 train it/s sustained)
8-GPU fallback: 1 node × 8 H100 with trainer.accumulate_grad_batches=2, measured at ≈ 1.45 train it/s → ~96 hours (~4 days) end-to-end. Requires trainer.strategy.static_graph=false and trainer.strategy.find_unused_parameters=true to play well with the self-conditioning skip in the loss; both flags are set automatically by miro/slurm/launch_multicad_synth_8gpu.py.
Data: CC12M + LAION Aesthetics v2 4.5 filtered to aesthetic_score >= 6.0 (the higher-quality subset), encoded to SDXL VAE latents at 256 resolution. Each sample is paired with seven reward scores and FLAN-T5-XL embeddings of both the original and a synthetic caption, computed by miro/data/preprocess_data.py.
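
The figures in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. That the per-GPU batch stays at 64 in the 8-GPU fallback is an assumption, and the implied dataset size is derived from the approximate epoch count, not an official figure.

```python
# Sanity-check the training figures quoted above using plain arithmetic.

steps = 500_000
per_gpu_batch = 64

# 16-GPU default: 64 samples per GPU across 16 GPUs.
global_batch_16 = per_gpu_batch * 16
# 8-GPU fallback (assumes the per-GPU batch stays at 64): gradient
# accumulation of 2 restores the same global batch.
global_batch_8 = per_gpu_batch * 8 * 2

hours_16 = steps / 2.65 / 3600  # 2.65 train it/s sustained
hours_8 = steps / 1.45 / 3600   # 1.45 train it/s measured

samples_seen = steps * global_batch_16
# Implied dataset size if 500k steps cover about 29 epochs.
implied_dataset_size = samples_seen / 29

print(global_batch_16, global_batch_8)
print(round(hours_16, 1), round(hours_8, 1))
print(samples_seen, round(implied_dataset_size / 1e6, 1))
```

Both configurations come out at a global batch of 1024, and the step rates give roughly 52.4 and 95.8 hours, in line with the ~52 and ~96 hours quoted above.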

Limitations and intended use

This checkpoint is a research artifact released to reproduce and build on the MIRO paper. Known limitations:

Resolution: 256×256 only. Higher-resolution outputs require upscaling.
Domain: trained on web-scraped image–caption pairs (CC12M + LAION Aesthetics 6.0). Inherits the biases of those datasets — including under-representation of many cultures, languages, and concepts, and the presence of stereotypes. Generations may reflect or amplify these biases.
Reward-model biases: the seven reward predictors used during training encode their own biases (e.g. aesthetic and human-preference models reflect the taste of their annotator pools). Conditioning on these rewards inherits and can sharpen those biases.
Not for safety-critical use: outputs are not factual and the SciScore reward does not guarantee scientific accuracy.
No safety filter is shipped with the model; users deploying it in user-facing settings should add their own.

The model is released under the MIT license; the SDXL VAE and FLAN-T5-XL encoder it depends on at inference time are loaded from stabilityai/sdxl-vae and google/flan-t5-xl and are subject to their respective licenses.

Citation

@inproceedings{dufour2026miro,
  title     = {{MIRO}: {M}ult{I}-{R}eward c{O}nditioned pretraining improves {T2I} quality and efficiency},
  author    = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}

License

MIT — see https://github.com/nicolas-dufour/miro/blob/main/LICENSE.
