MIRO (main)

Qualitative samples from the released MIRO checkpoint — same gallery as the teaser of the project page.

This is the main MIRO checkpoint, trained jointly on all seven reward signals (CLIP, aesthetic, ImageReward, PickScore, HPSv2, VQAScore, SciScore) with a 50/50 mix of original and synthetic captions.

This checkpoint accompanies the paper MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency (Dufour, Degeorge, Ghosh, Kalogeiton, Picard — ICML 2026).

Paper: https://arxiv.org/abs/2510.25897
Project page: https://nicolas-dufour.github.io/miro/
Code: https://github.com/nicolas-dufour/miro
Parameters: 360.4M
Resolution: 256×256 (SDXL VAE latent space)
Architecture: RIN flow-matching backbone, FLAN-T5-XL text conditioning
Training data: CC12M + LAION Aesthetics v2 4.5 (6.0+ aesthetic subset)
Reward signals: clip_score, aesthetic_score, image_reward_score, pick_a_score_score, hpsv2_score, vqa_score, sciscore_score
Weights: model.safetensors, fp32 (EMA master weights, ready for finetuning)
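
As a rough sanity check on the Parameters, Resolution and Weights entries above, the sketch below derives the latent tensor shape and the approximate size of the weights. It assumes the standard SDXL VAE configuration (8× spatial downsampling, 4 latent channels), which this card does not state, and it counts only the 360.4M listed parameters. The helper names are illustrative and not part of the miro package.

```python
# Back-of-the-envelope figures for the model summary above.
# Assumptions (not stated in this card): the SDXL VAE downsamples each
# spatial side by 8x and produces 4 latent channels.
PARAMS = 360.4e6          # listed parameter count
IMAGE_SIZE = 256          # listed resolution
VAE_DOWNSAMPLE = 8        # assumed
LATENT_CHANNELS = 4       # assumed


def latent_shape(image_size=IMAGE_SIZE):
    """Return (channels, height, width) of the latent for a square image."""
    side = image_size // VAE_DOWNSAMPLE
    return (LATENT_CHANNELS, side, side)


def weights_gb(num_params=PARAMS, bytes_per_param=4):
    """Approximate weight size in GB (10^9 bytes); 4 bytes/param is fp32."""
    return num_params * bytes_per_param / 1e9


print(latent_shape())                            # (4, 32, 32)
print(round(weights_gb(), 2))                    # fp32: 1.44
print(round(weights_gb(bytes_per_param=2), 2))   # fp16: 0.72
```

These numbers cover the MIRO backbone only; the FLAN-T5-XL encoder and the VAE add their own memory at inference time.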

Install

pip install miro-t2i

miro-t2i is the public PyPI package; the module it installs is named miro (import miro). The first call to MiroPipeline.from_pretrained(...) additionally fetches google/flan-t5-xl (text encoder) and stabilityai/sdxl-vae (latent decoder) from the Hub.
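
If the first-call download is inconvenient (for example on a compute node without internet access), the two dependencies can be fetched ahead of time. The following is a minimal sketch that uses snapshot_download from huggingface_hub; the prefetch helper and its download parameter are illustrative names, not part of miro, and the sketch assumes that MiroPipeline reads these repositories from the default Hub cache.

```python
# Pre-populate the local Hugging Face cache with the two repositories that
# the card says are fetched on first use.
DEPENDENCY_REPOS = ("google/flan-t5-xl", "stabilityai/sdxl-vae")


def prefetch(repos=DEPENDENCY_REPOS, download=None):
    """Download each repo and return a {repo_id: local_path} mapping.

    `download` is injectable so the helper can be exercised without network
    access; by default it is huggingface_hub.snapshot_download.
    """
    if download is None:
        # Imported lazily so that merely defining the helper needs no extra package.
        from huggingface_hub import snapshot_download as download
    return {repo: download(repo_id=repo) for repo in repos}


# Example (downloads several GB on first run, so it is left commented out):
# paths = prefetch()
```

Run this once on a machine with network access; later calls reuse the cached snapshots.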

Usage

import torch
from miro import MiroPipeline

pipe = MiroPipeline.from_pretrained("nicolas-dufour/miro")
pipe = pipe.to("cuda", torch.float16)

prompt = (
    "Photography closeup portrait of an adorable rusty broken-down steampunk "
    "robot covered in budding vegetation, surrounded by tall grass, misty "
    "futuristic sci-fi forest environment."
)
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.0)[0]
image.save("out.png")

Reward conditioning

MIRO conditions the flow model on a vector of reward targets in addition to the text prompt. By default every reward is requested at its maximum (1.0); you can override individual axes to bias generation toward a particular trade-off:

image = pipe(
    prompt,                       # the rusty-robot prompt from above
    reward_targets={
        "clip_score": 1.0,        # strict prompt alignment
        "aesthetic_score": 0.3,   # de-prioritise prettiness
        "image_reward_score": 1.0,  # prioritise general human preference
        # any reward not listed defaults to 1.0
    },
    negative_reward_targets={
        # zeros by default; what to push the unconditional branch toward
    },
    guidance_scale=7.0,
)[0]

The seven reward dimensions are:

Reward | Normalised range | What it measures
clip_score | ~[0, 1] | CLIP text–image alignment
aesthetic_score | ~[0, 1] | LAION aesthetic-quality predictor
image_reward_score | ~[0, 1] | ImageReward (general preference model)
pick_a_score_score | ~[0, 1] | PickScore (human preference)
hpsv2_score | ~[0, 1] | HPSv2 (human preference v2)
vqa_score | ~[0, 1] | VQAScore (compositional faithfulness)
sciscore_score | ~[0, 1] | SciScore (scientific-image plausibility)
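
A misspelled reward name is easy to miss when passing a partial dictionary. The sketch below is a small, hypothetical helper (not part of the miro package) that expands a partial reward_targets dictionary into the full seven-entry vector, using the default of 1.0 described above. Rejecting values outside [0, 1] is an assumption on our side, since the ranges are listed as approximate.

```python
# The seven reward names listed in this card.
REWARD_NAMES = (
    "clip_score",
    "aesthetic_score",
    "image_reward_score",
    "pick_a_score_score",
    "hpsv2_score",
    "vqa_score",
    "sciscore_score",
)


def fill_reward_targets(overrides=None, default=1.0):
    """Return a complete {reward_name: target} dict.

    Unknown names raise KeyError; values outside [0, 1] raise ValueError
    (a strict check chosen for this sketch, not a documented constraint).
    """
    overrides = dict(overrides or {})
    unknown = set(overrides) - set(REWARD_NAMES)
    if unknown:
        raise KeyError(f"unknown reward name(s): {sorted(unknown)}")
    targets = {name: float(overrides.get(name, default)) for name in REWARD_NAMES}
    for name, value in targets.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
    return targets


targets = fill_reward_targets({"aesthetic_score": 0.3})
print(targets["aesthetic_score"], targets["clip_score"])  # 0.3 1.0
```

The resulting dictionary can then be passed as reward_targets=targets in the call shown earlier.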

Reported benchmarks

The paper reports the following headline numbers for the main MIRO model (this repo's nicolas-dufour/miro):

Metric | MIRO (350M) | FLUX-dev (12B)
GenEval (overall) | 75 (with inference-time reward tuning) / 68 (default) | 67
Inference compute ~370×
Aesthetic-metric convergence vs. baseline pretraining 19× faster

Per-variant scores (GenEval, FID, individual reward scores) for the eight ablations are reported in the paper's ablation tables. Please refer to arXiv:2510.25897 for the full breakdown.

Training compute and data

  • Default hardware: 2 nodes × 8 H100 GPUs (16× H100, 16-mixed precision)
  • Optimiser: LAMB, lr 1e-3 (5k warmup → cosine decay), weight decay 1e-2
  • Batch size: 1024 globally (64 per GPU on 16× H100), gradient-clip 2.0
  • Steps: 500 k (≈ 29 epochs over the enriched training set)
  • Wall-clock on 16× H100: ~52 hours (≈ 2.65 train it/s sustained)
  • 8-GPU fallback: 1 node × 8 H100 with trainer.accumulate_grad_batches=2, measured at ≈ 1.45 train it/s, i.e. about 96 hours (4 days) end-to-end. This setup requires trainer.strategy.static_graph=false and trainer.strategy.find_unused_parameters=true to work with the self-conditioning skip in the loss; both flags are set automatically by miro/slurm/launch_multicad_synth_8gpu.py.
  • Data: CC12M + LAION Aesthetics v2 4.5 filtered to aesthetic_score >= 6.0 (the higher-quality subset), encoded to SDXL VAE latents at 256 resolution. Each sample is paired with seven reward scores and FLAN-T5-XL embeddings of both the original and a synthetic caption, computed by miro/data/preprocess_data.py.
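
The batch-size and wall-clock figures in the list above follow from simple arithmetic, reproduced in the sketch below. It assumes that "train it/s" counts optimiser steps (that is, after gradient accumulation) and that the 8-GPU fallback keeps 64 samples per GPU; both assumptions are consistent with the quoted numbers but are not stated explicitly. The function names are illustrative.

```python
STEPS = 500_000        # total training steps
PER_GPU_BATCH = 64     # samples per GPU per forward/backward pass


def global_batch(gpus, accumulate=1):
    """Effective batch size per optimiser step."""
    return PER_GPU_BATCH * gpus * accumulate


def wall_clock_hours(iters_per_sec, steps=STEPS):
    """Training time in hours at a sustained step rate."""
    return steps / iters_per_sec / 3600


print(global_batch(16))                    # 1024 on 16x H100
print(global_batch(8, accumulate=2))       # 1024 on 8x H100 with accumulation
print(round(wall_clock_hours(2.65), 1))    # 52.4 hours
print(round(wall_clock_hours(1.45), 1))    # 95.8 hours

# Samples seen over the run, and the dataset size implied by ~29 epochs.
samples_seen = STEPS * global_batch(16)
print(samples_seen)                        # 512000000
print(round(samples_seen / 29 / 1e6, 1))   # 17.7 (million samples)
```

Both hardware configurations reach the same global batch of 1024, so the optimiser settings can stay unchanged between them.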

Limitations and intended use

This checkpoint is a research artifact released to reproduce and build on the MIRO paper. Known limitations:

  • Resolution: 256×256 only. Higher-resolution outputs require upscaling.
  • Domain: trained on web-scraped image–caption pairs (CC12M + LAION Aesthetics 6.0). Inherits the biases of those datasets — including under-representation of many cultures, languages, and concepts, and the presence of stereotypes. Generations may reflect or amplify these biases.
  • Reward-model biases: the seven reward predictors used during training encode their own biases (e.g. aesthetic and human-preference models reflect the taste of their annotator pools). Conditioning on these rewards inherits and can sharpen those biases.
  • Not for safety-critical use: outputs are not factual and the SciScore reward does not guarantee scientific accuracy.
  • No safety filter is shipped with the model; users deploying it in user-facing settings should add their own.

The model is released under the MIT license; the SDXL VAE and FLAN-T5-XL encoder it depends on at inference time are loaded from stabilityai/sdxl-vae and google/flan-t5-xl and are subject to their respective licenses.

Citation

@inproceedings{dufour2026miro,
  title     = {{MIRO}: {M}ult{I}-{R}eward c{O}nditioned pretraining improves {T2I} quality and efficiency},
  author    = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}

License

MIT — see https://github.com/nicolas-dufour/miro/blob/main/LICENSE.
