---
license: creativeml-openrail-m
language:
- en
library_name: diffusers
pipeline_tag: text-to-image
tags:
- stable-diffusion
---
🧩 TokenCompose SD21 Model Card
🎬 CVPR 2024
TokenCompose_SD21_A is a latent text-to-image diffusion model finetuned from the Stable-Diffusion-v2-1 checkpoint at 768x768 resolution on the VSR split of COCO image-caption pairs for 32,000 steps with a learning rate of 5e-6. In addition to the denoising loss, the training objective includes token-level grounding terms that improve multi-category instance composition and photorealism. The "_A" or "_B" suffix identifies separate finetuning runs that share the configuration above.
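To make the idea of a token-level grounding term concrete, here is a minimal pure-Python sketch. It is not the authors' training code: it assumes the term penalizes cross-attention mass that falls outside a segmentation mask for each grounded token, and the function names and the `weight` coefficient are hypothetical.

```python
def token_grounding_loss(attn_maps, masks):
    """Illustrative token-level grounding term (assumed form, not the authors' code).

    attn_maps: list of 2D lists, one cross-attention map per grounded token.
    masks: list of 2D lists of 0/1, one segmentation mask per grounded token.
    Returns the mean squared fraction of attention that lies outside the mask.
    """
    total = 0.0
    for attn, mask in zip(attn_maps, masks):
        mass = sum(sum(row) for row in attn)
        if mass == 0:
            raise ValueError("attention map has zero total mass")
        inside = sum(
            a * m
            for attn_row, mask_row in zip(attn, mask)
            for a, m in zip(attn_row, mask_row)
        )
        total += (1.0 - inside / mass) ** 2
    return total / len(attn_maps)


def combined_loss(denoise_loss, grounding_loss, weight=1e-3):
    # `weight` is a hypothetical coefficient chosen only for illustration.
    return denoise_loss + weight * grounding_loss
```

Under this assumption, a token whose attention sits entirely inside its mask contributes zero, so the grounding term only adds gradient signal when attention leaks onto other regions.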
📄 Paper
The paper is available on arXiv (arXiv:2312.03626).
🧨 Example Usage
We strongly recommend using the 🤗 Diffusers library to run our model.
```python
import torch
from diffusers import StableDiffusionPipeline

model_id = "mlpc-lab/TokenCompose_SD21_A"
device = "cuda"

# Load the finetuned checkpoint and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
pipe = pipe.to(device)

# Generate one image from a multi-category prompt
prompt = "A cat and a wine glass"
image = pipe(prompt).images[0]
image.save("cat_and_wine_glass.png")
```
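For reproducible outputs, the same pipeline call can be given an explicit seed and resolution. The sketch below is one possible variant, not a recommendation from the authors: the helper names `generation_settings` and `generate` are hypothetical, the 768 default mirrors the finetuning resolution stated above, and the step count and guidance scale are illustrative values only.

```python
MODEL_ID = "mlpc-lab/TokenCompose_SD21_A"


def generation_settings(size=768, steps=50, guidance_scale=7.5):
    """Return keyword arguments for the pipeline call.

    size defaults to 768 because the checkpoint was finetuned at 768x768;
    steps and guidance_scale are illustrative, not values tuned by the authors.
    """
    if size % 8 != 0:
        raise ValueError("height and width must be multiples of 8")
    return {
        "height": size,
        "width": size,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate(prompt, seed=0, device="cuda"):
    """Generate one image with a fixed seed so repeated runs match."""
    # Heavy imports stay local so generation_settings works without a GPU stack.
    import torch
    from diffusers import StableDiffusionPipeline

    # Half precision reduces memory on CUDA; use float32 on other devices.
    dtype = torch.float16 if device.startswith("cuda") else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=dtype)
    pipe = pipe.to(device)
    generator = torch.Generator(device=device).manual_seed(seed)
    return pipe(prompt, generator=generator, **generation_settings()).images[0]
```

Calling `generate("A cat and a wine glass", seed=0)` twice on the same hardware and library versions should yield matching images, which helps when comparing the "_A" and "_B" runs on identical prompts.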
⬆️ Improvements over SD21
Model | Object Accuracy | MG3 COCO | MG4 COCO | MG5 COCO | MG3 ADE20K | MG4 ADE20K | MG5 ADE20K | FID COCO |
---|---|---|---|---|---|---|---|---|
SD21 | 47.82 | 70.14 | 25.57 | 3.27 | 75.13 | 35.07 | 7.16 | 19.59 |
TokenCompose (SD21) | 60.10 | 80.48 | 36.69 | 5.71 | 79.51 | 39.59 | 8.13 | 19.15 |
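The gains in the table can be read off as absolute differences. The short script below recomputes them from the reported numbers, assuming (as is standard) that lower FID is better and that the accuracy and MG columns are higher-is-better; the variable and function names are illustrative.

```python
METRICS = [
    "Object Accuracy", "MG3 COCO", "MG4 COCO", "MG5 COCO",
    "MG3 ADE20K", "MG4 ADE20K", "MG5 ADE20K", "FID COCO",
]
SD21 = [47.82, 70.14, 25.57, 3.27, 75.13, 35.07, 7.16, 19.59]
TOKENCOMPOSE = [60.10, 80.48, 36.69, 5.71, 79.51, 39.59, 8.13, 19.15]


def deltas(baseline, finetuned):
    """Absolute change per metric, rounded to the table's two decimals."""
    return [round(f - b, 2) for b, f in zip(baseline, finetuned)]


for name, delta in zip(METRICS, deltas(SD21, TOKENCOMPOSE)):
    # A negative value for FID COCO is an improvement, since lower is better.
    print(f"{name}: {delta:+.2f}")
```

Every accuracy and MG column increases, while FID on COCO decreases slightly, so the composition gains do not come at the cost of a worse FID in this comparison.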
📰 Citation
@misc{wang2023tokencompose,
title={TokenCompose: Grounding Diffusion with Token-level Supervision},
author={Zirui Wang and Zhizhou Sha and Zheng Ding and Yilin Wang and Zhuowen Tu},
year={2023},
eprint={2312.03626},
archivePrefix={arXiv},
primaryClass={cs.CV}
}