🧩 TokenCompose SD21 Model Card

TokenCompose_SD21_A is a latent text-to-image diffusion model finetuned from the Stable-Diffusion-v2-1 checkpoint at a resolution of 768x768 on the VSR split of COCO image-caption pairs for 32,000 steps with a learning rate of 5e-6. In addition to the denoising loss, the training objective includes token-level grounding terms, which improve multi-category instance composition and photorealism. The "_A/B" postfix distinguishes separate finetuning runs that use the same configuration described above.

📄 Paper

TokenCompose: Grounding Diffusion with Token-level Supervision (arXiv:2312.03626).

🧨 Example Usage

We strongly recommend using the 🤗 Diffusers library to run our model.

import torch
from diffusers import StableDiffusionPipeline

model_id = "mlpc-lab/TokenCompose_SD21_A"
device = "cuda"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
pipe = pipe.to(device)

prompt = "A cat and a wine glass"
image = pipe(prompt).images[0]

image.save("cat_and_wine_glass.png")
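
The example above draws fresh random noise on every call, so each run produces a different image. The following is a minimal sketch of a reproducible variant, assuming the standard `StableDiffusionPipeline` call arguments (`generator`, `height`, `width`, `num_inference_steps`, `guidance_scale`). The `generate` helper, the `SETTINGS` dictionary, and the step count and guidance scale are illustrative assumptions, not values from this card; only the 768x768 size comes from the finetuning resolution stated above.

```python
MODEL_ID = "mlpc-lab/TokenCompose_SD21_A"

# Illustrative settings (assumptions, not values from the model card),
# except the 768x768 size, which matches the finetuning resolution.
SETTINGS = {
    "height": 768,
    "width": 768,
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
}


def generate(prompt, seed=0, device="cuda"):
    """Return one PIL image for `prompt`, using a fixed random seed."""
    # Imported lazily so the heavy libraries load only when an image is requested.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
    pipe = pipe.to(device)

    # A seeded generator fixes the initial latent noise for this call.
    generator = torch.Generator(device=device).manual_seed(seed)
    return pipe(prompt, generator=generator, **SETTINGS).images[0]
```

For instance, `generate("A cat and a wine glass", seed=42)` returns an image that can be saved with `.save(...)` as in the example above. Repeating the call with the same seed on the same hardware and library versions should give the same image. The helper reloads the pipeline on every call for brevity; for batch use, load it once and reuse it.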

⬆️ Improvements over SD21

| Model | Object Accuracy | MG3 COCO | MG4 COCO | MG5 COCO | MG3 ADE20K | MG4 ADE20K | MG5 ADE20K | FID COCO |
|---|---|---|---|---|---|---|---|---|
| SD21 | 47.82 | 70.14 | 25.57 | 3.27 | 75.13 | 35.07 | 7.16 | 19.59 |
| TokenCompose (SD21) | 60.10 | 80.48 | 36.69 | 5.71 | 79.51 | 39.59 | 8.13 | 19.15 |
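
To make the size of each improvement explicit, the short sketch below subtracts the SD21 score from the TokenCompose score for every column of the table above. The numbers are copied from the table; the `SCORES` and `deltas` names are illustrative. It assumes the usual reading that the accuracy and MG columns are higher-is-better, while FID is lower-is-better, so a negative FID difference is an improvement.

```python
# Scores copied from the table above, as (SD21, TokenCompose) pairs.
SCORES = {
    "Object Accuracy": (47.82, 60.10),
    "MG3 COCO": (70.14, 80.48),
    "MG4 COCO": (25.57, 36.69),
    "MG5 COCO": (3.27, 5.71),
    "MG3 ADE20K": (75.13, 79.51),
    "MG4 ADE20K": (35.07, 39.59),
    "MG5 ADE20K": (7.16, 8.13),
    "FID COCO": (19.59, 19.15),  # lower is better for FID
}


def deltas(scores):
    """Return TokenCompose minus SD21 for each metric, rounded to 2 decimals."""
    return {name: round(tc - base, 2) for name, (base, tc) in scores.items()}


if __name__ == "__main__":
    for name, delta in deltas(SCORES).items():
        print(f"{name:16s} {delta:+.2f}")
```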

📰 Citation

@misc{wang2023tokencompose,
      title={TokenCompose: Grounding Diffusion with Token-level Supervision}, 
      author={Zirui Wang and Zhizhou Sha and Zheng Ding and Yilin Wang and Zhuowen Tu},
      year={2023},
      eprint={2312.03626},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}