---
license: creativeml-openrail-m
language:
- en
library_name: diffusers
pipeline_tag: text-to-image
tags:
- stable-diffusion
- cvpr
- text-to-image
- image-generation
- compositionality
---

# 🧩 TokenCompose SD14 Model Card

## 🎬CVPR 2024

[TokenCompose_SD14_A](https://mlpc-ucsd.github.io/TokenCompose/) is a [latent text-to-image diffusion model](https://arxiv.org/abs/2112.10752) finetuned from the [**Stable-Diffusion-v1-4**](https://huggingface.co/CompVis/stable-diffusion-v1-4) checkpoint at resolution 512x512 on the [VSR](https://github.com/cambridgeltl/visual-spatial-reasoning) split of [COCO image-caption pairs](https://cocodataset.org/#download) for 24,000 steps with a learning rate of 5e-6. The training objective adds token-level grounding terms to the denoising loss to improve multi-category instance composition and photorealism. The "_A/B" postfix indicates different finetuning runs of the model that use the same configuration described above.

# 📄 Paper

Please follow [this](https://arxiv.org/abs/2312.03626) link.

# 🧨Example Usage

We strongly recommend using the [🤗 Diffusers](https://github.com/huggingface/diffusers) library to run our model.

```python
import torch
from diffusers import StableDiffusionPipeline

model_id = "mlpc-lab/TokenCompose_SD14_A"
device = "cuda"

# Load the finetuned checkpoint and move the pipeline to the GPU
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
pipe = pipe.to(device)

# Generate one image from a multi-category prompt and save it
prompt = "A cat and a wine glass"
image = pipe(prompt).images[0]

image.save("cat_and_wine_glass.png")
```

# ⬆️Improvements over SD14

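<table>
  <tr>
    <th rowspan="3">Method</th>
    <th colspan="9">Multi-category Instance Composition</th>
    <th colspan="2">Photorealism</th>
    <th colspan="1">Efficiency</th>
  </tr>
  <tr>
    <th rowspan="2">Object Accuracy</th>
    <th colspan="4">COCO</th>
    <th colspan="4">ADE20K</th>
    <th rowspan="2">FID (COCO)</th>
    <th rowspan="2">FID (Flickr30K)</th>
    <th rowspan="2">Latency</th>
  </tr>
  <tr>
    <th>MG2</th>
    <th>MG3</th>
    <th>MG4</th>
    <th>MG5</th>
    <th>MG2</th>
    <th>MG3</th>
    <th>MG4</th>
    <th>MG5</th>
  </tr>
  <tr>
    <td>SD 1.4</td>
    <td>29.86</td>
    <td>90.72<sub>1.33</sub></td>
    <td>50.74<sub>0.89</sub></td>
    <td>11.68<sub>0.45</sub></td>
    <td>0.88<sub>0.21</sub></td>
    <td>89.81<sub>0.40</sub></td>
    <td>53.96<sub>1.14</sub></td>
    <td>16.52<sub>1.13</sub></td>
    <td>1.89<sub>0.34</sub></td>
    <td>20.88</td>
    <td>71.46</td>
    <td>7.54<sub>0.17</sub></td>
  </tr>
  <tr>
    <td><strong>TokenCompose (Ours)</strong></td>
    <td>52.15</td>
    <td>98.08<sub>0.40</sub></td>
    <td>76.16<sub>1.04</sub></td>
    <td>28.81<sub>0.95</sub></td>
    <td>3.28<sub>0.48</sub></td>
    <td>97.75<sub>0.34</sub></td>
    <td>76.93<sub>1.09</sub></td>
    <td>33.92<sub>1.47</sub></td>
    <td>6.21<sub>0.62</sub></td>
    <td>20.19</td>
    <td>71.13</td>
    <td>7.56<sub>0.14</sub></td>
  </tr>
</table>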
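
To make the comparison easier to read, here is a minimal sketch that computes the per-metric difference between the two rows of the table above. The dictionary layout, the `metric_deltas` helper, and the `LOWER_IS_BETTER` set are illustrative choices rather than part of the released code. The numbers are the main values copied from the table, with the subscripted values omitted. Treating latency as lower-is-better is our assumption, and FID is a metric where lower values are better.

```python
# Main values copied from the table: (SD 1.4, TokenCompose).
# Subscripted values from the table are intentionally left out.
RESULTS = {
    "Object Accuracy": (29.86, 52.15),
    "COCO MG2": (90.72, 98.08),
    "COCO MG3": (50.74, 76.16),
    "COCO MG4": (11.68, 28.81),
    "COCO MG5": (0.88, 3.28),
    "ADE20K MG2": (89.81, 97.75),
    "ADE20K MG3": (53.96, 76.93),
    "ADE20K MG4": (16.52, 33.92),
    "ADE20K MG5": (1.89, 6.21),
    "FID (COCO)": (20.88, 20.19),
    "FID (Flickr30K)": (71.46, 71.13),
    "Latency": (7.54, 7.56),
}

# Assumption: for these columns a smaller number is the better result.
LOWER_IS_BETTER = {"FID (COCO)", "FID (Flickr30K)", "Latency"}


def metric_deltas(results):
    """Return TokenCompose minus SD 1.4 for each metric, rounded to 2 decimals."""
    return {name: round(ours - base, 2) for name, (base, ours) in results.items()}


if __name__ == "__main__":
    for name, delta in metric_deltas(RESULTS).items():
        direction = "lower is better" if name in LOWER_IS_BETTER else "higher is better"
        print(f"{name:>16}: {delta:+.2f} ({direction})")
```

Read this way, the composition metrics all move up, both FID scores move slightly down, and latency is essentially unchanged.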

# 📰 Citation

```bibtex
@InProceedings{Wang2024TokenCompose,
    author    = {Wang, Zirui and Sha, Zhizhou and Ding, Zheng and Wang, Yilin and Tu, Zhuowen},
    title     = {TokenCompose: Text-to-Image Diffusion with Token-level Supervision},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {8553-8564}
}
```