metadata

license: other
pipeline_tag: image-to-image

StableSR Model Card

This model card covers the models associated with StableSR, available here.

Model Details

Developed by: Jianyi Wang
Model type: Diffusion-based image super-resolution model
License: S-Lab License 1.0
Model Description: This is the model used in Paper.
Resources for more information: GitHub Repository.

Cite as:

@InProceedings{wang2023exploiting,
    author    = {Wang, Jianyi and Yue, Zongsheng and Zhou, Shangchen and Chan, Kelvin CK and Loy, Chen Change},
    title     = {Exploiting Diffusion Prior for Real-World Image Super-Resolution},
    booktitle = {arXiv preprint arXiv:2305.07015},
    year      = {2023},
}

Uses

Please refer to the S-Lab License 1.0.

Limitations and Bias

Limitations

Bias

Although our model is based on a pre-trained Stable Diffusion model, we have not so far observed obvious bias in the generated results. We conjecture that the main reason is that our model is conditioned on low-resolution images rather than text prompts. Such a strong condition makes the output less likely to be affected by bias in the pre-trained model.

Training

Training Data

The model developer used the following datasets for training the model:

Our diffusion model is finetuned on the DF2K (DIV2K and Flickr2K) and OST datasets, available here.
We further generate 100k synthetic LR-HR pairs on DF2K_OST using the finetuned diffusion model for training the CFW module.

Training Procedure

StableSR is an image super-resolution model finetuned from Stable Diffusion and further equipped with a time-aware encoder and a controllable feature wrapping (CFW) module.

Following Stable Diffusion, images are encoded through the fixed VQGAN encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
The latent representations are fed to the time-aware encoder as guidance.
The loss is the same as that used to train Stable Diffusion.
After finetuning the diffusion model, we further train the CFW module using the data generated by the finetuned diffusion model.
The VQGAN model is fixed and only CFW is trainable.
The loss is similar to training a VQGAN, except that we use a fixed adversarial loss weight of 0.025 rather than a self-adjustable one.
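
The shape arithmetic and the fixed adversarial weight described above can be illustrated with a minimal sketch. It is not taken from the StableSR code base: the function names `latent_shape` and `cfw_generator_loss` are hypothetical, the divisibility check is an assumption made for clarity, and `recon_loss` is a placeholder for whatever non-adversarial terms the VQGAN-style objective contains. Only the values f = 8, 4 latent channels, and the weight 0.025 come from the text.

```python
def latent_shape(height, width, f=8, latent_channels=4):
    """Return the latent shape (H/f, W/f, C) for an H x W x 3 input image.

    f=8 and latent_channels=4 are the values stated in this model card.
    Requiring H and W to be multiples of f is an assumption of this sketch.
    """
    if height % f != 0 or width % f != 0:
        raise ValueError(f"height and width must be multiples of {f}")
    return (height // f, width // f, latent_channels)


def cfw_generator_loss(recon_loss, adv_loss, adv_weight=0.025):
    """Combine loss terms using a fixed adversarial weight.

    Hypothetical simplification: recon_loss stands in for all
    non-adversarial terms, and adv_weight is held constant at 0.025
    instead of being adjusted during training.
    """
    return recon_loss + adv_weight * adv_loss


if __name__ == "__main__":
    # A 512 x 512 x 3 image maps to a 64 x 64 x 4 latent.
    print(latent_shape(512, 512))  # (64, 64, 4)
    print(cfw_generator_loss(recon_loss=1.0, adv_loss=2.0))
```

Because the weight is a constant, the relative contribution of the adversarial term depends only on the magnitude of the adversarial loss itself, not on any quantity recomputed at each training step.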

We currently provide the following checkpoints:

stablesr_000117.ckpt: Diffusion model finetuned on DF2K_OST dataset for 117 epochs.
vqgan_cfw_00011.ckpt: CFW module with fixed VQGAN trained on synthetic paired data for 11 epochs.

Evaluation Results

See Paper for details.