LLM-grounded Video Diffusion Models

Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li at UC Berkeley/UCSF. ICLR 2024.

Project Page | Related Project: LMD | Citation

This model is based on ModelScope, with additional bounding-box conditioning added in the style of GLIGEN.

Like LLM-grounded Diffusion (LMD), the boxes-to-video stage of LLM-grounded Video Diffusion (LVD) supports cross-attention-based bounding-box conditioning, which works with off-the-shelf ModelScope. This Hugging Face model offers an alternative: we train a GLIGEN model (i.e., transformer adapters) on SA-1B using ModelScope's weights without the temporal transformer blocks, treating it as an SD v2.1 model that has been fine-tuned to 256x256 resolution. We then merge the adapters into ModelScope to provide box conditioning. The resulting model is hosted in this Hugging Face repository. Its adapter-based conditioning can be combined with cross-attention-based conditioning or used on its own, similar to LMD+. The model can take its layouts from the LLM-based text-to-dynamic-scene-layout generator in LVD, or it can be used independently as a video version of GLIGEN.
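
To make the idea of a dynamic scene layout more concrete, here is a minimal Python sketch of one way per-frame bounding boxes could be represented and generated. Everything in it is an illustrative assumption rather than this model's actual interface: the helper names `interpolate_boxes` and `build_layout`, the normalized `(x_min, y_min, x_max, y_max)` coordinate convention, the example phrase, and the use of linear interpolation between a start and an end box (in LVD the layouts come from an LLM, not from interpolation).

```python
from typing import Dict, List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), each normalized to [0, 1].
Box = Tuple[float, float, float, float]


def interpolate_boxes(start: Box, end: Box, num_frames: int) -> List[Box]:
    """Linearly move a box from `start` to `end` over `num_frames` frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames == 1:
        return [start]
    boxes: List[Box] = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 on the first frame, 1.0 on the last
        boxes.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return boxes


def build_layout(
    tracks: Dict[str, Tuple[Box, Box]], num_frames: int
) -> List[Dict[str, Box]]:
    """Return one {phrase: box} mapping per frame (hypothetical layout format)."""
    per_object = {
        phrase: interpolate_boxes(start, end, num_frames)
        for phrase, (start, end) in tracks.items()
    }
    return [
        {phrase: per_object[phrase][frame] for phrase in tracks}
        for frame in range(num_frames)
    ]


# Hypothetical example: one object moving from the left edge to the right edge.
layout = build_layout(
    {"a brown bear": ((0.0, 0.4, 0.3, 0.9), (0.7, 0.4, 1.0, 0.9))},
    num_frames=16,
)
```

Each element of `layout` pairs every phrase with its box for that frame, which is the kind of per-frame grounding information a boxes-to-video stage would consume. How such boxes are actually passed to this model depends on the accompanying implementation and is not specified here.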

Citation (LVD)

If you use our work, our model, or the implementation in this repo, or find them helpful, please consider citing us.

@article{lian2023llmgroundedvideo,
      title={LLM-grounded Video Diffusion Models}, 
      author={Lian, Long and Shi, Baifeng and Yala, Adam and Darrell, Trevor and Li, Boyi},
      journal={arXiv preprint arXiv:2309.17444},
      year={2023},
}

@article{lian2023llmgrounded,
    title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models}, 
    author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
    journal={arXiv preprint arXiv:2305.13655},
    year={2023}
}

Citation (GLIGEN)

The adapters in this model are trained in a manner similar to GLIGEN adapters.

@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}

Citation (ModelScope)

ModelScope is LVD's base model.

@article{wang2023modelscope,
    title={Modelscope text-to-video technical report},
    author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
    journal={arXiv preprint arXiv:2308.06571},
    year={2023}
}

@InProceedings{VideoFusion,
    author    = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
    title     = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023}
}

LICENSE

ModelScope is released under the CC-BY-NC 4.0 license. The GLIGEN adapters are trained on SA-1B, which is subject to the SA-1B license.