# SEED-Story

[![arXiv](https://img.shields.io/badge/arXiv-2404.14396-b31b1b.svg)](https://arxiv.org/)
[![Static Badge](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/TencentARC/SEED-Story)
[![Static Badge](https://img.shields.io/badge/Dataset-Huggingface-yellow)](https://huggingface.co/datasets/TencentARC/StoryStream)
[![Static Badge](https://img.shields.io/badge/GitHub-black?logo=github)](https://github.com/TencentARC/SEED-Story)

**TL;DR:** We introduce SEED-Story, a multimodal large language model (MLLM) capable of generating long multimodal stories that consist of rich and coherent narrative text, along with images that are consistent in characters and style. We also release the StoryStream Dataset used to build this model.

## Introduction

SEED-Story, powered by an MLLM, generates long multimodal stories from user-provided images and text that serve as the beginning of the story. The generated story consists of rich and coherent narrative text, along with images that are consistent in characters and style. A story can span up to 25 multimodal sequences, even though we use at most 10 sequences during training.

<img src="assets/teaser.jpg" width="800" alt="Teaser image">

Overview of the SEED-Story training pipeline. In Stage 1, we pre-train an SD-XL-based de-tokenizer to reconstruct images from the features of a pre-trained ViT. In Stage 2, we sample an interleaved image-text sequence of random length and train the MLLM with next-word prediction and image feature regression between the output hidden states of the learnable queries and the ViT features of the target image. In Stage 3, the regressed image features from the MLLM are fed into the de-tokenizer to tune SD-XL, which enhances the consistency of characters and style in the generated images.
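
The Stage 2 objective described above combines a next-word prediction loss with an image feature regression loss. The following is a minimal, dependency-free sketch of how such a combined loss could be computed. It is not the repository's training code: the use of mean squared error for the regression term, the `image_weight` parameter, and all function names are assumptions made for illustration only.

```python
import math


def next_word_loss(logits, target_ids):
    """Mean cross-entropy of the target token ids under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total / len(target_ids)


def feature_regression_loss(query_states, vit_features):
    """Mean squared error between query hidden states and target ViT features.

    MSE is an assumption here; the text only says "regression".
    """
    squared, count = 0.0, 0
    for state, feature in zip(query_states, vit_features):
        for s, f in zip(state, feature):
            squared += (s - f) ** 2
            count += 1
    return squared / count


def stage2_loss(logits, target_ids, query_states, vit_features, image_weight=1.0):
    """Text loss plus weighted image-feature regression (weighting is assumed)."""
    text = next_word_loss(logits, target_ids)
    image = feature_regression_loss(query_states, vit_features)
    return text + image_weight * image


# Toy example: 2 text positions over a 3-word vocabulary, 2 queries of width 2.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_ids = [0, 2]
query_states = [[1.0, 0.0], [0.0, 1.0]]
vit_features = [[1.0, 0.0], [0.0, 0.0]]
print(stage2_loss(logits, target_ids, query_states, vit_features))
```

In a real setup the two terms would be computed on tensors by a deep learning framework, with the text loss taken over text positions and the regression loss over the learnable-query positions of each target image.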

<img src="assets/pipeline.jpg" width="800" alt="Pipeline image">

## Model Weights

We release the pre-trained Tokenizer, the pre-trained De-Tokenizer, the pre-trained foundation model **SEED-X-pretrained**, the StoryStream instruction-tuned MLLM **SEED-Story-George**, and the StoryStream-tuned De-Tokenizer **Detokenizer-George** on [SEED-Story Hugging Face](https://huggingface.co/TencentARC/SEED-Story).

Please download the checkpoints and save them under the folder `./pretrained`.

You also need to download [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat), and save them under the folder `./pretrained`. Then use the following script to extract the weights of the visual encoder in Qwen-VL-Chat:

```bash
python3 src/tools/reload_qwen_vit.py
```
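
Before running the extraction script, it can help to confirm that the third-party checkpoints are where the instructions expect them. The sketch below is a hypothetical helper, not part of the repository. It assumes the two downloads were saved in sub-folders named after their Hugging Face repositories (`stable-diffusion-xl-base-1.0` and `Qwen-VL-Chat`), so adjust the names if your layout differs.

```python
from pathlib import Path

# Assumed folder names, taken from the Hugging Face repository names;
# change them to match the names you used when downloading.
EXPECTED_DIRS = ("stable-diffusion-xl-base-1.0", "Qwen-VL-Chat")


def missing_checkpoints(root="./pretrained", expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under `root`."""
    root_path = Path(root)
    return [name for name in expected if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing under ./pretrained:", ", ".join(missing))
    else:
        print("All expected checkpoint folders are present.")
```

The helper only checks the two third-party folders, because the sub-folder names of the SEED-Story checkpoints are not stated here.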

## Citation

If you find the work helpful, please consider citing:

```bibtex
@article{ge2024seed,
  title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},
  author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},
  journal={arXiv preprint arXiv:2404.14396},
  year={2024}
}
```

## License

`SEED-Story` is licensed under the Apache License Version 2.0 except for the third-party components listed in [License](License_Seed-Story.txt). |