SEED-Story

TL;DR: We introduce SEED-Story, an MLLM capable of generating multimodal long stories consisting of rich and coherent narrative texts, along with images that are consistent in characters and style. We also release the StoryStream Dataset for building this model.

Introduction

SEED-Story, powered by an MLLM, is capable of generating multimodal long stories from user-provided images and texts that serve as the beginning of the story. The generated story consists of rich and coherent narrative texts, along with images that are consistent in characters and style. The story can span up to 25 multimodal sequences, even though we use a maximum of 10 sequences during training.

Overview of SEED-Story. Training Pipeline: In Stage 1, we pre-train an SD-XL-based de-tokenizer to reconstruct images by taking the features of a pre-trained ViT as inputs. In Stage 2, we sample an interleaved image-text sequence of a random length and train the MLLM by performing next-word prediction and image feature regression between the output hidden states of the learnable queries and the ViT features of the target image. In Stage 3, the regressed image features from the MLLM are fed into the de-tokenizer for tuning SD-XL, enhancing the consistency of the characters and styles in the generated images.
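The Stage 2 objective described above combines a next-word prediction term with an image feature regression term. The following is a minimal, framework-free sketch of how such a combined loss could be computed. It is not the repository's training code: the function names, the `image_weight` parameter, and the choice of mean squared error as the regression distance are all assumptions made for illustration.

```python
import math


def next_token_loss(logits, target_ids):
    """Mean cross-entropy over text positions.

    logits[i] holds unnormalized vocabulary scores for position i,
    target_ids[i] is the index of the correct next token.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)


def feature_regression_loss(query_states, vit_features):
    """Mean squared error between the output hidden states of the
    learnable queries and the ViT features of the target image.
    (MSE is an assumption here; the actual distance may differ.)
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(query_states, vit_features):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def stage2_loss(logits, target_ids, query_states, vit_features, image_weight=1.0):
    """Sum of the text loss and the weighted image regression loss.

    image_weight is a hypothetical balancing factor.
    """
    text_term = next_token_loss(logits, target_ids)
    image_term = feature_regression_loss(query_states, vit_features)
    return text_term + image_weight * image_term
```

In a real training loop these quantities would be tensors computed by the MLLM, and the same two terms would be evaluated with a deep learning framework so that gradients can flow back through the model.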

Model Weights

We release the pretrained Tokenizer, the pretrained De-Tokenizer, the pre-trained foundation model SEED-X-pretrained, the StoryStream instruction-tuned MLLM SEED-Story-George, and the StoryStream-tuned De-Tokenizer Detokenizer-George in SEED-X-17B Hugging Face.

Please download the checkpoints and save them under the folder ./pretrained.

You also need to download stable-diffusion-xl-base-1.0 and Qwen-VL-Chat, and save them under the folder ./pretrained. Please use the following script to extract the weights of the visual encoder in Qwen-VL-Chat.

python3 src/tools/reload_qwen_vit.py
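Before running the script above, it may help to confirm that the third-party models are in place. The snippet below is a small sanity check, not part of the repository. It assumes that each model was saved in a subfolder of ./pretrained named exactly after the model (`stable-diffusion-xl-base-1.0` and `Qwen-VL-Chat`); adjust the names if your layout differs. The helper `missing_checkpoints` is a hypothetical name introduced for this example.

```python
from pathlib import Path

# Assumed subfolder names, taken from the model names mentioned above.
REQUIRED_DIRS = ("stable-diffusion-xl-base-1.0", "Qwen-VL-Chat")


def missing_checkpoints(root="./pretrained", required=REQUIRED_DIRS):
    """Return the names in `required` that are not directories under `root`."""
    root_path = Path(root)
    return [name for name in required if not (root_path / name).is_dir()]
```

Calling `missing_checkpoints()` from the repository root returns an empty list when both folders are present, and otherwise lists the ones that still need to be downloaded.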

Citation

If you find the work helpful, please consider citing:

@article{ge2024seed,
  title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},
  author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},
  journal={arXiv preprint arXiv:2404.14396},
  year={2024}
}

License

SEED-Story is licensed under the Apache License Version 2.0 except for the third-party components listed in License.