	# SEED-Story
	[![arXiv](https://img.shields.io/badge/arXiv-2404.14396-b31b1b.svg)](https://arxiv.org/)
	[![Static Badge](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/TencentARC/SEED-Story)
	[![Static Badge](https://img.shields.io/badge/Dataset-Huggingface-yellow)](https://huggingface.co/datasets/TencentARC/StoryStream)
	[![Static Badge](https://img.shields.io/badge/GitHub-black?logo=github)](https://github.com/TencentARC/SEED-Story)

	TL;DR: We introduce SEED-Story, an MLLM capable of generating multimodal
	long stories consisting of rich and coherent narrative texts, along with images that are consistent in characters and
	style. We also release the StoryStream dataset used to build this model.

	## Introduction
	SEED-Story, powered by an MLLM, generates multimodal long stories from user-provided images and texts that serve as the beginning of the story. The generated story consists of rich and coherent narrative texts, along with images that are consistent in characters and style. The story can span up to 25 multimodal sequences, even though training uses a maximum of 10 sequences.
	<img src="assets/teaser.jpg" width="800" alt="Teaser image">


	Overview of SEED-Story. Training Pipeline: In Stage 1, we pre-train an SD-XL-based de-tokenizer to reconstruct images, taking the features of a pre-trained ViT as inputs. In Stage 2, we sample an interleaved image-text sequence of random length and train the MLLM with two objectives: next-word prediction, and image feature regression between the output hidden states of the learnable queries and the ViT features of the target image. In Stage 3, the regressed image features from the MLLM are fed into the de-tokenizer to tune SD-XL, enhancing the consistency of characters and styles in the generated images.
	<img src="assets/pipeline.jpg" width="800" alt="Pipeline image">
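The Stage 2 objective described above can be sketched in a few lines of plain Python. This is only an illustration, not the training code from the repository: the function names, the use of mean squared error for the feature regression, the `image_weight` factor, and the uniform sampling of the sequence length are all assumptions made for the example. The README only states that next-word prediction and image feature regression are combined and that at most 10 sequences are used during training.

```python
import math
import random


def next_token_loss(logits, targets):
    """Mean cross-entropy of each target token id under its row of logits."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(targets)


def feature_regression_loss(query_states, vit_features):
    """Mean squared error between query outputs and target ViT features.

    MSE is an assumption here; the README does not name the regression loss.
    """
    total, count = 0.0, 0
    for pred_vec, target_vec in zip(query_states, vit_features):
        for p, t in zip(pred_vec, target_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def stage2_loss(logits, targets, query_states, vit_features, image_weight=1.0):
    """Text loss plus weighted image loss (image_weight is hypothetical)."""
    text_loss = next_token_loss(logits, targets)
    image_loss = feature_regression_loss(query_states, vit_features)
    return text_loss + image_weight * image_loss


def sample_sequence_length(max_len=10, rng=random):
    """Pick how many image-text pairs a training sample contains.

    Uniform sampling from 1..max_len is an assumption; the README only
    gives the maximum of 10.
    """
    return rng.randint(1, max_len)
```

In a real setup the two terms would be computed on tensors inside the model's forward pass; the sketch only shows how the two objectives add up for one sampled sequence.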


	## Model Weights
	We release the pretrained Tokenizer, the pretrained De-Tokenizer, the pretrained foundation model SEED-X-pretrained,
	the StoryStream instruction-tuned MLLM SEED-Story-George, and the StoryStream-tuned De-Tokenizer Detokenizer-George at [SEED-X-17B Hugging Face](https://huggingface.co/TencentARC/SEED-Story).

	Please download the checkpoints and save them under the folder `./pretrained`.

	You also need to download [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat), and save them under the folder `./pretrained`. Then use the following script to extract the weights of the visual encoder in Qwen-VL-Chat.
	```bash
	python3 src/tools/reload_qwen_vit.py
	```
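Before running the extraction script, it can help to confirm that the third-party models are where the README expects them. The following is a small optional check, not part of the repository: `missing_models` is a hypothetical helper, and it assumes each model is stored in a sub-folder of `./pretrained` named after its Hugging Face repository. The folder names of the SEED-Story checkpoints themselves are not listed here, so pass them in explicitly if you want them checked too.

```python
from pathlib import Path

# The two third-party models the README says to save under ./pretrained.
# Assumption: each sub-folder is named after its Hugging Face repository.
THIRD_PARTY_MODELS = ("stable-diffusion-xl-base-1.0", "Qwen-VL-Chat")


def missing_models(root="./pretrained", expected=THIRD_PARTY_MODELS):
    """Return the expected sub-folder names that do not exist under root."""
    root_path = Path(root)
    return [name for name in expected if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing under ./pretrained:", ", ".join(missing))
    else:
        print("All expected third-party models found under ./pretrained.")
```

If the check reports a missing folder, download that model first; the extraction script presumably reads Qwen-VL-Chat from `./pretrained`.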

	## Citation
	If you find the work helpful, please consider citing:
	```bibtex
	@article{ge2024seed,
	title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},
	author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},
	journal={arXiv preprint arXiv:2404.14396},
	year={2024}
	}
	```

	## License
	`SEED-Story` is licensed under the Apache License Version 2.0 except for the third-party components listed in [License](License_Seed-Story.txt).