Text-to-Image
Diffusers
English

StoryMaker: Towards consistent characters in text-to-image generation

GitHub

StoryMaker is a personalization solution preserves not only the consistency of faces but also clothing, hairstyles and bodies in the multiple characters scene, enabling the potential to make a story consisting of a series of images.

Visualization of generated images by StoryMaker. First three rows tell a story about a day in the life of a "office worker" and the last two rows tell a story about a movie of "Before Sunrise".

Demos

Two Portraits Synthesis

Diverse application

Download

You can directly download the model from Huggingface.

If you cannot access to Huggingface, you can use hf-mirror to download models.

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download RED-AIGC/StoryMaker --local-dir checkpoints --local-dir-use-symlinks False

For face encoder, you need to manually download via this URL to models/buffalo_l as the default link is invalid. Once you have prepared all models, the folder tree should be like:

  .
  β”œβ”€β”€ models
  β”œβ”€β”€ checkpoints/mask.bin
  β”œβ”€β”€ pipeline_sdxl_storymaker.py
  └── README.md

Usage

# !pip install opencv-python transformers accelerate insightface
import diffusers

import cv2
import torch
import numpy as np
from PIL import Image

from insightface.app import FaceAnalysis
from pipeline_sdxl_storymaker import StableDiffusionXLStoryMakerPipeline

# prepare 'buffalo_l' under ./models
app = FaceAnalysis(name='buffalo_l', root='./', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))

# prepare models under ./checkpoints
face_adapter = f'./checkpoints/mask.bin'
image_encoder_path = 'laion/CLIP-ViT-H-14-laion2B-s32B-b79K'  #  from https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K

base_model = 'huaquan/YamerMIX_v11'  # from https://huggingface.co/huaquan/YamerMIX_v11
pipe = StableDiffusionXLStoryMakerPipeline.from_pretrained(
    base_model,
    torch_dtype=torch.float16
)
pipe.cuda()

# load adapter
pipe.load_storymaker_adapter(image_encoder_path, face_adapter, scale=0.8, lora_scale=0.8)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

Then, you can customized your own images

# load an image and mask
face_image = Image.open("examples/ldh.png").convert('RGB')
mask_image = Image.open("examples/ldh_mask.png").convert('RGB')
    
face_info = app.get(cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR))
face_info = sorted(face_info, key=lambda x:(x['bbox'][2]-x['bbox'][0])*(x['bbox'][3]-x['bbox'][1]))[-1] # only use the maximum face

prompt = "a person is taking a selfie, the person is wearing a red hat, and a volcano is in the distance"
n_prompt = "bad quality, NSFW, low quality, ugly, disfigured, deformed"

generator = torch.Generator(device='cuda').manual_seed(666)
for i in range(4):
    output = pipe(
        image=image, mask_image=mask_image, face_info=face_info,
        prompt=prompt,
        negative_prompt=n_prompt,
        ip_adapter_scale=0.8, lora_scale=0.8,
        num_inference_steps=25,
        guidance_scale=7.5,
        height=1280, width=960,
        generator=generator,
    ).images[0]
    output.save(f'examples/results/ldh666_new_{i}.jpg')

Acknowledgements

  • Our work is highly inspired by IP-Adapter and InstantID. Thanks for their great works!
  • Thanks Yamer for developing YamerMIX, we use it as base model in our demo.
Downloads last month
357
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Spaces using RED-AIGC/StoryMaker 9