Consistency

#108
by onesystem - opened

Hi,
This is a great app! Is there any way of keeping consistency over several pages. ie same characters and locations?
Thanks.

Hello, sorry, but the answer is "no" for now. It is the "ordinary" Stable Diffusion AI that generates all the images, and each one is a new one every time. Sure, the story in the prompt can follow a storyline and it can be made in a special style, but that's all. I know the feeling, those AIs can be a little frustrating.

Hello,

To give some insight into how it works, here is the schema for an SD prompt:

stable_diffusion_prompt = prefix_from_preset + user_input + suffix_from_preset + panel_story_invented_by_llm
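
In pseudo-code it looks something like the sketch below (the function and example values are purely illustrative, not the app's actual code or presets):

```python
# Illustrative sketch only: names and example strings are placeholders,
# not the app's real identifiers or presets.
def build_sd_prompt(prefix_from_preset: str, user_input: str,
                    suffix_from_preset: str, panel_story: str) -> str:
    # The four parts are simply concatenated into one Stable Diffusion prompt.
    parts = [prefix_from_preset, user_input, suffix_from_preset, panel_story]
    return ", ".join(p.strip() for p in parts if p.strip())

prompt = build_sd_prompt(
    "american comic, 1950 style",            # prefix_from_preset (assumed example)
    "a happy dog playing with a ball",       # user_input
    "detailed, colorful",                    # suffix_from_preset (assumed example)
    "Close-up of a dog's face, looking excited and happy",  # panel_story_invented_by_llm
)
print(prompt)
```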

More precisely, the LLM generates drawing instructions (not the captions we can see in the rectangles), like this:

[
    {
        "panel": 0,
        "instructions": "Close-up of a dog's face, looking excited and happy, tongue out and ears perk",
        "caption": "Woof woof!"
    },
    {
        "panel": 1,
        "instructions": "The dog jumps up and catches a ball in mid-air, with a big smile on its face.",
        "caption": "Catching some Zs."
    },
    {
        "panel": 2,
        "instructions": "The dog runs around in circles, holding the ball in its mouth, with a giant g",
        "caption": "Getting a few laughs."
    },
    {
        "panel": 3,
        "instructions": "The dog flops down on its back, panting, still holding the ball, with a satis",
        "caption": "Paws-itive vibes only."
    }
]

The instructions are cropped due to the limited size of the SD prompt (see below).

However, it is currently difficult to maintain context, for the following reasons:

Limited token window

The prompt is limited to a window of 77 tokens (the limit of the CLIP text encoder used by Stable Diffusion), so we can't write a lot of instructions in it.

It is not 1 character = 1 token, but still, it is pretty small.
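
For reference, here is one way to check how many tokens a prompt actually uses with the CLIP tokenizer that Stable Diffusion relies on (just an illustration, not part of the app):

```python
from transformers import CLIPTokenizer

# Tokenizer used by Stable Diffusion's text encoder (77-token window).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompt = "american comic, 1950 style, a happy dog playing with a ball, detailed, colorful"
token_ids = tokenizer(prompt).input_ids  # includes start/end special tokens

print(len(prompt), "characters ->", len(token_ids), "tokens")
print("fits in the 77-token window:", len(token_ids) <= tokenizer.model_max_length)
```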

The style instructions eat up the token window

A big chunk of these 77 tokens is used to keep the style consistent ("1950 comic" etc.).
To make sure there is still room for the LLM-generated part (even if the user types a long sentence, or the preset suffix is long), I decided to limit the size of the user_input.
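
As a rough sketch of that idea (the limit and names below are illustrative assumptions, not the app's actual values), the user input can be clamped before the prompt is assembled, so the style prefix/suffix and the LLM instructions keep their share of the token window:

```python
# Illustrative sketch: clamp the user input so the style prefix/suffix and the
# LLM-generated instructions still fit in the prompt.
MAX_USER_INPUT_CHARS = 60  # assumed limit, not the app's real value

def clamp_user_input(user_input: str, max_chars: int = MAX_USER_INPUT_CHARS) -> str:
    user_input = user_input.strip()
    return user_input if len(user_input) <= max_chars else user_input[:max_chars].rstrip()

print(clamp_user_input("a very long and extremely detailed description of a dog " * 3))
```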

No LoRA

It doesn't use a LoRA, so sometimes a lot of words are required to create the style.

The problem with relying on words instead of a LoRA is that those words risk appearing somewhere in the image, either literally (e.g. the words "character", "comic", "style of" etc. can appear in a speech bubble) or in the form of objects (e.g. asking for a "single comic panel" risks a metallic "panel" object showing up in the image).

Issues with instruction following

The common implementation of Stable Diffusion is not necessarily best suited to following instructions.
Asking for a specific number of objects, positions, camera angles (and text! especially text!) often leads to approximate results.

Using another architecture such as Dense Diffusion would perhaps solve some of those issues (same for text; there are probably other open-source SD models that do the job better).

So, as a "solution" (or rather, mitigation) @onesystem and @hullajump I can propose for now to try to be a more specific into what you want (color, clothes, age etc) but also concise, that way you know it will be added to each of the panel
