Proposed change to CLIP text encoding for tag style

#9
by torridgristle - opened

tl;dr: build attention masks from the commas to isolate each tag, so concepts are blended in the image by Stable Diffusion rather than in CLIP's text encoding.

I strongly believe this would help disentangle features / tags.

CLIP's default attention masking for text is causal: a token can be influenced only by itself and previous tokens, never by following tokens.
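For reference, that default causal mask can be sketched in a few lines of PyTorch. This uses a boolean attend/blocked convention for clarity; real implementations typically use an additive mask with `-inf` in the blocked positions:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # CLIP's default text mask: position i may attend to positions <= i.
    # True = attention allowed, False = blocked.
    return torch.ones(seq_len, seq_len, dtype=torch.bool).tril()

m = causal_mask(4)
# Row 2 is [True, True, True, False]: token 2 sees tokens 0-2 only.
```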

However, the prompt style of comma-separated tags seems ill-suited to this, since each tag is an isolated concept on its own. Under causal masking, earlier tags and the commas between them influence later tags, so the order of the tags changes the result.

Skipping a few layers at the end may reduce how much the tokens / words are mixed together, but it doesn't change the way they're mixed: each token is still affected by every previous token.

I propose a new attention masking method to replace the default, where each comma-delimited group of tokens is masked so that its tokens can only modify their own group. Stable Diffusion can then blend the concepts together in the image, rather than CLIP blending them in the encoded prompt.

The commas themselves would also be removed from the tokens actually used when encoding the prompt; they would remain in the prompt only to define the attention masks.

Within each comma-separated tag, the usual causal pattern of blocking out following tokens should restart, rather than letting all tokens in a group modify each other regardless of order or arrangement. That preserves the interpretation of words that are split into multiple tokens.
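A minimal sketch of that block-causal mask, assuming a hypothetical comma token id (check your tokenizer's vocab for the real one) and the same boolean attend/blocked convention as above. This is an illustration of the proposal, not CLIP's actual code:

```python
import torch

COMMA_ID = 267  # assumed id for "," in CLIP's BPE vocab; verify for your tokenizer

def per_tag_causal_mask(token_ids: list[int], comma_id: int = COMMA_ID) -> torch.Tensor:
    """Causal attention restarted at each comma: a token may attend only to
    earlier tokens within its own comma-delimited tag, so tags cannot
    influence one another."""
    n = len(token_ids)
    # Assign a group index to each position; bump the counter after each comma.
    groups, g = [], 0
    for tid in token_ids:
        groups.append(g)
        if tid == comma_id:
            g += 1
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(i + 1):           # causal: only j <= i
            if groups[i] == groups[j]:   # and only within the same tag group
                mask[i, j] = True
    return mask

# Two tags separated by a comma: positions 3-4 cannot see positions 0-2.
m = per_tag_causal_mask([10, 11, COMMA_ID, 12, 13])
```

The comma position itself still sits in the first group here; per the proposal it would be dropped from the encoded tokens anyway, with the commas kept only to build the mask.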

Apologies if this is rambling; the idea has been in my head for a while.

Or, instead of removing the commas, replace them with tokens and then strip them out at the end. That way it isn't only the first tag that gets modified by the start-of-text token. The final output definitely needs to keep that first start-of-text token, though.

Or adjust the attention masks so the start-of-text token is ignored and doesn't modify anything, though I'm not sure how well that would work.

The padding tokens at the end also just carry forward the values of the previous tokens, and they're not necessary for Stable Diffusion to generate a sensible image. They are useful for the unconditional encoding, though.
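Dropping those padding positions from the conditioning could look like the slice below, which keeps everything up to and including the end-of-text token. The end-of-text id shown is the one commonly used by CLIP's BPE tokenizer, but verify it for your model:

```python
import torch

EOT_ID = 49407  # assumed end-of-text id for CLIP's BPE tokenizer; verify for your model

def trim_padding(token_ids: torch.Tensor, embeddings: torch.Tensor,
                 eot_id: int = EOT_ID) -> torch.Tensor:
    """Keep embeddings up to and including the first end-of-text token,
    dropping the trailing padding positions that merely echo earlier values."""
    eot_pos = (token_ids == eot_id).nonzero()[0].item()
    return embeddings[: eot_pos + 1]

# e.g. a 4-token sequence whose last position is padding (repeated EOT id)
ids = torch.tensor([49406, 10, 49407, 49407])
emb = torch.randn(4, 768)
trimmed = trim_padding(ids, emb)  # shape (3, 768)
```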
