Visualize by Token as opposed to Word in the input

#3
by torridgristle - opened


I know that caterpillargrillman isn't a token in CLIP (and, for a more realistic example, neither is caterpillar), and individual sub-word pieces like the initial cat- token can behave differently from the full word they came from (caterpillar); in some situations the cat- piece produces cat fur instead of a caterpillar.

Thus, it would be more useful to see the attention per token rather than per word. It would also be interesting to see what the initial "start of text" token contributes, since the model doesn't generate proper images without it.

The additional tokens after the end of the prompt, the end-of-text tokens that pad the sequence out to 77, can also affect the resulting image. If you swap the last several of them for the tail of another encoded prompt, the output picks up some attributes of that other prompt, even though those positions are just end-of-text padding.
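The 77-slot layout and the tail-swap described above can be sketched in plain Python. The ids 49406 and 49407 are CLIP's actual start-of-text and end-of-text ids; the content ids below are placeholders, and in practice the swap would be done on the encoded prompt rather than raw ids:

```python
# CLIP's real special-token ids and context length; content ids are made up.
SOT, EOT, CONTEXT_LEN = 49406, 49407, 77

def lay_out(content_ids):
    """[SOT] + prompt tokens + EOT padding out to 77 slots."""
    pad = CONTEXT_LEN - 1 - len(content_ids)
    return [SOT] + list(content_ids) + [EOT] * pad

prompt_a = lay_out([320, 2368])        # placeholder ids for one prompt
prompt_b = lay_out([589, 1125, 7100])  # placeholder ids for another prompt

# The swap described above: replace the last n padding slots of prompt A
# with the last n positions of prompt B's sequence.
n = 10
hybrid = prompt_a[:-n] + prompt_b[-n:]
```

This keeps the hybrid sequence at exactly 77 positions, so it can be fed to the text encoder unchanged.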

Perhaps the padding tokens should be visualized as a sum or average of their attention maps, so the tool isn't always plotting 77 maps.
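A minimal sketch of that averaging idea, in plain Python with hypothetical data (one small 2D attention map per token position, `n_real` real tokens followed by padding):

```python
def collapse_padding(maps, n_real):
    """Keep one attention map per real token; average the maps of all
    padding positions into a single extra map."""
    kept = maps[:n_real]
    pad = maps[n_real:]
    if not pad:
        return kept
    h, w = len(pad[0]), len(pad[0][0])
    avg = [[sum(m[r][c] for m in pad) / len(pad) for c in range(w)]
           for r in range(h)]
    return kept + [avg]

# Toy example: 5 token positions with 2x2 maps, the last 3 being padding.
maps = [[[float(i), 0.0], [0.0, float(i)]] for i in range(5)]
out = collapse_padding(maps, 2)  # 2 real maps + 1 averaged padding map
```

The same reduction works for a sum instead of an average by dropping the division.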

Or remove the extra padding so there's just the one end-of-text token. The model still generates a sensible image without all 77 tokens, as long as the initial start-of-text token is kept and the same number of tokens is removed from the end of the unconditional encoding. That way everything that can influence the image can be analyzed, with no hidden contribution from the padding end-of-text tokens, which carry forward some of the preceding tokens' concepts when encoded by CLIP.
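The trimming idea can be sketched as follows, again with hypothetical id sequences (49406/49407 are CLIP's real start/end-of-text ids): drop trailing end-of-text padding from the conditional sequence, keep a single end-of-text token, and cut the unconditional sequence to the same length.

```python
SOT, EOT = 49406, 49407

def trim_padding(cond_ids, uncond_ids):
    """Strip trailing EOT padding from the conditional sequence, keeping
    one EOT, then truncate the unconditional sequence to match."""
    n = len(cond_ids)
    while n > 1 and cond_ids[n - 1] == EOT:
        n -= 1
    n += 1  # retain a single end-of-text token
    return cond_ids[:n], uncond_ids[:n]

cond = [SOT, 320, 2368] + [EOT] * 74   # placeholder prompt, 77 slots
uncond = [SOT] + [EOT] * 76            # empty prompt, 77 slots
cond_t, uncond_t = trim_padding(cond, uncond)
```

After trimming, both sequences have the same (much shorter) length, so every remaining position is one that visibly contributes to the image.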

Good idea. I'll add visualizing by token as an option sometime in December.

The special tokens are indeed worth visualizing too. I'll add that as well, and update this thread when it's done.
