Upload sd_token_similarity_calculator.ipynb
sd_token_similarity_calculator.ipynb  +332 -26
CHANGED
|
@@ -17,12 +17,23 @@
|
|
| 17 |
{
|
| 18 |
"cell_type": "markdown",
|
| 19 |
"source": [
|
| 20 |
-
"This Notebook is a Stable-diffusion tool which allows you to find similiar tokens from the SD 1.5 vocab.json that you can use for text-to-image generation. Try this Free online SD 1.5 generator with the results: https://perchance.org/fusion-ai-image-generator"
|
| 21 |
],
|
| 22 |
"metadata": {
|
| 23 |
"id": "L7JTcbOdBPfh"
|
| 24 |
}
|
| 25 |
},
|
| 26 |
{
|
| 27 |
"cell_type": "code",
|
| 28 |
"source": [
|
|
@@ -88,6 +99,144 @@
|
|
| 88 |
"execution_count": null,
|
| 89 |
"outputs": []
|
| 90 |
},
|
| 91 |
{
|
| 92 |
"cell_type": "code",
|
| 93 |
"source": [
|
|
@@ -107,22 +256,10 @@
|
|
| 107 |
"#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID."
|
| 108 |
],
|
| 109 |
"metadata": {
|
| 110 |
-
"id": "RPdkYzT2_X85"
|
| 111 |
-
"colab": {
|
| 112 |
-
"base_uri": "https://localhost:8080/"
|
| 113 |
-
},
|
| 114 |
-
"outputId": "86f2f01e-6a04-4292-cee7-70fd8398e07f"
|
| 115 |
},
|
| 116 |
"execution_count": null,
|
| 117 |
-
"outputs": [
|
| 118 |
-
{
|
| 119 |
-
"output_type": "stream",
|
| 120 |
-
"name": "stdout",
|
| 121 |
-
"text": [
|
| 122 |
-
"[49406, 8922, 49407]\n"
|
| 123 |
-
]
|
| 124 |
-
}
|
| 125 |
-
]
|
| 126 |
},
|
| 127 |
{
|
| 128 |
"cell_type": "code",
|
|
@@ -353,21 +490,20 @@
|
|
| 353 |
"source": [
|
| 354 |
"\n",
|
| 355 |
"\n",
|
| 356 |
-
"
|
| 357 |
"\n",
|
| 358 |
"Similiar vectors = similiar output in the SD 1.5 / SDXL / FLUX model\n",
|
| 359 |
"\n",
|
| 360 |
-
"CLIP converts the prompt text to vectors (“tensors”) , with float32 values usually ranging from -1 to 1
|
| 361 |
"\n",
|
| 362 |
-
"Dimensions are [ 1x768 ] tensors for SD 1.5 , and a [ 1x768 , 1x1024 ] tensor for SDXL and FLUX.\n",
|
| 363 |
"\n",
|
| 364 |
"The SD models and FLUX converts these vectors to an image.\n",
|
| 365 |
"\n",
|
| 366 |
-
"This notebook takes an input string , tokenizes it and matches the first token against the 49407 token vectors in the vocab.json :
|
| 367 |
"\n",
|
| 368 |
"It finds the “most similiar tokens” in the list. Similarity is the theta angle between the token vectors.\n",
|
| 369 |
"\n",
|
| 370 |
-
"\n",
|
| 371 |
"<div>\n",
|
| 372 |
"<img src=\"https://huggingface.co/datasets/codeShare/sd_tokens/resolve/main/cosine.jpeg\" width=\"300\"/>\n",
|
| 373 |
"</div>\n",
|
|
@@ -376,19 +512,189 @@
|
|
| 376 |
"\n",
|
| 377 |
"Negative similarity is also possible.\n",
|
| 378 |
"\n",
|
| 379 |
-
"
|
| 380 |
"\n",
|
| 381 |
-
"
|
| 382 |
"\n",
|
| 383 |
-
"
|
| 384 |
"\n",
|
| 385 |
-
"
|
| 386 |
"\n",
|
| 387 |
"So TLDR; vector direction = “what to generate” , vector magnitude = “prompt weights”\n",
|
| 388 |
"\n",
|
| 389 |
-
"
|
| 390 |
"\n",
|
| 391 |
-
"
|
| 392 |
],
|
| 393 |
"metadata": {
|
| 394 |
"id": "njeJx_nSSA8H"
|
|
|
|
| 17 |
{
|
| 18 |
"cell_type": "markdown",
|
| 19 |
"source": [
|
| 20 |
+
"This Notebook is a Stable-diffusion tool which allows you to find similiar tokens from the SD 1.5 vocab.json that you can use for text-to-image generation. Try this Free online SD 1.5 generator with the results: https://perchance.org/fusion-ai-image-generator\n",
|
| 21 |
+
"\n",
|
| 22 |
+
"Scroll to the bottom of the notebook to see the guide for how this works."
|
| 23 |
],
|
| 24 |
"metadata": {
|
| 25 |
"id": "L7JTcbOdBPfh"
|
| 26 |
}
|
| 27 |
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "code",
|
| 30 |
+
"source": [],
|
| 31 |
+
"metadata": {
|
| 32 |
+
"id": "PBwVIuAjEdHA"
|
| 33 |
+
},
|
| 34 |
+
"execution_count": null,
|
| 35 |
+
"outputs": []
|
| 36 |
+
},
|
| 37 |
{
|
| 38 |
"cell_type": "code",
|
| 39 |
"source": [
|
|
|
|
| 99 |
"execution_count": null,
|
| 100 |
"outputs": []
|
| 101 |
},
|
| 102 |
+
{
|
| 103 |
+
"cell_type": "code",
|
| 104 |
+
"source": [
|
| 105 |
+
"# @title ⚡ Get similiar tokens\n",
|
| 106 |
+
"from transformers import AutoTokenizer\n",
|
| 107 |
+
"tokenizer = AutoTokenizer.from_pretrained(\"openai/clip-vit-large-patch14\", clean_up_tokenization_spaces = False)\n",
|
| 108 |
+
"\n",
|
| 109 |
+
"prompt= \"banana\" # @param {type:'string'}\n",
|
| 110 |
+
"\n",
|
| 111 |
+
"tokenizer_output = tokenizer(text = prompt)\n",
|
| 112 |
+
"input_ids = tokenizer_output['input_ids']\n",
|
| 113 |
+
"print(input_ids)\n",
|
| 114 |
+
"\n",
|
| 115 |
+
"\n",
|
| 116 |
+
"#The prompt will be enclosed with the <|start-of-text|> and <|end-of-text|> tokens, which is why output will be [49406, ... , 49407].\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID.\n",
|
| 119 |
+
"\n",
|
| 120 |
+
"id_A = input_ids[1]\n",
|
| 121 |
+
"A = token[id_A]\n",
|
| 122 |
+
"_A = LA.vector_norm(A, ord=2)\n",
|
| 123 |
+
"\n",
|
| 124 |
+
"#if no imput exists we just randomize the entire thing\n",
|
| 125 |
+
"if (prompt == \"\"):\n",
|
| 126 |
+
" id_A = -1\n",
|
| 127 |
+
" print(\"Tokenized prompt tensor A is a random valued tensor with no ID\")\n",
|
| 128 |
+
" R = torch.rand(768)\n",
|
| 129 |
+
" _R = LA.vector_norm(R, ord=2)\n",
|
| 130 |
+
" A = R*(_A/_R)\n",
|
| 131 |
+
"\n",
|
| 132 |
+
"\n",
|
| 133 |
+
"mix_with = \"\" # @param {\"type\":\"string\",\"placeholder\":\"(optional) write something else\"}\n",
|
| 134 |
+
"mix_method = \"None\" # @param [\"None\" , \"Average\", \"Subtract\"] {allow-input: true}\n",
|
| 135 |
+
"w = 0.5 # @param {type:\"slider\", min:0, max:1, step:0.01}\n",
|
| 136 |
+
"\n",
|
| 137 |
+
"tokenizer_output = tokenizer(text = mix_with)\n",
|
| 138 |
+
"input_ids = tokenizer_output['input_ids']\n",
|
| 139 |
+
"id_C = input_ids[1]\n",
|
| 140 |
+
"C = token[id_C]\n",
|
| 141 |
+
"_C = LA.vector_norm(C, ord=2)\n",
|
| 142 |
+
"\n",
|
| 143 |
+
"#if no imput exists we just randomize the entire thing\n",
|
| 144 |
+
"if (mix_with == \"\"):\n",
|
| 145 |
+
" id_C = -1\n",
|
| 146 |
+
" print(\"Tokenized prompt 'mix_with' tensor C is a random valued tensor with no ID\")\n",
|
| 147 |
+
" R = torch.rand(768)\n",
|
| 148 |
+
" _R = LA.vector_norm(R, ord=2)\n",
|
| 149 |
+
" C = R*(_C/_R)\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"if (mix_method == \"None\"):\n",
|
| 152 |
+
" print(\"No operation\")\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"if (mix_method == \"Average\"):\n",
|
| 155 |
+
" A = w*A + (1-w)*C\n",
|
| 156 |
+
" _A = LA.vector_norm(A, ord=2)\n",
|
| 157 |
+
" print(\"Tokenized prompt tensor A has been recalculated as A = w*A + (1-w)*C , where C is the tokenized prompt 'mix_with' tensor C\")\n",
|
| 158 |
+
"\n",
|
| 159 |
+
"if (mix_method == \"Subtract\"):\n",
|
| 160 |
+
" tmp = (A/_A) - (C/_C)\n",
|
| 161 |
+
" _tmp = LA.vector_norm(tmp, ord=2)\n",
|
| 162 |
+
" A = tmp*((w*_A + (1-w)*_C)/_tmp)\n",
|
| 163 |
+
" _A = LA.vector_norm(A, ord=2)\n",
|
| 164 |
+
" print(\"Tokenized prompt tensor A has been recalculated as A = (w*_A + (1-w)*_C) * norm(w*A - (1-w)*C) , where C is the tokenized prompt 'mix_with' tensor C\")\n",
|
| 165 |
+
"\n",
|
| 166 |
+
"#OPTIONAL : Add/subtract + normalize above result with another token. Leave field empty to get a random value tensor\n",
|
| 167 |
+
"\n",
|
| 168 |
+
"dots = torch.zeros(NUM_TOKENS)\n",
|
| 169 |
+
"for index in range(NUM_TOKENS):\n",
|
| 170 |
+
" id_B = index\n",
|
| 171 |
+
" B = token[id_B]\n",
|
| 172 |
+
" _B = LA.vector_norm(B, ord=2)\n",
|
| 173 |
+
" result = torch.dot(A,B)/(_A*_B)\n",
|
| 174 |
+
" #result = absolute_value(result.item())\n",
|
| 175 |
+
" result = result.item()\n",
|
| 176 |
+
" dots[index] = result\n",
|
| 177 |
+
"\n",
|
| 178 |
+
"name_A = \"A of random type\"\n",
|
| 179 |
+
"if (id_A>-1):\n",
|
| 180 |
+
" name_A = vocab[id_A]\n",
|
| 181 |
+
"\n",
|
| 182 |
+
"name_C = \"token C of random type\"\n",
|
| 183 |
+
"if (id_C>-1):\n",
|
| 184 |
+
" name_C = vocab[id_C]\n",
|
| 185 |
+
"\n",
|
| 186 |
+
"\n",
|
| 187 |
+
"sorted, indices = torch.sort(dots,dim=0 , descending=True)\n",
|
| 188 |
+
"#----#\n",
|
| 189 |
+
"if (mix_method == \"Average\"):\n",
|
| 190 |
+
" print(f'Calculated all cosine-similarities between the average of token {name_A} and {name_C} with Id_A = {id_A} and mixed Id_C = {id_C} as a 1x{sorted.shape[0]} tensor')\n",
|
| 191 |
+
"if (mix_method == \"Subtract\"):\n",
|
| 192 |
+
" print(f'Calculated all cosine-similarities between the subtract of token {name_A} and {name_C} with Id_A = {id_A} and mixed Id_C = {id_C} as a 1x{sorted.shape[0]} tensor')\n",
|
| 193 |
+
"if (mix_method == \"None\"):\n",
|
| 194 |
+
" print(f'Calculated all cosine-similarities between the token {name_A} with Id_A = {id_A} with the the rest of the {NUM_TOKENS} tokens as a 1x{sorted.shape[0]} tensor')\n",
|
| 195 |
+
"\n",
|
| 196 |
+
"#Produce a list id IDs that are most similiar to the prompt ID at positiion 1 based on above result\n",
|
| 197 |
+
"\n",
|
| 198 |
+
"list_size = 100 # @param {type:'number'}\n",
|
| 199 |
+
"print_ID = False # @param {type:\"boolean\"}\n",
|
| 200 |
+
"print_Similarity = True # @param {type:\"boolean\"}\n",
|
| 201 |
+
"print_Name = True # @param {type:\"boolean\"}\n",
|
| 202 |
+
"print_Divider = True # @param {type:\"boolean\"}\n",
|
| 203 |
+
"\n",
|
| 204 |
+
"\n",
|
| 205 |
+
"if (print_Divider):\n",
|
| 206 |
+
" print('//---//') # % value\n",
|
| 207 |
+
"\n",
|
| 208 |
+
"print('') # % value\n",
|
| 209 |
+
"print('Here is the result : ') # % value\n",
|
| 210 |
+
"print('') # % value\n",
|
| 211 |
+
"\n",
|
| 212 |
+
"for index in range(list_size):\n",
|
| 213 |
+
" id = indices[index].item()\n",
|
| 214 |
+
" if (print_Name):\n",
|
| 215 |
+
" print(f'{vocab[id]}') # vocab item\n",
|
| 216 |
+
" if (print_ID):\n",
|
| 217 |
+
" print(f'ID = {id}') # IDs\n",
|
| 218 |
+
" if (print_Similarity):\n",
|
| 219 |
+
" print(f'similiarity = {round(sorted[index].item()*100,2)} %') # % value\n",
|
| 220 |
+
" if (print_Divider):\n",
|
| 221 |
+
" print('--------')\n",
|
| 222 |
+
"\n",
|
| 223 |
+
"#Print the sorted list from above result"
|
| 224 |
+
],
|
| 225 |
+
"metadata": {
|
| 226 |
+
"id": "iWeFnT1gAx6A"
|
| 227 |
+
},
|
| 228 |
+
"execution_count": null,
|
| 229 |
+
"outputs": []
|
| 230 |
+
},
|
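Note: the cell above relies on `token`, `vocab` and `NUM_TOKENS` being defined by an earlier cell that is not shown in this diff. Below is a minimal sketch of what such a setup cell could look like, assuming the CLIP token embeddings were already saved as a [NUM_TOKENS, 768] tensor in a .pt file and that vocab.json maps token strings to IDs; the file names are placeholders, not taken from this repo.

```python
# Sketch of the assumed setup cell: defines token, vocab and NUM_TOKENS for the cell above.
# File names are placeholders; the real notebook may load these differently.
import json
import torch
from torch import linalg as LA   # LA.vector_norm is what the cell above calls

NUM_TOKENS = 49407

# A [NUM_TOKENS, 768] float32 tensor holding one CLIP token embedding per vocab entry.
token = torch.load('sd15_clip_token_embeddings.pt', map_location='cpu')

# vocab.json maps token string -> id; flip it so vocab[id] returns the token string.
with open('vocab.json', 'r', encoding='utf-8') as f:
    vocab = {v: k for k, v in json.load(f).items()}
```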
| 231 |
+
{
|
| 232 |
+
"cell_type": "markdown",
|
| 233 |
+
"source": [
|
| 234 |
+
"# ↓ Sub modules (use these to build your own projects) ↓"
|
| 235 |
+
],
|
| 236 |
+
"metadata": {
|
| 237 |
+
"id": "_d8WtPgtAymM"
|
| 238 |
+
}
|
| 239 |
+
},
|
| 240 |
{
|
| 241 |
"cell_type": "code",
|
| 242 |
"source": [
|
|
|
|
| 256 |
"#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID."
|
| 257 |
],
|
| 258 |
"metadata": {
|
| 259 |
+
"id": "RPdkYzT2_X85"
|
| 260 |
},
|
| 261 |
"execution_count": null,
|
| 262 |
+
"outputs": []
|
| 263 |
},
|
| 264 |
{
|
| 265 |
"cell_type": "code",
|
|
|
|
| 490 |
"source": [
|
| 491 |
"\n",
|
| 492 |
"\n",
|
| 493 |
+
"# How does this notebook work?\n",
|
| 494 |
"\n",
|
| 495 |
"Similiar vectors = similiar output in the SD 1.5 / SDXL / FLUX model\n",
|
| 496 |
"\n",
|
| 497 |
+
"CLIP converts the prompt text to vectors (“tensors”) , with float32 values usually ranging from -1 to 1.\n",
|
| 498 |
"\n",
|
| 499 |
+
"Dimensions are \\[ 1x768 ] tensors for SD 1.5 , and a \\[ 1x768 , 1x1024 ] tensor for SDXL and FLUX.\n",
|
| 500 |
"\n",
|
| 501 |
"The SD models and FLUX converts these vectors to an image.\n",
|
| 502 |
"\n",
|
| 503 |
+
"This notebook takes an input string , tokenizes it and matches the first token against the 49407 token vectors in the vocab.json : [https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main/tokenizer](https://www.google.com/url?q=https%3A%2F%2Fhuggingface.co%2Fblack-forest-labs%2FFLUX.1-dev%2Ftree%2Fmain%2Ftokenizer)\n",
|
| 504 |
"\n",
|
| 505 |
"It finds the “most similiar tokens” in the list. Similarity is the theta angle between the token vectors.\n",
|
| 506 |
"\n",
|
|
|
|
| 507 |
"<div>\n",
|
| 508 |
"<img src=\"https://huggingface.co/datasets/codeShare/sd_tokens/resolve/main/cosine.jpeg\" width=\"300\"/>\n",
|
| 509 |
"</div>\n",
|
|
|
|
| 512 |
"\n",
|
| 513 |
"Negative similarity is also possible.\n",
|
| 514 |
"\n",
|
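In code, the "theta angle" similarity described above is the cosine of the angle between two token vectors. A minimal sketch follows; the `token` tensor and the IDs come from the notebook's earlier cells, and the helper name is mine, not the notebook's.

```python
# Cosine similarity between two token vectors: cos(theta) = (a . b) / (|a| * |b|)
import torch
from torch import linalg as LA

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    return (torch.dot(a, b) / (LA.vector_norm(a, ord=2) * LA.vector_norm(b, ord=2))).item()

# e.g. sim = cosine_similarity(token[id_A], token[id_B])
# 1.0 = same direction, 0.0 = orthogonal, negative values = opposite-leaning directions.
```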
| 515 |
+
"# How can I use it?\n",
|
| 516 |
"\n",
|
| 517 |
+
"If you are bored of prompting “girl” and want something similiar you can run this notebook and use the “chick” token at 21.88% similarity , for example\n",
|
| 518 |
"\n",
|
| 519 |
+
"You can also run a mixed search , like “cute+girl”/2 , where for example “kpop” has a 16.71% similarity\n",
|
| 520 |
"\n",
|
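A "mixed search" like the one above simply averages two token vectors before ranking the whole vocab by cosine similarity. A rough sketch, reusing the cosine_similarity helper sketched earlier and the notebook's tokenizer / token / vocab / NUM_TOKENS variables:

```python
# Sketch of a mixed search: score every vocab entry against the average of two token vectors.
import torch

ids = tokenizer(text="cute girl")['input_ids']     # expected [49406, <cute>, <girl>, 49407]
mixed = 0.5 * token[ids[1]] + 0.5 * token[ids[2]]   # "cute+girl"/2

dots = torch.tensor([cosine_similarity(mixed, token[i]) for i in range(NUM_TOKENS)])
sims, indices = torch.sort(dots, descending=True)
for rank in range(10):
    print(vocab[indices[rank].item()], '=', round(sims[rank].item() * 100, 2), '%')
```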
| 521 |
+
"There are some strange tokens further down the list you go. Example: tokens similiar to the token \"pewdiepie</w>\" (yes this is an actual token that exists in CLIP)\n",
|
| 522 |
+
"\n",
|
| 523 |
+
"<div>\n",
|
| 524 |
+
"<img src=\"https://lemmy.world/pictrs/image/a1cd284e-3341-4284-9949-5f8b58d3bd1f.jpeg\" width=\"300\"/>\n",
|
| 525 |
+
"</div>\n",
|
| 526 |
+
"\n",
|
| 527 |
+
"Each of these correspond to a unique 1x768 token vector.\n",
|
| 528 |
+
"\n",
|
| 529 |
+
"The higher the ID value , the less often the token appeared in the CLIP training data.\n",
|
| 530 |
+
"\n",
|
| 531 |
+
"To reiterate; this is the CLIP model training data , not the SD-model training data.\n",
|
| 532 |
+
"\n",
|
| 533 |
+
"So for certain models , tokens with high ID can give very consistent results , if the SD model is trained to handle them.\n",
|
| 534 |
+
"\n",
|
| 535 |
+
"Example of this can be anime models , where japanese artist names can affect the output greatly. \n",
|
| 536 |
+
"\n",
|
| 537 |
+
"Tokens with high ID will often give the \"fun\" output when used in very short prompts.\n",
|
| 538 |
+
"\n",
|
| 539 |
+
"# What about token vector length?\n",
|
| 540 |
+
"\n",
|
| 541 |
+
"If you are wondering about token magnitude,\n",
|
| 542 |
+
"Prompt weights like (banana:1.2) will scale the magnitude of the corresponding 1x768 tensor(s) by 1.2 . So thats how prompt token magnitude works.\n",
|
| 543 |
+
"\n",
|
| 544 |
+
"Source: [https://huggingface.co/docs/diffusers/main/en/using-diffusers/weighted\\_prompts](https://www.google.com/url?q=https%3A%2F%2Fhuggingface.co%2Fdocs%2Fdiffusers%2Fmain%2Fen%2Fusing-diffusers%2Fweighted_prompts)\\*\n",
|
| 545 |
"\n",
|
| 546 |
"So TLDR; vector direction = “what to generate” , vector magnitude = “prompt weights”\n",
|
| 547 |
"\n",
|
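As a concrete illustration of the magnitude point above, this is roughly what a (banana:1.2) prompt weight does to the token vector. It is a sketch of the idea, not the actual code of any particular UI; tokenizer and token come from the notebook's earlier cells.

```python
# Sketch: a prompt weight like (banana:1.2) scales the token vector's magnitude by 1.2,
# while its direction ("what to generate") stays the same.
import torch
from torch import linalg as LA

ids = tokenizer(text="banana")['input_ids']        # [49406, <banana>, 49407]
banana = token[ids[1]]
weighted = 1.2 * banana

print(LA.vector_norm(banana, ord=2).item(), '->', LA.vector_norm(weighted, ord=2).item())
print(torch.allclose(weighted / LA.vector_norm(weighted, ord=2),
                     banana / LA.vector_norm(banana, ord=2)))   # True: direction unchanged
```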
| 548 |
+
"# How prompting works (technical summary)\n",
|
| 549 |
+
"\n",
|
| 550 |
+
" 1. There is no correct way to prompt.\n",
|
| 551 |
+
"\n",
|
| 552 |
+
"2. Stable diffusion reads your prompt left to right, one token at a time, finding association _from_ the previous token _to_ the current token _and to_ the image generated thus far (Cross Attention Rule)\n",
|
| 553 |
+
"\n",
|
| 554 |
+
"3. Stable Diffusion is an optimization problem that seeks to maximize similarity to prompt and minimize similarity to negatives (Optimization Rule)\n",
|
| 555 |
+
"\n",
|
| 556 |
+
"Reference material (covers entire SD , so not good source material really, but the info is there) : https://youtu.be/sFztPP9qPRc?si=ge2Ty7wnpPGmB0gi\n",
|
| 557 |
+
"\n",
|
| 558 |
+
"# The SD pipeline\n",
|
| 559 |
+
"\n",
|
| 560 |
+
"For every step (20 in total by default) for SD1.5 :\n",
|
| 561 |
+
"\n",
|
| 562 |
+
"1. Prompt text => (tokenizer)\n",
|
| 563 |
+
"2. => Nx768 token vectors =>(CLIP model) =>\n",
|
| 564 |
+
"3. 1x768 encoding => ( the SD model / Unet ) =>\n",
|
| 565 |
+
"4. => _Desired_ image per Rule 3 => ( sampler)\n",
|
| 566 |
+
"5. => Paint a section of the image => (image)\n",
|
| 567 |
+
"\n",
|
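The first stages of that pipeline (prompt text → token IDs → CLIP hidden states) can be reproduced directly with transformers; the U-Net and sampler stages are not shown here. A sketch using the same clip-vit-large-patch14 model the notebook already loads:

```python
# Sketch of pipeline steps 1-3: prompt text -> token ids -> per-token vectors -> text encoding.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(text="photo of a banana", return_tensors="pt")
with torch.no_grad():
    out = text_model(**inputs)

print(out.last_hidden_state.shape)  # [1, N, 768] per-token hidden states (the text conditioning)
print(out.pooler_output.shape)      # [1, 768] pooled text_encoding discussed later in this guide
```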
| 568 |
+
"# Disclaimer /Trivia\n",
|
| 569 |
+
"\n",
|
| 570 |
+
"This notebook should be seen as a \"dictionary search tool\" for the vocab.json , which is the same for SD1.5 , SDXL and FLUX. Feel free to verify this by checking the 'tokenizer' folder under each model.\n",
|
| 571 |
+
"\n",
|
| 572 |
+
"vocab.json in the FLUX model , for example (1 of 2 copies) : https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main/tokenizer\n",
|
| 573 |
+
"\n",
|
| 574 |
+
"I'm using Clip-vit-large-patch14 , which is used in SD 1.5 , and is one among the two tokenizers for SDXL and FLUX : https://huggingface.co/openai/clip-vit-large-patch14/blob/main/README.md\n",
|
| 575 |
+
"\n",
|
| 576 |
+
"This set of tokens has dimension 1x768. \n",
|
| 577 |
+
"\n",
|
| 578 |
+
"SDXL and FLUX uses an additional set of tokens of dimension 1x1024.\n",
|
| 579 |
+
"\n",
|
| 580 |
+
"These are not included in this notebook. Feel free to include them yourselves (I would appreciate that).\n",
|
| 581 |
+
"\n",
|
| 582 |
+
"To do so, you will have to download a FLUX and/or SDXL model\n",
|
| 583 |
+
"\n",
|
| 584 |
+
", and copy the 49407x1024 tensor list that is stored within the model and then save it as a .pt file.\n",
|
| 585 |
+
"\n",
|
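A rough sketch of that export, shown for the 1x768 clip-vit-large-patch14 encoder; the SDXL / FLUX 1x1024 encoder would follow the same pattern with its own model repo substituted (that repo name is not given here, so this is only the general shape of the code):

```python
# Sketch: export a CLIP text encoder's token-embedding table to a .pt file.
import torch
from transformers import CLIPTextModel

text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# get_input_embeddings() returns the nn.Embedding with one row per vocab.json entry.
embeddings = text_model.get_input_embeddings().weight.detach().clone()
print(embeddings.shape)   # [vocab_size, 768] for this encoder

torch.save(embeddings, "clip_token_embeddings.pt")   # output file name is just an example
```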
| 586 |
+
"//---//\n",
|
| 587 |
+
"\n",
|
| 588 |
+
"I am aware it is actually the 1x768 text_encoding being processed into an image for the SD models + FLUX.\n",
|
| 589 |
+
"\n",
|
| 590 |
+
"As such , I've included text_encoding comparison at the bottom of the Notebook.\n",
|
| 591 |
+
"\n",
|
| 592 |
+
"I am also aware thar SDXL and FLUX uses additional encodings , which are not included in this notebook.\n",
|
| 593 |
+
"\n",
|
| 594 |
+
"* Clip-vit-bigG for SDXL: https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k/blob/main/README.md\n",
|
| 595 |
+
"\n",
|
| 596 |
+
"* And the T5 text encoder for FLUX. I have 0% understanding of FLUX T5 text_encoder.\n",
|
| 597 |
+
"\n",
|
| 598 |
+
"//---//\n",
|
| 599 |
+
"\n",
|
| 600 |
+
"If you want them , feel free to include them yourself and share the results (cuz I probably won't) :)!\n",
|
| 601 |
+
"\n",
|
| 602 |
+
"That being said , being an encoding , I reckon the CLIP Nx768 => 1x768 should be \"linear\" (or whatever one might call it)\n",
|
| 603 |
+
"\n",
|
| 604 |
+
"So exchange a few tokens in the Nx768 for something similiar , and the resulting 1x768 ought to be kinda similar to 1x768 we had earlier. Hopefully.\n",
|
| 605 |
+
"\n",
|
| 606 |
+
"I feel its important to mention this , in case some wonder why the token-token similarity don't match the text-encoding to text-encoding similarity.\n",
|
| 607 |
+
"\n",
|
| 608 |
+
"# Note regarding CLIP text encoding vs. token\n",
|
| 609 |
+
"\n",
|
| 610 |
+
"*To make this disclaimer clear; Token-to-token similarity is not the same as text_encoding similarity.*\n",
|
| 611 |
+
"\n",
|
| 612 |
+
"I have to say this , since it will otherwise get (even more) confusing , as both the individual tokens , and the text_encoding have dimensions 1x768.\n",
|
| 613 |
+
"\n",
|
| 614 |
+
"They are separate things. Separate results. etc.\n",
|
| 615 |
+
"\n",
|
| 616 |
+
"As such , you will not get anything useful if you start comparing similarity between a token , and a text-encoding. So don't do that :)!\n",
|
| 617 |
+
"\n",
|
| 618 |
+
"# What about the CLIP image encoding?\n",
|
| 619 |
+
"\n",
|
| 620 |
+
"The CLIP model can also do an image_encoding of an image, where the output will be a 1x768 tensor. These _can_ be compared with the text_encoding.\n",
|
| 621 |
+
"\n",
|
| 622 |
+
"Comparing CLIP image_encoding with the CLIP text_encoding for a bunch of random prompts until you find the \"highest similarity\" , is a method used in the CLIP interrogator : https://huggingface.co/spaces/pharmapsychotic/CLIP-Interrogator\n",
|
| 623 |
+
"\n",
|
| 624 |
+
"List of random prompts for CLIP interrogator can be found here, for reference : https://github.com/pharmapsychotic/clip-interrogator/tree/main/clip_interrogator/data\n",
|
| 625 |
+
"\n",
|
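For reference, this is roughly how an image_encoding is compared against candidate text_encodings with the full CLIP model (the interrogator linked above does this over a large list of candidate prompts); the image path and prompt list below are placeholders:

```python
# Sketch: CLIP-interrogator-style comparison of an image_encoding with text_encodings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")                          # placeholder image
candidates = ["photo of a banana", "a painting of a cat"]  # placeholder prompt list

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize, then cosine similarity; the highest value is the best-matching prompt.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```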
| 626 |
+
"The CLIP image_encoding is not included in this Notebook.\n",
|
| 627 |
+
"\n",
|
| 628 |
+
"If you spot errors / ideas for improvememts; feel free to fix the code in your own notebook and post the results.\n",
|
| 629 |
+
"\n",
|
| 630 |
+
"I'd appreciate that over people saying \"your math is wrong you n00b!\" with no constructive feedback.\n",
|
| 631 |
+
"\n",
|
| 632 |
+
"//---//\n",
|
| 633 |
+
"\n",
|
| 634 |
+
"Regarding output\n",
|
| 635 |
+
"\n",
|
| 636 |
+
"# What are the </w> symbols?\n",
|
| 637 |
+
"\n",
|
| 638 |
+
"The whitespace symbol indicate if the tokenized item ends with whitespace ( the suffix \"banana</w>\" => \"banana \" ) or not (the prefix \"post\" in \"post-apocalyptic \")\n",
|
| 639 |
+
"\n",
|
| 640 |
+
"For ease of reference , I call them prefix-tokens and suffix-tokens.\n",
|
| 641 |
+
"\n",
|
| 642 |
+
"Sidenote:\n",
|
| 643 |
+
"\n",
|
| 644 |
+
"Prefix tokens have the unique property in that they \"mutate\" suffix tokens\n",
|
| 645 |
+
"\n",
|
| 646 |
+
"Example: \"photo of a #prefix#-banana\"\n",
|
| 647 |
+
"\n",
|
| 648 |
+
"where #prefix# is a randomly selected prefix-token from the vocab.json\n",
|
| 649 |
+
"\n",
|
| 650 |
+
"The hyphen \"-\" exists to guarantee the tokenized text splits into the written #prefix# and #suffix# token respectively. The \"-\" hypen symbol can be replaced by any other special character of your choosing.\n",
|
| 651 |
+
"\n",
|
| 652 |
+
" Capital letters work too , e.g \"photo of a #prefix#Abanana\" since the capital letters A-Z are only listed once in the entire vocab.json.\n",
|
| 653 |
+
"\n",
|
| 654 |
+
"You can also choose to omit any separator and just rawdog it with the prompt \"photo of a #prefix#banana\" , however know that this may , on occasion , be tokenized as completely different tokens of lower ID:s.\n",
|
| 655 |
+
"\n",
|
| 656 |
+
"Curiously , common NSFW terms found online have in the CLIP model have been purposefully fragmented into separate #prefix# and #suffix# counterparts in the vocab.json. Likely for PR-reasons.\n",
|
| 657 |
+
"\n",
|
| 658 |
+
"You can verify the results using this online tokenizer: https://sd-tokenizer.rocker.boo/\n",
|
| 659 |
+
"\n",
|
| 660 |
+
"<div>\n",
|
| 661 |
+
"<img src=\"https://lemmy.world/pictrs/image/43467d75-7406-4a13-93ca-cdc469f944fc.jpeg\" width=\"300\"/>\n",
|
| 662 |
+
"<img src=\"https://lemmy.world/pictrs/image/c0411565-0cb3-47b1-a788-b368924d6f17.jpeg\" width=\"300\"/>\n",
|
| 663 |
+
"<img src=\"https://lemmy.world/pictrs/image/c27c6550-a88b-4543-9bd7-067dff016be2.jpeg\" width=\"300\"/>\n",
|
| 664 |
+
"</div>\n",
|
| 665 |
+
"\n",
|
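You can also check the prefix/suffix split locally with the same tokenizer the notebook loads; "neo" is just an example prefix word here, not a recommendation:

```python
# Sketch: inspect how "#prefix#-#suffix#"-style prompts split into prefix and suffix tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for text in ["photo of a neo-banana", "photo of a neoAbanana", "photo of a neobanana"]:
    print(text, '=>', tokenizer.tokenize(text))
# Entries ending in "</w>" are suffix tokens (followed by whitespace); the rest are prefix tokens.
```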
| 666 |
+
"# What is that gibberish tokens that show up?\n",
|
| 667 |
+
"\n",
|
| 668 |
+
"The gibberish tokens like \"ðŁĺħ\\</w>\" are actually emojis!\n",
|
| 669 |
+
"\n",
|
| 670 |
+
"Try writing some emojis in this online tokenizer to see the results: https://sd-tokenizer.rocker.boo/\n",
|
| 671 |
+
"\n",
|
| 672 |
+
"It is a bit borked as it can't process capital letters properly.\n",
|
| 673 |
+
"\n",
|
| 674 |
+
"Also note that this is not reversible.\n",
|
| 675 |
+
"\n",
|
| 676 |
+
"If tokenization \"😅\" => ðŁĺħ</w>\n",
|
| 677 |
+
"\n",
|
| 678 |
+
"Then you can't prompt \"ðŁĺħ\" and expect to get the same result as the tokenized original emoji , \"😅\".\n",
|
| 679 |
+
"\n",
|
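You can see the emoji-to-"gibberish" mapping directly with the tokenizer; the strings it prints are exactly the byte-level entries stored in vocab.json:

```python
# Sketch: emojis are stored as byte-level BPE strings, which is why they look like gibberish.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

ids = tokenizer(text="😅")["input_ids"]        # [49406, ..., 49407]
print(tokenizer.convert_ids_to_tokens(ids))     # the 'ðŁĺħ</w>'-style vocab entries
```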
| 680 |
+
"SD 1.5 models actually have training for Emojis.\n",
|
| 681 |
+
"\n",
|
| 682 |
+
"But you have to set CLIP skip to 1 for this to work is intended.\n",
|
| 683 |
+
"\n",
|
| 684 |
+
"For example, this is the result from \"photo of a 🧔🏻♂️\"\n",
|
| 685 |
+
"\n",
|
| 686 |
+
"\n",
|
| 687 |
+
"<div>\n",
|
| 688 |
+
"<img src=\"https://lemmy.world/pictrs/image/e2b51aea-6960-4ad0-867e-8ce85f2bd51e.jpeg\" width=\"300\"/>\n",
|
| 689 |
+
"</div>\n",
|
| 690 |
+
"\n",
|
| 691 |
+
"A tutorial on stuff you can do with the vocab.list concluded.\n",
|
| 692 |
+
"\n",
|
| 693 |
+
"Anyways, have fun with the notebook.\n",
|
| 694 |
+
"\n",
|
| 695 |
+
"There might be some updates in the future with features not mentioned here.\n",
|
| 696 |
"\n",
|
| 697 |
+
"//---//"
|
| 698 |
],
|
| 699 |
"metadata": {
|
| 700 |
"id": "njeJx_nSSA8H"
|