Upload sd_token_similarity_calculator.ipynb
sd_token_similarity_calculator.ipynb  +332 -26
CHANGED
|
@@ -17,12 +17,23 @@
|
|
| 17 |
{
|
| 18 |
"cell_type": "markdown",
|
| 19 |
"source": [
|
| 20 |
-
"This Notebook is a Stable-diffusion tool which allows you to find similiar tokens from the SD 1.5 vocab.json that you can use for text-to-image generation. Try this Free online SD 1.5 generator with the results: https://perchance.org/fusion-ai-image-generator"
|
| 21 |
],
|
| 22 |
"metadata": {
|
| 23 |
"id": "L7JTcbOdBPfh"
|
| 24 |
}
|
| 25 |
},
|
| 26 |
{
|
| 27 |
"cell_type": "code",
|
| 28 |
"source": [
|
|
@@ -88,6 +99,144 @@
|
|
| 88 |
"execution_count": null,
|
| 89 |
"outputs": []
|
| 90 |
},
|
| 91 |
{
|
| 92 |
"cell_type": "code",
|
| 93 |
"source": [
|
|
@@ -107,22 +256,10 @@
|
|
| 107 |
"#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID."
|
| 108 |
],
|
| 109 |
"metadata": {
|
| 110 |
-
"id": "RPdkYzT2_X85"
|
| 111 |
-
"colab": {
|
| 112 |
-
"base_uri": "https://localhost:8080/"
|
| 113 |
-
},
|
| 114 |
-
"outputId": "86f2f01e-6a04-4292-cee7-70fd8398e07f"
|
| 115 |
},
|
| 116 |
"execution_count": null,
|
| 117 |
-
"outputs": [
|
| 118 |
-
{
|
| 119 |
-
"output_type": "stream",
|
| 120 |
-
"name": "stdout",
|
| 121 |
-
"text": [
|
| 122 |
-
"[49406, 8922, 49407]\n"
|
| 123 |
-
]
|
| 124 |
-
}
|
| 125 |
-
]
|
| 126 |
},
|
| 127 |
{
|
| 128 |
"cell_type": "code",
|
|
@@ -353,21 +490,20 @@
|
|
| 353 |
"source": [
|
| 354 |
"\n",
|
| 355 |
"\n",
|
| 356 |
-
"
|
| 357 |
"\n",
|
| 358 |
"Similiar vectors = similiar output in the SD 1.5 / SDXL / FLUX model\n",
|
| 359 |
"\n",
|
| 360 |
-
"CLIP converts the prompt text to vectors (“tensors”) , with float32 values usually ranging from -1 to 1
|
| 361 |
"\n",
|
| 362 |
-
"Dimensions are [ 1x768 ] tensors for SD 1.5 , and a [ 1x768 , 1x1024 ] tensor for SDXL and FLUX.\n",
|
| 363 |
"\n",
|
| 364 |
"The SD models and FLUX converts these vectors to an image.\n",
|
| 365 |
"\n",
|
| 366 |
-
"This notebook takes an input string , tokenizes it and matches the first token against the 49407 token vectors in the vocab.json :
|
| 367 |
"\n",
|
| 368 |
"It finds the “most similiar tokens” in the list. Similarity is the theta angle between the token vectors.\n",
|
| 369 |
"\n",
|
| 370 |
-
"\n",
|
| 371 |
"<div>\n",
|
| 372 |
"<img src=\"https://huggingface.co/datasets/codeShare/sd_tokens/resolve/main/cosine.jpeg\" width=\"300\"/>\n",
|
| 373 |
"</div>\n",
|
|
@@ -376,19 +512,189 @@
|
|
| 376 |
"\n",
|
| 377 |
"Negative similarity is also possible.\n",
|
| 378 |
"\n",
|
| 379 |
-
"
|
| 380 |
"\n",
|
| 381 |
-
"
|
| 382 |
"\n",
|
| 383 |
-
"
|
| 384 |
"\n",
|
| 385 |
-
"
|
| 386 |
"\n",
|
| 387 |
"So TLDR; vector direction = “what to generate” , vector magnitude = “prompt weights”\n",
|
| 388 |
"\n",
|
| 389 |
-
"
|
| 390 |
"\n",
|
| 391 |
-
"
|
| 392 |
],
|
| 393 |
"metadata": {
|
| 394 |
"id": "njeJx_nSSA8H"
|
|
|
|
| 17 |
{
|
| 18 |
"cell_type": "markdown",
|
| 19 |
"source": [
|
| 20 |
+
"This Notebook is a Stable-diffusion tool which allows you to find similiar tokens from the SD 1.5 vocab.json that you can use for text-to-image generation. Try this Free online SD 1.5 generator with the results: https://perchance.org/fusion-ai-image-generator\n",
|
| 21 |
+
"\n",
|
| 22 |
+
"Scroll to the bottom of the notebook to see the guide for how this works."
|
| 23 |
],
|
| 24 |
"metadata": {
|
| 25 |
"id": "L7JTcbOdBPfh"
|
| 26 |
}
|
| 27 |
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "code",
|
| 30 |
+
"source": [],
|
| 31 |
+
"metadata": {
|
| 32 |
+
"id": "PBwVIuAjEdHA"
|
| 33 |
+
},
|
| 34 |
+
"execution_count": null,
|
| 35 |
+
"outputs": []
|
| 36 |
+
},
|
| 37 |
{
|
| 38 |
"cell_type": "code",
|
| 39 |
"source": [
|
|
|
|
| 99 |
"execution_count": null,
|
| 100 |
"outputs": []
|
| 101 |
},
|
| 102 |
+
{
|
| 103 |
+
"cell_type": "code",
|
| 104 |
+
"source": [
|
| 105 |
+
"# @title ⚡ Get similiar tokens\n",
|
| 106 |
+
"from transformers import AutoTokenizer\n",
|
| 107 |
+
"tokenizer = AutoTokenizer.from_pretrained(\"openai/clip-vit-large-patch14\", clean_up_tokenization_spaces = False)\n",
|
| 108 |
+
"\n",
|
| 109 |
+
"prompt= \"banana\" # @param {type:'string'}\n",
|
| 110 |
+
"\n",
|
| 111 |
+
"tokenizer_output = tokenizer(text = prompt)\n",
|
| 112 |
+
"input_ids = tokenizer_output['input_ids']\n",
|
| 113 |
+
"print(input_ids)\n",
|
| 114 |
+
"\n",
|
| 115 |
+
"\n",
|
| 116 |
+
"#The prompt will be enclosed with the <|start-of-text|> and <|end-of-text|> tokens, which is why output will be [49406, ... , 49407].\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID.\n",
|
| 119 |
+
"\n",
|
| 120 |
+
"id_A = input_ids[1]\n",
|
| 121 |
+
"A = token[id_A]\n",
|
| 122 |
+
"_A = LA.vector_norm(A, ord=2)\n",
|
| 123 |
+
"\n",
|
| 124 |
+
"#if no imput exists we just randomize the entire thing\n",
|
| 125 |
+
"if (prompt == \"\"):\n",
|
| 126 |
+
" id_A = -1\n",
|
| 127 |
+
" print(\"Tokenized prompt tensor A is a random valued tensor with no ID\")\n",
|
| 128 |
+
" R = torch.rand(768)\n",
|
| 129 |
+
" _R = LA.vector_norm(R, ord=2)\n",
|
| 130 |
+
" A = R*(_A/_R)\n",
|
| 131 |
+
"\n",
|
| 132 |
+
"\n",
|
| 133 |
+
"mix_with = \"\" # @param {\"type\":\"string\",\"placeholder\":\"(optional) write something else\"}\n",
|
| 134 |
+
"mix_method = \"None\" # @param [\"None\" , \"Average\", \"Subtract\"] {allow-input: true}\n",
|
| 135 |
+
"w = 0.5 # @param {type:\"slider\", min:0, max:1, step:0.01}\n",
|
| 136 |
+
"\n",
|
| 137 |
+
"tokenizer_output = tokenizer(text = mix_with)\n",
|
| 138 |
+
"input_ids = tokenizer_output['input_ids']\n",
|
| 139 |
+
"id_C = input_ids[1]\n",
|
| 140 |
+
"C = token[id_C]\n",
|
| 141 |
+
"_C = LA.vector_norm(C, ord=2)\n",
|
| 142 |
+
"\n",
|
| 143 |
+
"#if no imput exists we just randomize the entire thing\n",
|
| 144 |
+
"if (mix_with == \"\"):\n",
|
| 145 |
+
" id_C = -1\n",
|
| 146 |
+
" print(\"Tokenized prompt 'mix_with' tensor C is a random valued tensor with no ID\")\n",
|
| 147 |
+
" R = torch.rand(768)\n",
|
| 148 |
+
" _R = LA.vector_norm(R, ord=2)\n",
|
| 149 |
+
" C = R*(_C/_R)\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"if (mix_method == \"None\"):\n",
|
| 152 |
+
" print(\"No operation\")\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"if (mix_method == \"Average\"):\n",
|
| 155 |
+
" A = w*A + (1-w)*C\n",
|
| 156 |
+
" _A = LA.vector_norm(A, ord=2)\n",
|
| 157 |
+
" print(\"Tokenized prompt tensor A has been recalculated as A = w*A + (1-w)*C , where C is the tokenized prompt 'mix_with' tensor C\")\n",
|
| 158 |
+
"\n",
|
| 159 |
+
"if (mix_method == \"Subtract\"):\n",
|
| 160 |
+
" tmp = (A/_A) - (C/_C)\n",
|
| 161 |
+
" _tmp = LA.vector_norm(tmp, ord=2)\n",
|
| 162 |
+
" A = tmp*((w*_A + (1-w)*_C)/_tmp)\n",
|
| 163 |
+
" _A = LA.vector_norm(A, ord=2)\n",
|
| 164 |
+
" print(\"Tokenized prompt tensor A has been recalculated as A = (w*_A + (1-w)*_C) * norm(w*A - (1-w)*C) , where C is the tokenized prompt 'mix_with' tensor C\")\n",
|
| 165 |
+
"\n",
|
| 166 |
+
"#OPTIONAL : Add/subtract + normalize above result with another token. Leave field empty to get a random value tensor\n",
|
| 167 |
+
"\n",
|
| 168 |
+
"dots = torch.zeros(NUM_TOKENS)\n",
|
| 169 |
+
"for index in range(NUM_TOKENS):\n",
|
| 170 |
+
" id_B = index\n",
|
| 171 |
+
" B = token[id_B]\n",
|
| 172 |
+
" _B = LA.vector_norm(B, ord=2)\n",
|
| 173 |
+
" result = torch.dot(A,B)/(_A*_B)\n",
|
| 174 |
+
" #result = absolute_value(result.item())\n",
|
| 175 |
+
" result = result.item()\n",
|
| 176 |
+
" dots[index] = result\n",
|
| 177 |
+
"\n",
|
| 178 |
+
"name_A = \"A of random type\"\n",
|
| 179 |
+
"if (id_A>-1):\n",
|
| 180 |
+
" name_A = vocab[id_A]\n",
|
| 181 |
+
"\n",
|
| 182 |
+
"name_C = \"token C of random type\"\n",
|
| 183 |
+
"if (id_C>-1):\n",
|
| 184 |
+
" name_C = vocab[id_C]\n",
|
| 185 |
+
"\n",
|
| 186 |
+
"\n",
|
| 187 |
+
"sorted, indices = torch.sort(dots,dim=0 , descending=True)\n",
|
| 188 |
+
"#----#\n",
|
| 189 |
+
"if (mix_method == \"Average\"):\n",
|
| 190 |
+
" print(f'Calculated all cosine-similarities between the average of token {name_A} and {name_C} with Id_A = {id_A} and mixed Id_C = {id_C} as a 1x{sorted.shape[0]} tensor')\n",
|
| 191 |
+
"if (mix_method == \"Subtract\"):\n",
|
| 192 |
+
" print(f'Calculated all cosine-similarities between the subtract of token {name_A} and {name_C} with Id_A = {id_A} and mixed Id_C = {id_C} as a 1x{sorted.shape[0]} tensor')\n",
|
| 193 |
+
"if (mix_method == \"None\"):\n",
|
| 194 |
+
" print(f'Calculated all cosine-similarities between the token {name_A} with Id_A = {id_A} with the the rest of the {NUM_TOKENS} tokens as a 1x{sorted.shape[0]} tensor')\n",
|
| 195 |
+
"\n",
|
| 196 |
+
"#Produce a list id IDs that are most similiar to the prompt ID at positiion 1 based on above result\n",
|
| 197 |
+
"\n",
|
| 198 |
+
"list_size = 100 # @param {type:'number'}\n",
|
| 199 |
+
"print_ID = False # @param {type:\"boolean\"}\n",
|
| 200 |
+
"print_Similarity = True # @param {type:\"boolean\"}\n",
|
| 201 |
+
"print_Name = True # @param {type:\"boolean\"}\n",
|
| 202 |
+
"print_Divider = True # @param {type:\"boolean\"}\n",
|
| 203 |
+
"\n",
|
| 204 |
+
"\n",
|
| 205 |
+
"if (print_Divider):\n",
|
| 206 |
+
" print('//---//') # % value\n",
|
| 207 |
+
"\n",
|
| 208 |
+
"print('') # % value\n",
|
| 209 |
+
"print('Here is the result : ') # % value\n",
|
| 210 |
+
"print('') # % value\n",
|
| 211 |
+
"\n",
|
| 212 |
+
"for index in range(list_size):\n",
|
| 213 |
+
" id = indices[index].item()\n",
|
| 214 |
+
" if (print_Name):\n",
|
| 215 |
+
" print(f'{vocab[id]}') # vocab item\n",
|
| 216 |
+
" if (print_ID):\n",
|
| 217 |
+
" print(f'ID = {id}') # IDs\n",
|
| 218 |
+
" if (print_Similarity):\n",
|
| 219 |
+
" print(f'similiarity = {round(sorted[index].item()*100,2)} %') # % value\n",
|
| 220 |
+
" if (print_Divider):\n",
|
| 221 |
+
" print('--------')\n",
|
| 222 |
+
"\n",
|
| 223 |
+
"#Print the sorted list from above result"
|
| 224 |
+
],
|
| 225 |
+
"metadata": {
|
| 226 |
+
"id": "iWeFnT1gAx6A"
|
| 227 |
+
},
|
| 228 |
+
"execution_count": null,
|
| 229 |
+
"outputs": []
|
| 230 |
+
},
|
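Note: the cell above relies on `token`, `vocab` and `NUM_TOKENS` being defined by an earlier cell that is not shown in this diff. Below is a minimal sketch of what such a setup cell could look like, assuming the CLIP token embeddings were already saved as a [NUM_TOKENS, 768] tensor in a .pt file and that vocab.json maps token strings to IDs; the file names are placeholders, not taken from this repo.

```python
# Sketch of the assumed setup cell: defines token, vocab and NUM_TOKENS for the cell above.
# File names are placeholders; the real notebook may load these differently.
import json
import torch
from torch import linalg as LA   # LA.vector_norm is what the cell above calls

NUM_TOKENS = 49407

# A [NUM_TOKENS, 768] float32 tensor holding one CLIP token embedding per vocab entry.
token = torch.load('sd15_clip_token_embeddings.pt', map_location='cpu')

# vocab.json maps token string -> id; flip it so vocab[id] returns the token string.
with open('vocab.json', 'r', encoding='utf-8') as f:
    vocab = {v: k for k, v in json.load(f).items()}
```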
| 231 |
+
{
|
| 232 |
+
"cell_type": "markdown",
|
| 233 |
+
"source": [
|
| 234 |
+
"# ↓ Sub modules (use these to build your own projects) ↓"
|
| 235 |
+
],
|
| 236 |
+
"metadata": {
|
| 237 |
+
"id": "_d8WtPgtAymM"
|
| 238 |
+
}
|
| 239 |
+
},
|
| 240 |
{
|
| 241 |
"cell_type": "code",
|
| 242 |
"source": [
|
|
|
|
| 256 |
"#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID."
|
| 257 |
],
|
| 258 |
"metadata": {
|
| 259 |
+
"id": "RPdkYzT2_X85"
|
| 260 |
},
|
| 261 |
"execution_count": null,
|
| 262 |
+
"outputs": []
|
| 263 |
},
|
| 264 |
{
|
| 265 |
"cell_type": "code",
|
|
|
|
| 490 |
"source": [
|
| 491 |
"\n",
|
| 492 |
"\n",
|
| 493 |
+
"# How does this notebook work?\n",
|
| 494 |
"\n",
|
| 495 |
"Similiar vectors = similiar output in the SD 1.5 / SDXL / FLUX model\n",
|
| 496 |
"\n",
|
| 497 |
+
"CLIP converts the prompt text to vectors (“tensors”) , with float32 values usually ranging from -1 to 1.\n",
|
| 498 |
"\n",
|
| 499 |
+
"Dimensions are \\[ 1x768 ] tensors for SD 1.5 , and a \\[ 1x768 , 1x1024 ] tensor for SDXL and FLUX.\n",
|
| 500 |
"\n",
|
| 501 |
"The SD models and FLUX converts these vectors to an image.\n",
|
| 502 |
"\n",
|
| 503 |
+
"This notebook takes an input string , tokenizes it and matches the first token against the 49407 token vectors in the vocab.json : [https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main/tokenizer](https://www.google.com/url?q=https%3A%2F%2Fhuggingface.co%2Fblack-forest-labs%2FFLUX.1-dev%2Ftree%2Fmain%2Ftokenizer)\n",
|
| 504 |
"\n",
|
| 505 |
"It finds the “most similiar tokens” in the list. Similarity is the theta angle between the token vectors.\n",
|
| 506 |
"\n",
|
|
|
|
| 507 |
"<div>\n",
|
| 508 |
"<img src=\"https://huggingface.co/datasets/codeShare/sd_tokens/resolve/main/cosine.jpeg\" width=\"300\"/>\n",
|
| 509 |
"</div>\n",
|
|
|
|
| 512 |
"\n",
|
| 513 |
"Negative similarity is also possible.\n",
|
| 514 |
"\n",
|
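In code, the "theta angle" similarity described above is the cosine of the angle between two token vectors. A minimal sketch follows; the `token` tensor and the IDs come from the notebook's earlier cells, and the helper name is mine, not the notebook's.

```python
# Cosine similarity between two token vectors: cos(theta) = (a . b) / (|a| * |b|)
import torch
from torch import linalg as LA

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    return (torch.dot(a, b) / (LA.vector_norm(a, ord=2) * LA.vector_norm(b, ord=2))).item()

# e.g. sim = cosine_similarity(token[id_A], token[id_B])
# 1.0 = same direction, 0.0 = orthogonal, negative values = opposite-leaning directions.
```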
| 515 |
+
"# How can I use it?\n",
|
| 516 |
"\n",
|
| 517 |
+
"If you are bored of prompting “girl” and want something similiar you can run this notebook and use the “chick” token at 21.88% similarity , for example\n",
|
| 518 |
"\n",
|
| 519 |
+
"You can also run a mixed search , like “cute+girl”/2 , where for example “kpop” has a 16.71% similarity\n",
|
| 520 |
"\n",
|
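A "mixed search" like the one above simply averages two token vectors before ranking the whole vocab by cosine similarity. A rough sketch, reusing the cosine_similarity helper sketched earlier and the notebook's tokenizer / token / vocab / NUM_TOKENS variables:

```python
# Sketch of a mixed search: score every vocab entry against the average of two token vectors.
import torch

ids = tokenizer(text="cute girl")['input_ids']     # expected [49406, <cute>, <girl>, 49407]
mixed = 0.5 * token[ids[1]] + 0.5 * token[ids[2]]   # "cute+girl"/2

dots = torch.tensor([cosine_similarity(mixed, token[i]) for i in range(NUM_TOKENS)])
sims, indices = torch.sort(dots, descending=True)
for rank in range(10):
    print(vocab[indices[rank].item()], '=', round(sims[rank].item() * 100, 2), '%')
```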
| 521 |
+
"There are some strange tokens further down the list you go. Example: tokens similiar to the token \"pewdiepie</w>\" (yes this is an actual token that exists in CLIP)\n",
|
| 522 |
+
"\n",
|
| 523 |
+
"<div>\n",
|
| 524 |
+
"<img src=\"https://lemmy.world/pictrs/image/a1cd284e-3341-4284-9949-5f8b58d3bd1f.jpeg\" width=\"300\"/>\n",
|
| 525 |
+
"</div>\n",
|
| 526 |
+
"\n",
|
| 527 |
+
"Each of these correspond to a unique 1x768 token vector.\n",
|
| 528 |
+
"\n",
|
| 529 |
+
"The higher the ID value , the less often the token appeared in the CLIP training data.\n",
|
| 530 |
+
"\n",
|
| 531 |
+
"To reiterate; this is the CLIP model training data , not the SD-model training data.\n",
|
| 532 |
+
"\n",
|
| 533 |
+
"So for certain models , tokens with high ID can give very consistent results , if the SD model is trained to handle them.\n",
|
| 534 |
+
"\n",
|
| 535 |
+
"Example of this can be anime models , where japanese artist names can affect the output greatly. \n",
|
| 536 |
+
"\n",
|
| 537 |
+
"Tokens with high ID will often give the \"fun\" output when used in very short prompts.\n",
|
| 538 |
+
"\n",
|
| 539 |
+
"# What about token vector length?\n",
|
| 540 |
+
"\n",
|
| 541 |
+
"If you are wondering about token magnitude,\n",
|
| 542 |
+
"Prompt weights like (banana:1.2) will scale the magnitude of the corresponding 1x768 tensor(s) by 1.2 . So thats how prompt token magnitude works.\n",
|
| 543 |
+
"\n",
|
| 544 |
+
"Source: [https://huggingface.co/docs/diffusers/main/en/using-diffusers/weighted\\_prompts](https://www.google.com/url?q=https%3A%2F%2Fhuggingface.co%2Fdocs%2Fdiffusers%2Fmain%2Fen%2Fusing-diffusers%2Fweighted_prompts)\\*\n",
|
| 545 |
"\n",
|
| 546 |
"So TLDR; vector direction = “what to generate” , vector magnitude = “prompt weights”\n",
|
| 547 |
"\n",
|
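As a concrete illustration of the magnitude point above, this is roughly what a (banana:1.2) prompt weight does to the token vector. It is a sketch of the idea, not the actual code of any particular UI; tokenizer and token come from the notebook's earlier cells.

```python
# Sketch: a prompt weight like (banana:1.2) scales the token vector's magnitude by 1.2,
# while its direction ("what to generate") stays the same.
import torch
from torch import linalg as LA

ids = tokenizer(text="banana")['input_ids']        # [49406, <banana>, 49407]
banana = token[ids[1]]
weighted = 1.2 * banana

print(LA.vector_norm(banana, ord=2).item(), '->', LA.vector_norm(weighted, ord=2).item())
print(torch.allclose(weighted / LA.vector_norm(weighted, ord=2),
                     banana / LA.vector_norm(banana, ord=2)))   # True: direction unchanged
```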
| 548 |
+
"# How prompting works (technical summary)\n",
|
| 549 |
+
"\n",
|
| 550 |
+
" 1. There is no correct way to prompt.\n",
|
| 551 |
+
"\n",
|
| 552 |
+
"2. Stable diffusion reads your prompt left to right, one token at a time, finding association _from_ the previous token _to_ the current token _and to_ the image generated thus far (Cross Attention Rule)\n",
|
| 553 |
+
"\n",
|
| 554 |
+
"3. Stable Diffusion is an optimization problem that seeks to maximize similarity to prompt and minimize similarity to negatives (Optimization Rule)\n",
|
| 555 |
+
"\n",
|
| 556 |
+
"Reference material (covers entire SD , so not good source material really, but the info is there) : https://youtu.be/sFztPP9qPRc?si=ge2Ty7wnpPGmB0gi\n",
|
| 557 |
+
"\n",
|
| 558 |
+
"# The SD pipeline\n",
|
| 559 |
+
"\n",
|
| 560 |
+
"For every step (20 in total by default) for SD1.5 :\n",
|
| 561 |
+
"\n",
|
| 562 |
+
"1. Prompt text => (tokenizer)\n",
|
| 563 |
+
"2. => Nx768 token vectors =>(CLIP model) =>\n",
|
| 564 |
+
"3. 1x768 encoding => ( the SD model / Unet ) =>\n",
|
| 565 |
+
"4. => _Desired_ image per Rule 3 => ( sampler)\n",
|
| 566 |
+
"5. => Paint a section of the image => (image)\n",
|
| 567 |
+
"\n",
|
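The first stages of that pipeline (prompt text → token IDs → CLIP hidden states) can be reproduced directly with transformers; the U-Net and sampler stages are not shown here. A sketch using the same clip-vit-large-patch14 model the notebook already loads:

```python
# Sketch of pipeline steps 1-3: prompt text -> token ids -> per-token vectors -> text encoding.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(text="photo of a banana", return_tensors="pt")
with torch.no_grad():
    out = text_model(**inputs)

print(out.last_hidden_state.shape)  # [1, N, 768] per-token hidden states (the text conditioning)
print(out.pooler_output.shape)      # [1, 768] pooled text_encoding discussed later in this guide
```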
| 568 |
+
"# Disclaimer /Trivia\n",
|
| 569 |
+
"\n",
|
| 570 |
+
"This notebook should be seen as a \"dictionary search tool\" for the vocab.json , which is the same for SD1.5 , SDXL and FLUX. Feel free to verify this by checking the 'tokenizer' folder under each model.\n",
|
| 571 |
+
"\n",
|
| 572 |
+
"vocab.json in the FLUX model , for example (1 of 2 copies) : https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main/tokenizer\n",
|
| 573 |
+
"\n",
|
| 574 |
+
"I'm using Clip-vit-large-patch14 , which is used in SD 1.5 , and is one among the two tokenizers for SDXL and FLUX : https://huggingface.co/openai/clip-vit-large-patch14/blob/main/README.md\n",
|
| 575 |
+
"\n",
|
| 576 |
+
"This set of tokens has dimension 1x768. \n",
|
| 577 |
+
"\n",
|
| 578 |
+
"SDXL and FLUX uses an additional set of tokens of dimension 1x1024.\n",
|
| 579 |
+
"\n",
|
| 580 |
+
"These are not included in this notebook. Feel free to include them yourselves (I would appreciate that).\n",
|
| 581 |
+
"\n",
|
| 582 |
+
"To do so, you will have to download a FLUX and/or SDXL model\n",
|
| 583 |
+
"\n",
|
| 584 |
+
", and copy the 49407x1024 tensor list that is stored within the model and then save it as a .pt file.\n",
|
| 585 |
+
"\n",
|
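A rough sketch of that export, shown for the 1x768 clip-vit-large-patch14 encoder; the SDXL / FLUX 1x1024 encoder would follow the same pattern with its own model repo substituted (that repo name is not given here, so this is only the general shape of the code):

```python
# Sketch: export a CLIP text encoder's token-embedding table to a .pt file.
import torch
from transformers import CLIPTextModel

text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# get_input_embeddings() returns the nn.Embedding with one row per vocab.json entry.
embeddings = text_model.get_input_embeddings().weight.detach().clone()
print(embeddings.shape)   # [vocab_size, 768] for this encoder

torch.save(embeddings, "clip_token_embeddings.pt")   # output file name is just an example
```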
| 586 |
+
"//---//\n",
|
| 587 |
+
"\n",
|
| 588 |
+
"I am aware it is actually the 1x768 text_encoding being processed into an image for the SD models + FLUX.\n",
|
| 589 |
+
"\n",
|
| 590 |
+
"As such , I've included text_encoding comparison at the bottom of the Notebook.\n",
|
| 591 |
+
"\n",
|
| 592 |
+
"I am also aware thar SDXL and FLUX uses additional encodings , which are not included in this notebook.\n",
|
| 593 |
+
"\n",
|
| 594 |
+
"* Clip-vit-bigG for SDXL: https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k/blob/main/README.md\n",
|
| 595 |
+
"\n",
|
| 596 |
+
"* And the T5 text encoder for FLUX. I have 0% understanding of FLUX T5 text_encoder.\n",
|
| 597 |
+
"\n",
|
| 598 |
+
"//---//\n",
|
| 599 |
+
"\n",
|
| 600 |
+
"If you want them , feel free to include them yourself and share the results (cuz I probably won't) :)!\n",
|
| 601 |
+
"\n",
|
| 602 |
+
"That being said , being an encoding , I reckon the CLIP Nx768 => 1x768 should be \"linear\" (or whatever one might call it)\n",
|
| 603 |
+
"\n",
|
| 604 |
+
"So exchange a few tokens in the Nx768 for something similiar , and the resulting 1x768 ought to be kinda similar to 1x768 we had earlier. Hopefully.\n",
|
| 605 |
+
"\n",
|
| 606 |
+
"I feel its important to mention this , in case some wonder why the token-token similarity don't match the text-encoding to text-encoding similarity.\n",
|
| 607 |
+
"\n",
|
| 608 |
+
"# Note regarding CLIP text encoding vs. token\n",
|
| 609 |
+
"\n",
|
| 610 |
+
"*To make this disclaimer clear; Token-to-token similarity is not the same as text_encoding similarity.*\n",
|
| 611 |
+
"\n",
|
| 612 |
+
"I have to say this , since it will otherwise get (even more) confusing , as both the individual tokens , and the text_encoding have dimensions 1x768.\n",
|
| 613 |
+
"\n",
|
| 614 |
+
"They are separate things. Separate results. etc.\n",
|
| 615 |
+
"\n",
|
| 616 |
+
"As such , you will not get anything useful if you start comparing similarity between a token , and a text-encoding. So don't do that :)!\n",
|
| 617 |
+
"\n",
|
| 618 |
+
"# What about the CLIP image encoding?\n",
|
| 619 |
+
"\n",
|
| 620 |
+
"The CLIP model can also do an image_encoding of an image, where the output will be a 1x768 tensor. These _can_ be compared with the text_encoding.\n",
|
| 621 |
+
"\n",
|
| 622 |
+
"Comparing CLIP image_encoding with the CLIP text_encoding for a bunch of random prompts until you find the \"highest similarity\" , is a method used in the CLIP interrogator : https://huggingface.co/spaces/pharmapsychotic/CLIP-Interrogator\n",
|
| 623 |
+
"\n",
|
| 624 |
+
"List of random prompts for CLIP interrogator can be found here, for reference : https://github.com/pharmapsychotic/clip-interrogator/tree/main/clip_interrogator/data\n",
|
| 625 |
+
"\n",
|
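For reference, this is roughly how an image_encoding is compared against candidate text_encodings with the full CLIP model (the interrogator linked above does this over a large list of candidate prompts); the image path and prompt list below are placeholders:

```python
# Sketch: CLIP-interrogator-style comparison of an image_encoding with text_encodings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")                          # placeholder image
candidates = ["photo of a banana", "a painting of a cat"]  # placeholder prompt list

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize, then cosine similarity; the highest value is the best-matching prompt.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```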
| 626 |
+
"The CLIP image_encoding is not included in this Notebook.\n",
|
| 627 |
+
"\n",
|
| 628 |
+
"If you spot errors / ideas for improvememts; feel free to fix the code in your own notebook and post the results.\n",
|
| 629 |
+
"\n",
|
| 630 |
+
"I'd appreciate that over people saying \"your math is wrong you n00b!\" with no constructive feedback.\n",
|
| 631 |
+
"\n",
|
| 632 |
+
"//---//\n",
|
| 633 |
+
"\n",
|
| 634 |
+
"Regarding output\n",
|
| 635 |
+
"\n",
|
| 636 |
+
"# What are the </w> symbols?\n",
|
| 637 |
+
"\n",
|
| 638 |
+
"The whitespace symbol indicate if the tokenized item ends with whitespace ( the suffix \"banana</w>\" => \"banana \" ) or not (the prefix \"post\" in \"post-apocalyptic \")\n",
|
| 639 |
+
"\n",
|
| 640 |
+
"For ease of reference , I call them prefix-tokens and suffix-tokens.\n",
|
| 641 |
+
"\n",
|
| 642 |
+
"Sidenote:\n",
|
| 643 |
+
"\n",
|
| 644 |
+
"Prefix tokens have the unique property in that they \"mutate\" suffix tokens\n",
|
| 645 |
+
"\n",
|
| 646 |
+
"Example: \"photo of a #prefix#-banana\"\n",
|
| 647 |
+
"\n",
|
| 648 |
+
"where #prefix# is a randomly selected prefix-token from the vocab.json\n",
|
| 649 |
+
"\n",
|
| 650 |
+
"The hyphen \"-\" exists to guarantee the tokenized text splits into the written #prefix# and #suffix# token respectively. The \"-\" hypen symbol can be replaced by any other special character of your choosing.\n",
|
| 651 |
+
"\n",
|
| 652 |
+
" Capital letters work too , e.g \"photo of a #prefix#Abanana\" since the capital letters A-Z are only listed once in the entire vocab.json.\n",
|
| 653 |
+
"\n",
|
| 654 |
+
"You can also choose to omit any separator and just rawdog it with the prompt \"photo of a #prefix#banana\" , however know that this may , on occasion , be tokenized as completely different tokens of lower ID:s.\n",
|
| 655 |
+
"\n",
|
| 656 |
+
"Curiously , common NSFW terms found online have in the CLIP model have been purposefully fragmented into separate #prefix# and #suffix# counterparts in the vocab.json. Likely for PR-reasons.\n",
|
| 657 |
+
"\n",
|
| 658 |
+
"You can verify the results using this online tokenizer: https://sd-tokenizer.rocker.boo/\n",
|
| 659 |
+
"\n",
|
| 660 |
+
"<div>\n",
|
| 661 |
+
"<img src=\"https://lemmy.world/pictrs/image/43467d75-7406-4a13-93ca-cdc469f944fc.jpeg\" width=\"300\"/>\n",
|
| 662 |
+
"<img src=\"https://lemmy.world/pictrs/image/c0411565-0cb3-47b1-a788-b368924d6f17.jpeg\" width=\"300\"/>\n",
|
| 663 |
+
"<img src=\"https://lemmy.world/pictrs/image/c27c6550-a88b-4543-9bd7-067dff016be2.jpeg\" width=\"300\"/>\n",
|
| 664 |
+
"</div>\n",
|
| 665 |
+
"\n",
|
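You can also check the prefix/suffix split locally with the same tokenizer the notebook loads; "neo" is just an example prefix word here, not a recommendation:

```python
# Sketch: inspect how "#prefix#-#suffix#"-style prompts split into prefix and suffix tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for text in ["photo of a neo-banana", "photo of a neoAbanana", "photo of a neobanana"]:
    print(text, '=>', tokenizer.tokenize(text))
# Entries ending in "</w>" are suffix tokens (followed by whitespace); the rest are prefix tokens.
```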
| 666 |
+
"# What is that gibberish tokens that show up?\n",
|
| 667 |
+
"\n",
|
| 668 |
+
"The gibberish tokens like \"ðŁĺħ\\</w>\" are actually emojis!\n",
|
| 669 |
+
"\n",
|
| 670 |
+
"Try writing some emojis in this online tokenizer to see the results: https://sd-tokenizer.rocker.boo/\n",
|
| 671 |
+
"\n",
|
| 672 |
+
"It is a bit borked as it can't process capital letters properly.\n",
|
| 673 |
+
"\n",
|
| 674 |
+
"Also note that this is not reversible.\n",
|
| 675 |
+
"\n",
|
| 676 |
+
"If tokenization \"😅\" => ðŁĺħ</w>\n",
|
| 677 |
+
"\n",
|
| 678 |
+
"Then you can't prompt \"ðŁĺħ\" and expect to get the same result as the tokenized original emoji , \"😅\".\n",
|
| 679 |
+
"\n",
|
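You can see the emoji-to-"gibberish" mapping directly with the tokenizer; the strings it prints are exactly the byte-level entries stored in vocab.json:

```python
# Sketch: emojis are stored as byte-level BPE strings, which is why they look like gibberish.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

ids = tokenizer(text="😅")["input_ids"]        # [49406, ..., 49407]
print(tokenizer.convert_ids_to_tokens(ids))     # the 'ðŁĺħ</w>'-style vocab entries
```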
| 680 |
+
"SD 1.5 models actually have training for Emojis.\n",
|
| 681 |
+
"\n",
|
| 682 |
+
"But you have to set CLIP skip to 1 for this to work is intended.\n",
|
| 683 |
+
"\n",
|
| 684 |
+
"For example, this is the result from \"photo of a 🧔🏻♂️\"\n",
|
| 685 |
+
"\n",
|
| 686 |
+
"\n",
|
| 687 |
+
"<div>\n",
|
| 688 |
+
"<img src=\"https://lemmy.world/pictrs/image/e2b51aea-6960-4ad0-867e-8ce85f2bd51e.jpeg\" width=\"300\"/>\n",
|
| 689 |
+
"</div>\n",
|
| 690 |
+
"\n",
|
| 691 |
+
"A tutorial on stuff you can do with the vocab.list concluded.\n",
|
| 692 |
+
"\n",
|
| 693 |
+
"Anyways, have fun with the notebook.\n",
|
| 694 |
+
"\n",
|
| 695 |
+
"There might be some updates in the future with features not mentioned here.\n",
|
| 696 |
"\n",
|
| 697 |
+
"//---//"
|
| 698 |
],
|
| 699 |
"metadata": {
|
| 700 |
"id": "njeJx_nSSA8H"
|