FoodDesert committed
Commit 0e02b5f
1 Parent(s): 38b3693

Upload 5 files
README.md CHANGED
@@ -7,49 +7,6 @@ sdk: gradio
 sdk_version: 4.19.1
 app_file: app.py
 pinned: false
-tags:
-- not-for-all-audience
 ---
 
-
-## Frequently Asked Questions (FAQs)
-
-Technically I am writing this before anyone but me has used the tool, so no one has asked questions yet. But if they did, here are the questions I think they might ask:
-
-### Why is this space tagged "not-for-all-audience"?
-
-The "not-for-all-audience" tag informs users that this tool's text output is derived from e621.net data for tag prediction and completion. This measure underscores a commitment to responsible content sharing.
-
-### Does input order matter?
-
-No.
-
-### Should I use underscores in the input tags?
-
-It doesn't matter. The application handles tags either way.
-
-### Why are some valid tags marked as "unseen", and why don't some artists ever get returned?
-
-Some data is excluded from consideration if it did not occur frequently enough in the sample from which the application makes its calculations.
-If an artist or tag is too infrequent, we may not have enough data to make reliable predictions about it.
-
-### Are there any special tags?
-
-Yes. We normalized the favorite counts of each image to a range of 0-9, with 0 being the lowest favcount and 9 being the highest.
-You can include any of these special tags: "score:0", "score:1", "score:2", "score:3", "score:4", "score:5", "score:6", "score:7", "score:8", "score:9"
-in your list to bias the output toward artists with higher- or lower-scoring images.
-
-### Are there any other special tricks?
-
-Yes. If you want to bias the artist output more strongly toward a specific tag, you can simply list it multiple times.
-For example, the query "red fox, red fox, red fox, score:7" will yield a list of artists who are more strongly associated with the tag "red fox"
-than the query "red fox, score:7".
-
-### What calculation is this thing actually performing?
-
-Each artist is represented by a "pseudo-document" composed of all the tags from their uploaded images, treating these tags like words in a text document.
-Similarly, when you input a set of tags, the system creates a pseudo-document for your query out of all the tags.
-It then uses a technique called cosine similarity to compare your tags against each artist's collection, essentially finding which artist's tags are most "similar" to yours.
-This method helps identify artists whose work is closely aligned with the themes or elements you're interested in.
-For those curious about the underlying mechanics of comparing text-like data, we employ the TF-IDF (Term Frequency-Inverse Document Frequency) method, a standard approach in information retrieval.
-You can read more about TF-IDF on its [Wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
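The removed FAQ describes the core calculation: each artist's tags form a TF-IDF "pseudo-document", the query tags form another, and cosine similarity ranks artists. It also notes that repeating a tag biases the result toward it. A minimal sketch with toy, made-up artist documents (the names and tags are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy pseudo-documents: one string of tags per hypothetical artist
artist_docs = ["red fox forest canine", "dragon cave hoard treasure"]
vectorizer = TfidfVectorizer()
X_artist = vectorizer.fit_transform(artist_docs)

def rank(query_tags):
    # Build a query pseudo-document and score it against every artist
    X_query = vectorizer.transform([query_tags])
    return cosine_similarity(X_query, X_artist)[0]

once = rank("red fox treasure")
thrice = rank("red fox red fox red fox treasure")
# Repeating "red fox" raises its term frequency in the query vector,
# pulling the similarity score further toward the "red fox" artist.
print(once[0], thrice[0])
```

This is why `"red fox, red fox, red fox, score:7"` leans harder toward red-fox artists than `"red fox, score:7"`: term frequency is half of TF-IDF.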
app.py CHANGED
@@ -4,6 +4,11 @@ import numpy as np
 from joblib import load
 import h5py
 from io import BytesIO
+import csv
+import re
+import random
+import compress_fasttext
+from collections import OrderedDict
 
 
 faq_content="""
@@ -59,13 +64,71 @@ with h5py.File('complete_artist_data.hdf5', 'r') as f:
 
     # Load artist names and decode to strings
     artist_names = [name.decode() for name in f['artist_names'][:]]
+
+def clean_tag(tag):
+    return ''.join(char for char in tag if ord(char) < 128)
+
+# Normally returns tag -> aliases; with reverse=True, returns alias -> tags
+def build_aliases_dict(filename, reverse=False):
+    aliases_dict = {}
+    with open(filename, 'r', newline='', encoding='utf-8') as csvfile:
+        reader = csv.reader(csvfile)
+        for row in reader:
+            tag = clean_tag(row[0])
+            alias_list = [] if row[3] == "null" else [clean_tag(alias) for alias in row[3].split(',')]
+            if reverse:
+                for alias in alias_list:
+                    aliases_dict.setdefault(alias, []).append(tag)
+            else:
+                aliases_dict[tag] = alias_list
+    return aliases_dict
+
+
+def find_similar_tags(test_tags):
+    # Lazily load the fastText model and alias dictionaries on first call,
+    # caching them as attributes on the function itself
+    if not hasattr(find_similar_tags, "fasttext_small_model"):
+        find_similar_tags.fasttext_small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('e621FastTextModel010Replacement_small.bin')
+    tag_aliases_file = 'fluffyrock_3m.csv'
+    if not hasattr(find_similar_tags, "tag2aliases"):
+        find_similar_tags.tag2aliases = build_aliases_dict(tag_aliases_file)
+    if not hasattr(find_similar_tags, "alias2tags"):
+        find_similar_tags.alias2tags = build_aliases_dict(tag_aliases_file, reverse=True)
+
+    # Find similar tags and prepare data for the dataframe
+    results_data = []
+    for tag in test_tags:
+        similar_words = find_similar_tags.fasttext_small_model.most_similar(tag)
+        result, seen = [], set()
+        if tag in find_similar_tags.tag2aliases:
+            result.append((tag, 1))
+            seen.add(tag)
+        else:
+            for similar_word, similarity in similar_words:
+                if similar_word not in seen:
+                    if similar_word in find_similar_tags.tag2aliases:
+                        result.append((similar_word.replace('_', ' '), round(similarity, 3)))
+                        seen.add(similar_word)
+                    else:
+                        for similar_tag in find_similar_tags.alias2tags.get(similar_word, []):
+                            if similar_tag not in seen:
+                                result.append((similar_tag.replace('_', ' '), round(similarity, 3)))
+                                seen.add(similar_tag)
+        # Append the input tag and each similar tag found for it
+        for word, sim in result:
+            results_data.append([tag, word, sim])
+
+    return results_data  # List of lists for the Dataframe output
 
 def find_similar_artists(new_tags_string, top_n):
-    #
     new_image_tags = [tag.replace('_', ' ').strip() for tag in new_tags_string.split(",")]
-    unseen_tags = set(new_image_tags) - set(vectorizer.vocabulary_.keys())
-    unseen_tags_str = f'Unseen Tags: {", ".join(unseen_tags)}' if unseen_tags else 'No unseen tags.'
-
+    unseen_tags = list(set(OrderedDict.fromkeys(new_image_tags)) - set(vectorizer.vocabulary_.keys()))
+    unseen_tags_data = find_similar_tags(unseen_tags) if unseen_tags else [["No unseen tags", "", ""]]
+
     X_new_image = vectorizer.transform([','.join(new_image_tags)])
     similarities = cosine_similarity(X_new_image, X_artist)[0]
@@ -75,7 +138,8 @@ def find_similar_artists(new_tags_string, top_n):
     top_artists_str = "\n".join([f"{rank+1}. {artist[3:]} ({score:.4f})" for rank, (artist, score) in enumerate(top_artists)])
     dynamic_prompts_formatted_artists = "{" + "|".join([artist for artist, _ in top_artists]) + "}"
 
-    return unseen_tags_str, top_artists_str, dynamic_prompts_formatted_artists
+    return unseen_tags_data, top_artists_str, dynamic_prompts_formatted_artists
+
 
 iface = gr.Interface(
     fn=find_similar_artists,
@@ -84,7 +148,7 @@ iface = gr.Interface(
         gr.Slider(minimum=1, maximum=100, value=10, step=1, label="Number of artists")
     ],
     outputs=[
-        gr.Textbox(label="Unseen Tags", info="These tags are not used in the artist calculation. Even valid e6 tags may be \"unseen\" if they have insufficient data."),
+        gr.Dataframe(label="Unseen Tags", headers=["Tag", "Similar Tags"]),
         gr.Textbox(label="Top Artists", info="These are the artists most strongly associated with your tags. The number in parentheses is a similarity score between 0 and 1, with higher numbers indicating greater similarity."),
         gr.Textbox(label="Dynamic Prompts Format", info="For if you're using the Automatic1111 webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui) with the Dynamic Prompts extension activated (https://github.com/adieyal/sd-dynamic-prompts) and want to try them all individually.")
     ],
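The `reverse=True` branch of the new `build_aliases_dict` inverts the tag-to-aliases mapping so each alias points back to every canonical tag that lists it. The inversion on its own, with toy data instead of the CSV, looks like this:

```python
def invert_aliases(tag2aliases):
    # Same setdefault pattern app.py uses: an alias shared by several
    # canonical tags accumulates all of them in its list.
    alias2tags = {}
    for tag, aliases in tag2aliases.items():
        for alias in aliases:
            alias2tags.setdefault(alias, []).append(tag)
    return alias2tags

tag2aliases = {"domestic cat": ["cat", "kitty"], "feral cat": ["cat"]}
print(invert_aliases(tag2aliases))
# {'cat': ['domestic cat', 'feral cat'], 'kitty': ['domestic cat']}
```

This is why `find_similar_tags` can map a fastText neighbor that happens to be an alias back to one or more real tags via `alias2tags.get(similar_word, [])`.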
e621FastTextModel010Replacement_small.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a9ade94b75665a92776b73d4bb8871deca566b1b24a0866c0b1d2c56fa7ce68e
+size 15782079
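This model file is loaded lazily by `find_similar_tags`, which caches the loaded object as an attribute on the function so the ~15 MB load happens at most once. A minimal sketch of that load-once idiom (the dict stands in for the real fastText load):

```python
def get_tag_model():
    # Cache the expensive-to-load object on the function itself so
    # repeated calls reuse it instead of reloading from disk.
    if not hasattr(get_tag_model, "model"):
        get_tag_model.model = {"loaded": True}  # stand-in for the real model load
    return get_tag_model.model

first = get_tag_model()
second = get_tag_model()
print(first is second)  # same cached object on every call
```

Module-level globals or `functools.lru_cache` would work equally well; the function-attribute form just keeps the cache next to its only consumer.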
fluffyrock_3m.csv ADDED
The diff for this file is too large to render. See raw diff
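`build_aliases_dict` in app.py reads this file with `csv.reader`, taking column 0 as the canonical tag and column 3 as either the string `"null"` or a comma-joined alias list (which must therefore be quoted in the CSV). The sample rows below are hypothetical, invented only to illustrate that assumed layout:

```python
import csv
import io

# Hypothetical rows mimicking the layout app.py assumes:
# col 0 = canonical tag, col 3 = "null" or a quoted, comma-joined alias list
sample = 'feline,0,12345,"cat,kitty"\nfox,0,999,null\n'

aliases = {}
for row in csv.reader(io.StringIO(sample)):
    aliases[row[0]] = [] if row[3] == "null" else row[3].split(',')
print(aliases)  # {'feline': ['cat', 'kitty'], 'fox': []}
```

The quoting matters: without it, `csv.reader` would split the alias list into separate columns and `row[3]` would hold only the first alias.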
 
requirements.txt CHANGED
@@ -3,4 +3,4 @@ numpy==1.25.1
 scikit-learn==1.2.2
 h5py==3.8.0
 joblib==1.2.0
-
+compress-fasttext