FoodDesert committed
Commit 0e02b5f
1 Parent(s): 38b3693

Upload 5 files
README.md CHANGED
@@ -7,49 +7,6 @@ sdk: gradio
 sdk_version: 4.19.1
 app_file: app.py
 pinned: false
-tags:
-- not-for-all-audience
 ---
 
-
-## Frequently Asked Questions (FAQs)
-
-Technically I am writing this before anyone but me has used the tool, so no one has asked questions yet. But if they did, here are the questions I think they might ask:
-
-### Why is this space tagged "not-for-all-audience"?
-
-The "not-for-all-audience" tag informs users that this tool's text output is derived from e621.net data for tag prediction and completion. This measure underscores a commitment to responsible content sharing.
-
-### Does input order matter?
-
-No.
-
-### Should I use underscores in the input tags?
-
-It doesn't matter. The application handles tags either way.
-
-### Why are some valid tags marked as "unseen", and why don't some artists ever get returned?
-
-Some data is excluded from consideration if it did not occur frequently enough in the sample from which the application makes its calculations.
-If an artist or tag is too infrequent, we may not have enough data to make reliable predictions about it.
-
-### Are there any special tags?
-
-Yes. We normalized the favorite counts of each image to a range of 0-9, with 0 being the lowest favcount and 9 being the highest.
-You can include any of these special tags: "score:0", "score:1", "score:2", "score:3", "score:4", "score:5", "score:6", "score:7", "score:8", "score:9"
-in your list to bias the output toward artists with higher- or lower-scoring images.
-
-### Are there any other special tricks?
-
-Yes. If you want to bias the artist output more strongly toward a specific tag, you can simply list it multiple times.
-For example, the query "red fox, red fox, red fox, score:7" will yield a list of artists who are more strongly associated with the tag "red fox"
-than the query "red fox, score:7".
-
-### What calculation is this thing actually performing?
-
-Each artist is represented by a "pseudo-document" composed of all the tags from their uploaded images, treating these tags like words in a text document.
-Similarly, when you input a set of tags, the system creates a pseudo-document for your query out of all the tags.
-It then uses a technique called cosine similarity to compare your tags against each artist's collection, essentially finding which artist's tags are most "similar" to yours.
-This method helps identify artists whose work is closely aligned with the themes or elements you're interested in.
-For those curious about the underlying mechanics of comparing text-like data, we employ the TF-IDF (Term Frequency-Inverse Document Frequency) method, a standard approach in information retrieval.
-You can read more about TF-IDF on its [Wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
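The removed FAQ describes the core calculation: each artist's tags form a TF-IDF "pseudo-document", the query tags form another, and cosine similarity ranks artists. It also notes that repeating a tag biases the result toward it. A minimal sketch with toy, made-up artist documents (the names and tags are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy pseudo-documents: one string of tags per hypothetical artist
artist_docs = ["red fox forest canine", "dragon cave hoard treasure"]
vectorizer = TfidfVectorizer()
X_artist = vectorizer.fit_transform(artist_docs)

def rank(query_tags):
    # Build a query pseudo-document and score it against every artist
    X_query = vectorizer.transform([query_tags])
    return cosine_similarity(X_query, X_artist)[0]

once = rank("red fox treasure")
thrice = rank("red fox red fox red fox treasure")
# Repeating "red fox" raises its term frequency in the query vector,
# pulling the similarity score further toward the "red fox" artist.
print(once[0], thrice[0])
```

This is why `"red fox, red fox, red fox, score:7"` leans harder toward red-fox artists than `"red fox, score:7"`: term frequency is half of TF-IDF.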
app.py CHANGED
@@ -4,6 +4,11 @@ import numpy as np
 from joblib import load
 import h5py
 from io import BytesIO
+import csv
+import re
+import random
+import compress_fasttext
+from collections import OrderedDict
 
 
 faq_content="""
@@ -59,13 +64,71 @@ with h5py.File('complete_artist_data.hdf5', 'r') as f:
 
     # Load artist names and decode to strings
     artist_names = [name.decode() for name in f['artist_names'][:]]
+
+def clean_tag(tag):
+    return ''.join(char for char in tag if ord(char) < 128)
+
+# Normally returns tag -> aliases; with reverse=True, returns alias -> tags
+def build_aliases_dict(filename, reverse=False):
+    aliases_dict = {}
+    with open(filename, 'r', newline='', encoding='utf-8') as csvfile:
+        reader = csv.reader(csvfile)
+        for row in reader:
+            tag = clean_tag(row[0])
+            alias_list = [] if row[3] == "null" else [clean_tag(alias) for alias in row[3].split(',')]
+            if reverse:
+                for alias in alias_list:
+                    aliases_dict.setdefault(alias, []).append(tag)
+            else:
+                aliases_dict[tag] = alias_list
+    return aliases_dict
+
+
+def find_similar_tags(test_tags):
+    # Lazily load the fastText model and alias dictionaries on first call,
+    # caching them as attributes on the function itself
+    if not hasattr(find_similar_tags, "fasttext_small_model"):
+        find_similar_tags.fasttext_small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('e621FastTextModel010Replacement_small.bin')
+    tag_aliases_file = 'fluffyrock_3m.csv'
+    if not hasattr(find_similar_tags, "tag2aliases"):
+        find_similar_tags.tag2aliases = build_aliases_dict(tag_aliases_file)
+    if not hasattr(find_similar_tags, "alias2tags"):
+        find_similar_tags.alias2tags = build_aliases_dict(tag_aliases_file, reverse=True)
+
+    # Find similar tags and prepare data for the dataframe
+    results_data = []
+    for tag in test_tags:
+        similar_words = find_similar_tags.fasttext_small_model.most_similar(tag)
+        result, seen = [], set()
+        if tag in find_similar_tags.tag2aliases:
+            result.append((tag, 1))
+            seen.add(tag)
+        else:
+            for similar_word, similarity in similar_words:
+                if similar_word not in seen:
+                    if similar_word in find_similar_tags.tag2aliases:
+                        result.append((similar_word.replace('_', ' '), round(similarity, 3)))
+                        seen.add(similar_word)
+                    else:
+                        for similar_tag in find_similar_tags.alias2tags.get(similar_word, []):
+                            if similar_tag not in seen:
+                                result.append((similar_tag.replace('_', ' '), round(similarity, 3)))
+                                seen.add(similar_tag)
+        # Append the input tag and each similar tag found for it
+        for word, sim in result:
+            results_data.append([tag, word, sim])
+
+    return results_data  # List of lists for the Dataframe output
 
 def find_similar_artists(new_tags_string, top_n):
-    #
     new_image_tags = [tag.replace('_', ' ').strip() for tag in new_tags_string.split(",")]
-    unseen_tags = set(new_image_tags) - set(vectorizer.vocabulary_.keys())
-    unseen_tags_str = f'Unseen Tags: {", ".join(unseen_tags)}' if unseen_tags else 'No unseen tags.'
-
+    unseen_tags = list(set(OrderedDict.fromkeys(new_image_tags)) - set(vectorizer.vocabulary_.keys()))
+    unseen_tags_data = find_similar_tags(unseen_tags) if unseen_tags else [["No unseen tags", "", ""]]
+
     X_new_image = vectorizer.transform([','.join(new_image_tags)])
     similarities = cosine_similarity(X_new_image, X_artist)[0]
@@ -75,7 +138,8 @@ def find_similar_artists(new_tags_string, top_n):
     top_artists_str = "\n".join([f"{rank+1}. {artist[3:]} ({score:.4f})" for rank, (artist, score) in enumerate(top_artists)])
     dynamic_prompts_formatted_artists = "{" + "|".join([artist for artist, _ in top_artists]) + "}"
 
-    return unseen_tags_str, top_artists_str, dynamic_prompts_formatted_artists
+    return unseen_tags_data, top_artists_str, dynamic_prompts_formatted_artists
+
 
 iface = gr.Interface(
     fn=find_similar_artists,
@@ -84,7 +148,7 @@ iface = gr.Interface(
         gr.Slider(minimum=1, maximum=100, value=10, step=1, label="Number of artists")
     ],
     outputs=[
-        gr.Textbox(label="Unseen Tags", info="These tags are not used in the artist calculation. Even valid e6 tags may be \"unseen\" if they have insufficient data."),
+        gr.Dataframe(label="Unseen Tags", headers=["Tag", "Similar Tags"]),
         gr.Textbox(label="Top Artists", info="These are the artists most strongly associated with your tags. The number in parentheses is a similarity score between 0 and 1, with higher numbers indicating greater similarity."),
         gr.Textbox(label="Dynamic Prompts Format", info="For if you're using the Automatic1111 webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui) with the Dynamic Prompts extension activated (https://github.com/adieyal/sd-dynamic-prompts) and want to try them all individually.")
     ],
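The `reverse=True` branch of the new `build_aliases_dict` inverts the tag-to-aliases mapping so each alias points back to every canonical tag that lists it. The inversion on its own, with toy data instead of the CSV, looks like this:

```python
def invert_aliases(tag2aliases):
    # Same setdefault pattern app.py uses: an alias shared by several
    # canonical tags accumulates all of them in its list.
    alias2tags = {}
    for tag, aliases in tag2aliases.items():
        for alias in aliases:
            alias2tags.setdefault(alias, []).append(tag)
    return alias2tags

tag2aliases = {"domestic cat": ["cat", "kitty"], "feral cat": ["cat"]}
print(invert_aliases(tag2aliases))
# {'cat': ['domestic cat', 'feral cat'], 'kitty': ['domestic cat']}
```

This is why `find_similar_tags` can map a fastText neighbor that happens to be an alias back to one or more real tags via `alias2tags.get(similar_word, [])`.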
e621FastTextModel010Replacement_small.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a9ade94b75665a92776b73d4bb8871deca566b1b24a0866c0b1d2c56fa7ce68e
+size 15782079
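This model file is loaded lazily by `find_similar_tags`, which caches the loaded object as an attribute on the function so the ~15 MB load happens at most once. A minimal sketch of that load-once idiom (the dict stands in for the real fastText load):

```python
def get_tag_model():
    # Cache the expensive-to-load object on the function itself so
    # repeated calls reuse it instead of reloading from disk.
    if not hasattr(get_tag_model, "model"):
        get_tag_model.model = {"loaded": True}  # stand-in for the real model load
    return get_tag_model.model

first = get_tag_model()
second = get_tag_model()
print(first is second)  # same cached object on every call
```

Module-level globals or `functools.lru_cache` would work equally well; the function-attribute form just keeps the cache next to its only consumer.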
fluffyrock_3m.csv ADDED
The diff for this file is too large to render. See raw diff
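`build_aliases_dict` in app.py reads this file with `csv.reader`, taking column 0 as the canonical tag and column 3 as either the string `"null"` or a comma-joined alias list (which must therefore be quoted in the CSV). The sample rows below are hypothetical, invented only to illustrate that assumed layout:

```python
import csv
import io

# Hypothetical rows mimicking the layout app.py assumes:
# col 0 = canonical tag, col 3 = "null" or a quoted, comma-joined alias list
sample = 'feline,0,12345,"cat,kitty"\nfox,0,999,null\n'

aliases = {}
for row in csv.reader(io.StringIO(sample)):
    aliases[row[0]] = [] if row[3] == "null" else row[3].split(',')
print(aliases)  # {'feline': ['cat', 'kitty'], 'fox': []}
```

The quoting matters: without it, `csv.reader` would split the alias list into separate columns and `row[3]` would hold only the first alias.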
 
requirements.txt CHANGED
@@ -3,4 +3,4 @@ numpy==1.25.1
 scikit-learn==1.2.2
 h5py==3.8.0
 joblib==1.2.0
-
+compress-fasttext