Specimen5423 committed on
Commit 8ed5a9b · 1 Parent(s): 422c9a7

Add initial data and notebooks

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ posts_by_tag.feather filter=lfs diff=lfs merge=lfs -text
+ tags_by_post.feather filter=lfs diff=lfs merge=lfs -text
+ tags.feather filter=lfs diff=lfs merge=lfs -text
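These gitattributes patterns route matching files through the git-lfs filter instead of storing them directly in the repo. As a rough illustration (Python's `fnmatch` is only an approximation of git's pattern matcher, and the file list here is hypothetical), you can preview which files the new patterns would catch:

```python
from fnmatch import fnmatch

# The three patterns added in this commit.
patterns = ["posts_by_tag.feather", "tags_by_post.feather", "tags.feather"]

# Hypothetical working-tree files, for illustration only.
for path in ["tags.feather", "tags.csv", "posts_by_tag.feather"]:
    routed = any(fnmatch(path, pattern) for pattern in patterns)
    print(f"{path}: {'LFS' if routed else 'plain git'}")
```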
Building Data e6.ipynb ADDED
@@ -0,0 +1,276 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Preprocesses e621's [data exports](https://e621.net/db_export/) and stores them in feather files. The feather format was chosen because it loads quickly!\n",
+ "\n",
+ "Usage notes:\n",
+ "* Feel free to change `INPUT_FOLDER` and `OUTPUT_FOLDER` to anywhere you want to store your data.\n",
+ "* `DATE` is whatever date is on your input files.\n",
+ "* Files will only be generated if they don't already exist. Delete them if you want to regenerate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas\n",
+ "import os\n",
+ "from tqdm.notebook import tqdm\n",
+ "tqdm.pandas()\n",
+ "\n",
+ "INPUT_FOLDER = \"H:/Data/TagSuggest/e621_metadata\"\n",
+ "OUTPUT_FOLDER = \"H:/Data/TagSuggest/e621_dataframes\"\n",
+ "DATE = \"2023-08-23\"\n",
+ "\n",
+ "os.makedirs(OUTPUT_FOLDER, exist_ok=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The first thing to process is the tags themselves, since we'll be using their IDs:\n",
+ "* `tag_id` - An arbitrary number from e621's database. Very useful.\n",
+ "* `name` - The tag!\n",
+ "* `category` - A number saying whether the tag is an artist, species, and so on. Constants for these are defined elsewhere; this notebook doesn't need to know them.\n",
+ "* `post_count` - The approximate number of posts the tag has. It's not perfectly aligned with the actual post data, but it's close enough for most purposes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tags_file = f\"{OUTPUT_FOLDER}/tags.feather\"\n",
+ "if os.path.exists(tags_file):\n",
+ "    tags = pandas.read_feather(tags_file)\n",
+ "else:\n",
+ "    tags = (\n",
+ "        pandas.read_csv(f\"{INPUT_FOLDER}/tags-{DATE}.csv.gz\", na_values=[], keep_default_na=False)\n",
+ "        .astype({\"name\": \"string\"})\n",
+ "        .rename(columns={\"id\": \"tag_id\"})\n",
+ "        .reset_index(drop=True)\n",
+ "    )\n",
+ "    tags.to_feather(tags_file)\n",
+ "tags.info()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tags_by_name = tags.copy(deep=True)\n",
+ "tags_by_name.set_index(\"name\", inplace=True)\n",
+ "tags_by_name.info()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This part takes a couple of minutes! There are about 4 million posts to go through, and each one lists its tags as a single string, so the tags have to be parsed and translated to IDs for more compact storage. The progress bar total is hard-coded at exactly four million posts, which is a slight undercount by now, but it's not worth actually counting the lines. Two dataframes are generated:\n",
+ "\n",
+ "* The posts file contains most of the post data.\n",
+ "  * `post_id` - From e621. Used for linking to the other dataframe.\n",
+ "  * `rating` - Whether the post is safe, questionable, or explicit. Handy if you want to generate SFW wildcards.\n",
+ "  * `score` - The overall user score of the post, if you're curious. Score doesn't necessarily correlate with aesthetic quality; posts can be highly upvoted because of their content or themes irrespective of their art style.\n",
+ "  * `up_score` - The upvote component of the score. Just guessing, but people probably upvote and downvote for totally different reasons, so it could be useful.\n",
+ "  * `down_score` - The downvote component of the score, as a negative number. If it's large, the post is probably an unpopular niche kink or a political meme or something.\n",
+ "* The post tags file stores the links between posts and tags as numbers. It's surprisingly large."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "post_tags_file = f\"{OUTPUT_FOLDER}/post_tags.feather\"\n",
+ "posts_file = f\"{OUTPUT_FOLDER}/posts.feather\"\n",
+ "if os.path.exists(post_tags_file) and os.path.exists(posts_file):\n",
+ "    post_tags = pandas.read_feather(post_tags_file)\n",
+ "    posts = pandas.read_feather(posts_file)\n",
+ "else:\n",
+ "    post_tags_parts = []\n",
+ "    posts_parts = []\n",
+ "    with pandas.read_csv(f\"{INPUT_FOLDER}/posts-{DATE}.csv.gz\", usecols=[\"id\", \"tag_string\", \"is_deleted\", \"is_pending\", \"rating\", \"score\", \"up_score\", \"down_score\"], chunksize=100_000) as reader:\n",
+ "        progress = tqdm(total=4_000_000)\n",
+ "        for posts in reader:\n",
+ "            post_count = len(posts)\n",
+ "            posts: pandas.DataFrame\n",
+ "            posts = posts[posts[\"is_deleted\"] == \"f\"]\n",
+ "            posts = posts[posts[\"is_pending\"] == \"f\"]\n",
+ "            posts = posts.rename(columns={\"id\": \"post_id\"})\n",
+ "            posts_parts.append(posts[[\"post_id\", \"rating\", \"score\", \"up_score\", \"down_score\"]].astype({\"rating\": \"string\"}))\n",
+ "            posts = posts[[\"post_id\", \"tag_string\"]].set_index(\"post_id\")\n",
+ "            posts = posts.apply(lambda x: x.str.split(' ')).explode(\"tag_string\")\n",
+ "            posts = posts.join(tags_by_name, on=\"tag_string\")[[\"tag_id\"]].reset_index()\n",
+ "            post_tags_parts.append(posts[[\"post_id\", \"tag_id\"]])\n",
+ "            progress.update(post_count)\n",
+ "    post_tags = pandas.concat(post_tags_parts)\n",
+ "    post_tags.reset_index(drop=True, inplace=True)\n",
+ "    post_tags.to_feather(post_tags_file)\n",
+ "    posts = pandas.concat(posts_parts)\n",
+ "    posts.reset_index(drop=True, inplace=True)\n",
+ "    posts.to_feather(posts_file)\n",
+ "print(\"\\npost_tags\")\n",
+ "post_tags.info()\n",
+ "print(\"\\nposts\")\n",
+ "posts.info()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We also generate and store two different views of the `post_tags` frame, because it's a lot faster to cache them once than to join a many-to-many frame of that size for every single query. This can also take a few minutes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "posts_by_tag_file = f\"{OUTPUT_FOLDER}/posts_by_tag.feather\"\n",
+ "if os.path.exists(posts_by_tag_file):\n",
+ "    posts_by_tag = pandas.read_feather(posts_by_tag_file)\n",
+ "else:\n",
+ "    posts_by_tag = post_tags.groupby(\"tag_id\").progress_aggregate(list)\n",
+ "    posts_by_tag.reset_index(inplace=True)\n",
+ "    posts_by_tag.to_feather(posts_by_tag_file)\n",
+ "posts_by_tag.info()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tags_by_post_file = f\"{OUTPUT_FOLDER}/tags_by_post.feather\"\n",
+ "if os.path.exists(tags_by_post_file):\n",
+ "    tags_by_post = pandas.read_feather(tags_by_post_file)\n",
+ "else:\n",
+ "    tags_by_post = post_tags.groupby(\"post_id\").progress_aggregate(list)\n",
+ "    tags_by_post.reset_index(inplace=True)\n",
+ "    tags_by_post.to_feather(tags_by_post_file)\n",
+ "tags_by_post.info()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Also make an SFW post tags list, then use it to build a list of tags that only appear in SFW posts. Optional."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "safe_posts_by_tag_file = f\"{OUTPUT_FOLDER}/safe_posts_by_tag.feather\"\n",
+ "if os.path.exists(safe_posts_by_tag_file):\n",
+ "    safe_posts_by_tag = pandas.read_feather(safe_posts_by_tag_file)\n",
+ "else:\n",
+ "    safe_posts_by_tag = post_tags.set_index(\"post_id\").join(posts.set_index(\"post_id\"))\n",
+ "    safe_posts_by_tag = safe_posts_by_tag[safe_posts_by_tag[\"rating\"].isin([\"s\"])].reset_index()\n",
+ "    safe_posts_by_tag = safe_posts_by_tag[[\"tag_id\", \"post_id\"]].groupby(\"tag_id\").progress_aggregate(list)\n",
+ "    safe_posts_by_tag.reset_index(inplace=True)\n",
+ "    safe_posts_by_tag.to_feather(safe_posts_by_tag_file)\n",
+ "safe_posts_by_tag.info()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "safe_tags_by_post_file = f\"{OUTPUT_FOLDER}/safe_tags_by_post.feather\"\n",
+ "if os.path.exists(safe_tags_by_post_file):\n",
+ "    safe_tags_by_post = pandas.read_feather(safe_tags_by_post_file)\n",
+ "else:\n",
+ "    safe_tags_by_post = post_tags.set_index(\"post_id\").join(posts.set_index(\"post_id\"))\n",
+ "    safe_tags_by_post = safe_tags_by_post[safe_tags_by_post[\"rating\"].isin([\"s\"])].reset_index()\n",
+ "    safe_tags_by_post = safe_tags_by_post[[\"tag_id\", \"post_id\"]].groupby(\"post_id\").progress_aggregate(list)\n",
+ "    safe_tags_by_post.reset_index(inplace=True)\n",
+ "    safe_tags_by_post.to_feather(safe_tags_by_post_file)\n",
+ "safe_tags_by_post.info()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "safe_tags_file = f\"{OUTPUT_FOLDER}/safe_tags.feather\"\n",
+ "if os.path.exists(safe_tags_file):\n",
+ "    safe_tags = pandas.read_feather(safe_tags_file)\n",
+ "else:\n",
+ "    safe_tags = safe_posts_by_tag.set_index(\"tag_id\").join(tags.set_index(\"tag_id\"), how=\"inner\")\n",
+ "    safe_tags[\"post_count\"] = safe_tags[\"post_id\"].apply(len)\n",
+ "    safe_tags = safe_tags[[\"name\", \"category\", \"post_count\"]]\n",
+ "    safe_tags.reset_index(inplace=True)\n",
+ "    safe_tags.to_feather(safe_tags_file)\n",
+ "safe_tags.info()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And lastly, parse and store the implications file. It's useful for filtering out tag suggestions that are implied by higher-scoring ones, and for building the species hierarchy."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "implications_file = f\"{OUTPUT_FOLDER}/implications.feather\"\n",
+ "if os.path.exists(implications_file):\n",
+ "    implications = pandas.read_feather(implications_file)\n",
+ "else:\n",
+ "    implications = (\n",
+ "        pandas.read_csv(f\"{INPUT_FOLDER}/tag_implications-{DATE}.csv.gz\")\n",
+ "        .join(tags_by_name, on=\"antecedent_name\", how=\"inner\")\n",
+ "        .join(tags_by_name, on=\"consequent_name\", rsuffix=\"_con\")\n",
+ "        [[\"tag_id\", \"tag_id_con\"]]\n",
+ "        .rename(columns={\"tag_id\": \"antecedent_id\", \"tag_id_con\": \"consequent_id\"})\n",
+ "    )\n",
+ "    implications.reset_index(inplace=True, drop=True)\n",
+ "    implications.to_feather(implications_file)\n",
+ "implications.info()"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
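For a quick sanity check of how the generated frames link together, here's a minimal sketch (assuming the notebook above has already written its files, and using the same `OUTPUT_FOLDER` layout) that translates the tag IDs stored in `tags_by_post` back into names through `tags`:

```python
import pandas

OUTPUT_FOLDER = "H:/Data/TagSuggest/e621_dataframes"  # same folder the notebook writes to

# tags maps tag_id -> name/category/post_count; tags_by_post maps post_id -> list of tag_ids.
tags = pandas.read_feather(f"{OUTPUT_FOLDER}/tags.feather").set_index("tag_id")
tags_by_post = pandas.read_feather(f"{OUTPUT_FOLDER}/tags_by_post.feather").set_index("post_id")

# Pick an arbitrary post and recover its original tag names from the stored IDs.
post_id = tags_by_post.index[0]
tag_ids = tags_by_post.loc[post_id, "tag_id"]
print(post_id, list(tags.loc[tag_ids, "name"]))
```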
Querying Data.ipynb ADDED
@@ -0,0 +1,464 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is the fun part. After all the definitions are done, a few methods are ready to play with.\n",
+ "* `related_tags` takes the set of posts that have ALL the provided tags and figures out which tags are most strongly correlated with the presence of all the tags together.\n",
+ "* `big_list_suggestions` runs the same query for each tag individually and combines the results in an interesting (if probably statistically unsound) way. Since it runs each tag as a separate query, `big_list_suggestions` works for tag combinations that don't exist in the dataset.\n",
+ "* `test_artist_prompt` is a neat experiment that's described later.\n",
+ "\n",
+ "Most arguments are the same for `related_tags` and `big_list_suggestions`. Everything except the targets is optional.\n",
+ "* `targets` - A variable-length list of tags. Spaces and backslashed parens are also translated if you copy a prompt in as a single string, but each tag still has to be comma-separated. Anything that doesn't parse as a known tag is thrown out.\n",
+ "* `exclude` - Specific to `related_tags`. Given a list of tag names, excludes posts that contain those tags.\n",
+ "* `minus` - The `big_list_suggestions` version. Subtracts the correlations of these tags after processing the positive ones.\n",
+ "* `category` - One of the `CAT_*` constants, to filter the results down to specific types of tags, for example `CAT_ARTIST` to get a list of artists correlated with a tag.\n",
+ "* `samples` - When querying big tags, limit the number of posts. The default is 100,000, which is high enough to leave very little randomness in the results and low enough to be relatively fast.\n",
+ "* `min_posts` - Don't show tags with fewer than this many total posts. Default is 20.\n",
+ "* `min_overlap` - Don't show tags with fewer than this many posts overlapping the targets. Specific to `related_tags`. Default is 5.\n",
+ "* `top` - Show this many tags with the highest correlation from the result set. Default is 30.\n",
+ "* `bottom` - Just for fun, show this many tags with the lowest correlation. Default is 0.\n",
+ "* `exclude_implied` - Don't show tags that are directly implied by tags you already have (according to e621). Default is True, because implied tags make for boring, obvious answers; turn it off if you need to know extra tags to reinforce something."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas\n",
+ "import numpy\n",
+ "import pandas.io.formats.style\n",
+ "import random\n",
+ "import functools\n",
+ "from typing import Literal\n",
+ "\n",
+ "SOURCE: Literal[\"danbooru\", \"e621\"] = \"e621\"\n",
+ "DATA_FOLDER = \"H:/Data/TagSuggest/e621_dataframes\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Yes, I've done this for some scraped Danbooru data too. The results aren't half as interesting."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "if SOURCE == \"e621\":\n",
+ "    CAT_GENERAL = 0\n",
+ "    CAT_ARTIST = 1\n",
+ "    CAT_UNUSED = 2\n",
+ "    CAT_COPYRIGHT = 3\n",
+ "    CAT_CHARACTER = 4\n",
+ "    CAT_SPECIES = 5\n",
+ "    CAT_INVALID = 6\n",
+ "    CAT_META = 7\n",
+ "    CAT_LORE = 8\n",
+ "\n",
+ "    CATEGORY_COLORS = {\n",
+ "        CAT_GENERAL: \"#b4c7d9\",\n",
+ "        CAT_ARTIST: \"#f2ac08\",\n",
+ "        CAT_UNUSED: \"#ff3d3d\",\n",
+ "        CAT_COPYRIGHT: \"#d0d\",\n",
+ "        CAT_CHARACTER: \"#0a0\",\n",
+ "        CAT_SPECIES: \"#ed5d1f\",\n",
+ "        CAT_INVALID: \"#ff3d3d\",\n",
+ "        CAT_META: \"#fff\",\n",
+ "        CAT_LORE: \"#282\"\n",
+ "    }\n",
+ "elif SOURCE == \"danbooru\":\n",
+ "    CAT_GENERAL = 0\n",
+ "    CAT_ARTIST = 1\n",
+ "    CAT_UNUSED = 2\n",
+ "    CAT_COPYRIGHT = 3\n",
+ "    CAT_CHARACTER = 4\n",
+ "    CAT_META = 5\n",
+ "\n",
+ "    CATEGORY_COLORS = {\n",
+ "        CAT_GENERAL: \"#b4c7d9\",\n",
+ "        CAT_ARTIST: \"#f2ac08\",\n",
+ "        CAT_UNUSED: \"#ff3d3d\",\n",
+ "        CAT_COPYRIGHT: \"#d0d\",\n",
+ "        CAT_CHARACTER: \"#0a0\",\n",
+ "        CAT_META: \"#fff\",\n",
+ "    }"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Load everything and arrange the tags to be queried by name or id."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tags = pandas.read_feather(f\"{DATA_FOLDER}/tags.feather\")\n",
+ "posts_by_tag = pandas.read_feather(f\"{DATA_FOLDER}/posts_by_tag.feather\").set_index(\"tag_id\")\n",
+ "tags_by_post = pandas.read_feather(f\"{DATA_FOLDER}/tags_by_post.feather\").set_index(\"post_id\")\n",
+ "implications = pandas.read_feather(f\"{DATA_FOLDER}/implications.feather\")\n",
+ "tags_by_name = tags.copy(deep=True)\n",
+ "tags_by_name.set_index(\"name\", inplace=True)\n",
+ "tags.set_index(\"tag_id\", inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now define the functions themselves. I should document these later, but I want to get this out first."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "@functools.cache\n",
+ "def get_related_tags(targets: tuple[str, ...], exclude: tuple[str, ...] = (), samples: int = 100_000) -> pandas.DataFrame:\n",
+ "    these_tags = tags_by_name.loc[list(targets)]\n",
+ "    posts_with_these_tags = posts_by_tag.loc[these_tags[\"tag_id\"]].applymap(set).groupby(lambda x: True).agg(lambda x: set.intersection(*x))[\"post_id\"][True]\n",
+ "    if len(exclude) > 0:\n",
+ "        excluded_tags = tags_by_name.loc[list(exclude)]\n",
+ "        posts_with_excluded_tags = posts_by_tag.loc[excluded_tags[\"tag_id\"]].applymap(set).groupby(lambda x: True).agg(lambda x: set.union(*x))[\"post_id\"][True]\n",
+ "        posts_with_these_tags = posts_with_these_tags - posts_with_excluded_tags\n",
+ "    total_post_count_together = len(posts_with_these_tags)\n",
+ "    sample_posts = random.sample(list(posts_with_these_tags), samples) if total_post_count_together > samples else list(posts_with_these_tags)\n",
+ "    post_count_together = len(sample_posts)\n",
+ "    sample_ratio = post_count_together / total_post_count_together\n",
+ "    tags_in_these_posts = tags_by_post.loc[sample_posts]\n",
+ "    counts_in_these_posts = tags_in_these_posts[\"tag_id\"].explode().value_counts().rename(\"overlap\")\n",
+ "    summaries = pandas.DataFrame(counts_in_these_posts).join(tags[tags[\"post_count\"] > 0], how=\"right\").fillna(0)\n",
+ "    summaries[\"overlap\"] = numpy.minimum(summaries[\"overlap\"] / sample_ratio, summaries[\"post_count\"])\n",
+ "    summaries = summaries[[\"category\", \"name\", \"overlap\", \"post_count\"]]\n",
+ "    # Old \"interestingness\" value, didn't give as good results as an actual statistical technique, go figure. Code kept for curiosity's sake.\n",
+ "    #summaries[\"interestingness\"] = summaries[\"overlap\"].pow(2) / (total_post_count_together * summaries[\"post_count\"])\n",
+ "    # Phi coefficient stuff.\n",
+ "    n = float(len(tags_by_post))\n",
+ "    n11 = summaries[\"overlap\"]\n",
+ "    n1x = float(total_post_count_together)\n",
+ "    nx1 = summaries[\"post_count\"].astype(\"float64\")\n",
+ "    summaries[\"correlation\"] = (n * n11 - n1x * nx1) / numpy.sqrt(n1x * nx1 * (n - n1x) * (n - nx1))\n",
+ "    return summaries\n",
+ "\n",
+ "def format_tags(styler: pandas.io.formats.style.Styler):\n",
+ "    styler.apply(lambda row: numpy.where(row.index == \"name\", \"color:\" + CATEGORY_COLORS[row[\"category\"]], \"\"), axis=1)\n",
+ "    styler.hide(level=0)\n",
+ "    styler.hide(\"category\", axis=1)\n",
+ "    if 'overlap' in styler.data:\n",
+ "        styler.format(\"{:.0f}\".format, subset=[\"overlap\"])\n",
+ "    if 'correlation' in styler.data:\n",
+ "        styler.format(\"{:.2f}\".format, subset=[\"correlation\"])\n",
+ "        styler.background_gradient(vmin=-1.0, vmax=1.0, cmap=\"RdYlGn\", subset=[\"correlation\"])\n",
+ "    if 'score' in styler.data:\n",
+ "        styler.format(\"{:.2f}\".format, subset=[\"score\"])\n",
+ "        styler.background_gradient(vmin=-1.0, vmax=1.0, cmap=\"RdYlGn\", subset=[\"score\"])\n",
+ "    return styler\n",
+ "\n",
+ "def related_tags(*targets: str, exclude: tuple[str, ...] = (), category: int = None, samples: int = 100_000, min_overlap: int = 5, min_posts: int = 20, top: int = 30, bottom: int = 0) -> pandas.DataFrame:\n",
+ "    result = get_related_tags(targets, exclude=exclude, samples=samples)\n",
+ "    if category is not None:\n",
+ "        result = result[result[\"category\"] == category]\n",
+ "    result = result[~result[\"name\"].isin(targets)]\n",
+ "    result = result[result[\"overlap\"] >= min_overlap]\n",
+ "    result = result[result[\"post_count\"] >= min_posts]\n",
+ "    top_part = result.sort_values(\"correlation\", ascending=False)[:top]\n",
+ "    bottom_part = result.sort_values(\"correlation\", ascending=True)[:bottom].sort_values(\"correlation\", ascending=False)\n",
+ "    return pandas.concat([top_part, bottom_part]).style.pipe(format_tags)\n",
+ "\n",
+ "def implications_for(*subjects: str, seen: set[str] = None):\n",
+ "    if seen is None:\n",
+ "        seen = set()\n",
+ "    for subject in subjects:\n",
+ "        found = tags.loc[list(implications[implications[\"antecedent_id\"] == tags_by_name.loc[subject, \"tag_id\"]].loc[:, \"consequent_id\"]), \"name\"].values\n",
+ "        for f in found:\n",
+ "            if f in seen:\n",
+ "                pass\n",
+ "            else:\n",
+ "                yield f\n",
+ "                seen.add(f)\n",
+ "                yield from implications_for(f, seen=seen)\n",
+ "\n",
+ "# Simplified version of related_tags that sorts by overlap, especially for reinforcing a prompt with redundant tags.\n",
+ "def tags_for(*targets: str, exclude: tuple[str, ...] = (), category: int = None, samples: int = 100_000, min_overlap: int = 0, min_posts: int = 0, top: int = 30, bottom: int = 0) -> pandas.DataFrame:\n",
+ "    result = get_related_tags(targets, exclude=exclude, samples=samples)\n",
+ "    if category is not None:\n",
+ "        result = result[result[\"category\"] == category]\n",
+ "    result = result[~result[\"name\"].isin(targets)]\n",
+ "    result = result[result[\"overlap\"] >= min_overlap]\n",
+ "    result = result[result[\"post_count\"] >= min_posts]\n",
+ "    top_part = result.sort_values(\"overlap\", ascending=False)[:top]\n",
+ "    bottom_part = result.sort_values(\"overlap\", ascending=True)[:bottom].sort_values(\"overlap\", ascending=False)\n",
+ "    return pandas.concat([top_part, bottom_part]).style.pipe(format_tags)\n",
+ "\n",
+ "def parse_tags(*parts: str):\n",
+ "    for part in parts:\n",
+ "        for potential_tag in part.split(\",\"):\n",
+ "            potential_tag = potential_tag.strip().replace(\" \", \"_\").replace(\"\\\\(\", \"(\").replace(\"\\\\)\", \")\")\n",
+ "            if potential_tag == \"\":\n",
+ "                pass\n",
+ "            elif potential_tag in tags_by_name.index:\n",
+ "                yield potential_tag\n",
+ "            else:\n",
+ "                print(f\"Couldn't find tag '{potential_tag}', skipping it.\")\n",
+ "\n",
+ "def add_suggestions(suggestions: pandas.DataFrame, new_tags: str | list[str], multiplier: int, samples: int, min_posts: int):\n",
+ "    if isinstance(new_tags, str):\n",
+ "        new_tags = [new_tags]\n",
+ "    for new_tag in new_tags:\n",
+ "        related = get_related_tags((new_tag,), samples=samples)\n",
+ "        related = related[related[\"post_count\"] >= min_posts]\n",
+ "        if suggestions is None:\n",
+ "            suggestions = related.rename(columns={\"correlation\": \"score\"})\n",
+ "        else:\n",
+ "            suggestions = suggestions.join(related, rsuffix=\"r\")\n",
+ "            # This is a totally made up way to combine correlations. It keeps them from going outside the +/- 1 range, which is nice. It also makes older\n",
+ "            # tags less important every time newer ones are added. That could be considered a feature or not.\n",
+ "            suggestions[\"score\"] = numpy.real(numpy.power((numpy.sqrt(suggestions[\"score\"] + 0j) + numpy.sqrt(multiplier * suggestions[\"correlation\"] + 0j)) / 2, 2))\n",
+ "    return suggestions[[\"category\", \"name\", \"post_count\", \"score\"]]\n",
+ "\n",
+ "def big_list_suggestions(*targets: str, minus: list[str] = [], category: int = None, samples: int = 100_000, min_posts: int = 20, top: int = 30, bottom: int = 0, exclude_implied: bool = True):\n",
+ "    suggestions = None\n",
+ "    parsed_targets = list(parse_tags(*targets))\n",
+ "    for target in parsed_targets:\n",
+ "        suggestions = add_suggestions(suggestions, target, 1, samples, min_posts)\n",
+ "    parsed_minus = list(parse_tags(*minus))\n",
+ "    for target in parsed_minus:\n",
+ "        suggestions = add_suggestions(suggestions, target, -1, samples, min_posts)\n",
+ "    if category is not None:\n",
+ "        suggestions = suggestions[suggestions[\"category\"] == category]\n",
+ "    suggestions = suggestions[~suggestions[\"name\"].isin(parsed_targets)]\n",
+ "    if exclude_implied:\n",
+ "        exclude = list(implications_for(*parsed_targets)) + list(implications_for(*parsed_minus))\n",
+ "        suggestions = suggestions[~suggestions[\"name\"].isin(exclude)]\n",
+ "    suggestions = suggestions[suggestions[\"post_count\"] >= min_posts]\n",
+ "    top_part = suggestions.sort_values(\"score\", ascending=False)[:top]\n",
+ "    bottom_part = suggestions.sort_values(\"score\", ascending=True)[:bottom].sort_values(\"score\", ascending=False)\n",
+ "    return pandas.concat([top_part, bottom_part]).style.pipe(format_tags)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And now a bonus. This class can build an entire prompt from a starting tag or tags. Previous tags are used to pick new tags based on their correlations. The output is formatted to be pasted directly into the \"Prompts from file or textbox\" script in automatic1111's webui. Some examples are included with the others below. Methods:\n",
+ "* `focus` - Push future choices toward things associated with the given tag. Use this to give the builder hints about things you want to see.\n",
+ "* `include` - Like `focus`, but also adds the tag to the positive prompt.\n",
+ "* `avoid` - Push future choices _away_ from things associated with the given tag. Use this to help the builder avoid making unwanted stuff by defining what's unwanted.\n",
+ "* `exclude` - Like `avoid`, but also adds the tag to the negative prompt.\n",
+ "* `pick` - Grab a few tags from the top X of the given category, based on associations with tags already in the prompt, and add them to the list. The top lists are recalculated with each tag selected.\n",
+ "* `foreach_pick` - Grab a few tags and branch off a new prompt for each one.\n",
+ "* `pick_fast` - Instead of picking one tag at a time and recalculating the top lists each time, pick X tags from the current list. This is especially fast if it's the last step before building, because suggestions won't have to be recalculated for all those new tags.\n",
+ "* `branch` - Create X different prompts at this point without picking any new tags.\n",
+ "\n",
+ "TODO\n",
+ "* Defer everything until build is called, so a progress bar can be calculated for the entire process.\n",
+ "* Try weighing options by correlation instead of blindly picking from the top X. It worked great for the species randomizer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from typing import Callable\n",
+ "\n",
+ "\n",
+ "def pick_tags(suggestions: pandas.DataFrame, category: int, count: int, from_top: int, excluding: list[str], weighted: bool = True):\n",
+ "    options = suggestions[(True if category is None else suggestions[\"category\"] == category) & (suggestions[\"score\"] > 0) & ~suggestions[\"name\"].isin(excluding)].sort_values(\"score\", ascending=False)[:from_top]\n",
+ "    if weighted:\n",
+ "        values = list(options[\"name\"].values)\n",
+ "        weights = list(options[\"score\"].values)\n",
+ "        choices = []\n",
+ "        for _ in range(count):\n",
+ "            choice = random.choices(population=values, weights=weights, k=1)[0]\n",
+ "            weights.pop(values.index(choice))\n",
+ "            values.remove(choice)\n",
+ "            choices.append(choice)\n",
+ "        return choices\n",
+ "    else:\n",
+ "        return random.sample(list(options[\"name\"].values), count)\n",
+ "\n",
+ "def tag_to_prompt(tag: str) -> str:\n",
+ "    if tags_by_name.loc[tag][\"category\"] == CAT_ARTIST:\n",
+ "        tag = \"by \" + tag\n",
+ "    return tag.replace(\"_\", \" \").replace(\"(\", \"\\\\(\").replace(\")\", \"\\\\)\")\n",
+ "\n",
+ "# A lambda in a for loop doesn't capture variables the way I want it to, so this is a method now\n",
+ "def add_suggestions_later(suggestions: pandas.DataFrame, new_tags: str | list[str], multiplier: int, samples: int, min_posts: int):\n",
+ "    return lambda: add_suggestions(suggestions, new_tags, multiplier, samples, min_posts)\n",
+ "\n",
+ "\n",
+ "Prompt = tuple[list[str], list[str], Callable[[], pandas.DataFrame]]\n",
+ "\n",
+ "class PromptBuilder:\n",
+ "    prompts: list[Prompt]\n",
+ "    samples: int\n",
+ "    min_posts: int\n",
+ "    skip_list: list[str]\n",
+ "\n",
+ "    def __init__(self, prompts=[([], [], lambda: None)], skip=[], samples=100_000, min_posts=20):\n",
+ "        self.prompts = prompts\n",
+ "        self.samples = samples\n",
+ "        self.min_posts = min_posts\n",
+ "        self.skip_list = skip\n",
+ "\n",
+ "    def include(self, tag: str):\n",
+ "        return PromptBuilder(prompts=[\n",
+ "            (tag_list + [tag], negative_list, add_suggestions_later(suggestions(), tag, 1, self.samples, self.min_posts))\n",
+ "            for (tag_list, negative_list, suggestions) in self.prompts\n",
+ "        ], samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def focus(self, tag: str):\n",
+ "        return PromptBuilder(prompts=[\n",
+ "            (tag_list, negative_list, add_suggestions_later(suggestions(), tag, 1, self.samples, self.min_posts))\n",
+ "            for (tag_list, negative_list, suggestions) in self.prompts\n",
+ "        ], samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def exclude(self, tag: str):\n",
+ "        return PromptBuilder(prompts=[\n",
+ "            (tag_list, negative_list + [tag], add_suggestions_later(suggestions(), tag, -1, self.samples, self.min_posts))\n",
+ "            for (tag_list, negative_list, suggestions) in self.prompts\n",
+ "        ], samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def avoid(self, tag: str):\n",
+ "        return PromptBuilder(prompts=[\n",
+ "            (tag_list, negative_list, add_suggestions_later(suggestions(), tag, -1, self.samples, self.min_posts))\n",
+ "            for (tag_list, negative_list, suggestions) in self.prompts\n",
+ "        ], samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def pick(self, category: int, count: int, from_top: int):\n",
+ "        new_prompts = self.prompts\n",
+ "        for _ in range(count):\n",
+ "            new_prompts = [\n",
+ "                (tag_list + [tag], negative_list, add_suggestions_later(s, tag, 1, self.samples, self.min_posts))\n",
+ "                for (tag_list, negative_list, suggestions) in new_prompts\n",
+ "                for s in (suggestions(),)\n",
+ "                for tag in pick_tags(s, category, 1, from_top, tag_list + negative_list + self.skip_list)\n",
+ "            ]\n",
+ "        return PromptBuilder(new_prompts, samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def foreach_pick(self, category: int, count: int, from_top: int):\n",
+ "        return PromptBuilder(prompts=[\n",
+ "            (tag_list + [tag], negative_list, add_suggestions_later(s, tag, 1, self.samples, self.min_posts))\n",
+ "            for (tag_list, negative_list, suggestions) in self.prompts\n",
+ "            for s in (suggestions(),)\n",
+ "            for tag in pick_tags(s, category, count, from_top, tag_list + negative_list + self.skip_list)\n",
+ "        ], samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def pick_fast(self, category: int, count: int, from_top: int):\n",
+ "        prompts = []\n",
+ "        for (tag_list, negative_list, suggestions) in self.prompts:\n",
+ "            s = suggestions()\n",
+ "            new_tags = pick_tags(s, category, count, from_top, tag_list + negative_list + self.skip_list)\n",
+ "            prompts.append((tag_list + new_tags, negative_list, add_suggestions_later(s, new_tags, 1, self.samples, self.min_posts)))\n",
+ "        return PromptBuilder(prompts=prompts, samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def branch(self, count: int):\n",
+ "        return PromptBuilder(prompts=[prompt for prompt in self.prompts for _ in range(count)], samples=self.samples, min_posts=self.min_posts, skip=self.skip_list)\n",
+ "\n",
+ "    def build(self):\n",
+ "        for (tag_list, negative_list, _) in self.prompts:\n",
+ "            positive_prompt = \", \".join([tag_to_prompt(tag) for tag in tag_list])\n",
+ "            negative_prompt = \", \".join([tag_to_prompt(tag) for tag in negative_list])\n",
+ "            if negative_prompt:\n",
+ "                yield f\"--prompt \\\"{positive_prompt}\\\" --negative_prompt \\\"{negative_prompt}\\\"\"\n",
+ "            else:\n",
+ "                yield positive_prompt\n",
+ "\n",
+ "    def print(self):\n",
+ "        for prompt in self.build():\n",
+ "            print(prompt)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And there you go. Have fun!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Examples\n",
+ "\n",
+ "Get general tags related to vikings:\n",
+ "\n",
+ "```related_tags(\"viking\", category=CAT_GENERAL, min_posts=20)```\n",
+ "\n",
+ "Find artists who draw cheetahs:\n",
+ "\n",
+ "```related_tags(\"cheetah\", category=CAT_ARTIST, min_posts=20)```\n",
+ "\n",
+ "Arm an avali:\n",
+ "\n",
+ "```big_list_suggestions(\"avali, holding weapon, science fiction\", category=CAT_GENERAL)```\n",
+ "\n",
+ "# Prompt Builder Examples\n",
+ "\n",
+ "Use with caution; it has all of e621 to pick from...\n",
+ "\n",
+ "Start with an artist, pick four of the artist's top species, and build a quick prompt for each of them.\n",
+ "\n",
+ "```PromptBuilder().include(\"red-izak\").foreach_pick(CAT_SPECIES, 4, 20).pick_fast(CAT_GENERAL, 10, 20).print()```\n",
+ "\n",
+ "Same thing as above, but this time build the prompt much more slowly by reconsidering the list after every tag chosen. Should create more sensible prompts. Should.\n",
+ "\n",
+ "```PromptBuilder().include(\"red-izak\").foreach_pick(CAT_SPECIES, 4, 10).pick(CAT_GENERAL, 10, 20).print()```\n",
+ "\n",
+ "Start with an overall scene idea, grab four artists that might do it especially well, add two species, and augment with a few more tags.\n",
+ "\n",
+ "```PromptBuilder().include(\"beach\").include(\"volleyball\").avoid(\"sex\").foreach_pick(CAT_ARTIST, 4, 10).pick_fast(CAT_SPECIES, 2, 10).pick_fast(CAT_GENERAL, 10, 40).print()```\n",
+ "\n",
+ "Make a Halloween prompt! The skips are too-generic tags that tend to make the prompt wander away from the Halloween theme.\n",
+ "\n",
+ "```PromptBuilder(skip=[\"holidays\", \"costume\", \"food\"]).include(\"halloween\").foreach_pick(CAT_ARTIST, 4, 100).pick(None, 10, 40).print()```"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
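The `correlation` column produced by `get_related_tags` above is the phi coefficient of a 2×2 contingency table between "post has all the target tags" and "post has the candidate tag". Here's a standalone sketch of the same formula on made-up counts (the numbers are illustrative only, not real e621 data):

```python
import numpy

# Illustrative counts only.
n = 4_000_000.0   # total posts in tags_by_post
n1x = 50_000.0    # posts containing all the target tags
nx1 = 120_000.0   # posts containing the candidate tag
n11 = 30_000.0    # posts containing both

# Same formula as the notebook's summaries["correlation"] line.
phi = (n * n11 - n1x * nx1) / numpy.sqrt(n1x * nx1 * (n - n1x) * (n - nx1))
print(f"phi = {phi:.3f}")  # positive: co-occur more than chance; negative: less
```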
implications.feather ADDED
Binary file (277 kB). View file
 
posts_by_tag.feather ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:86bae28a897c20647688703601520ac1182a6605d9e4f0f0c47c59331c22c379
+ size 720730674
tags.feather ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4541d14a57d3ead9634f79cdb544a8eb3d4aae43ebfdaad84c5457a68380caf2
+ size 25311978
tags_by_post.feather ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7ed5a997d7a3c6c8eaebcf99d65c77358dde7759b0c58dad1d6faeafba7f4265
+ size 495786722
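The `.feather` entries above are git-lfs pointer files: per the LFS spec, `oid` is the SHA-256 of the real file's contents and `size` is its byte count. A hedged sketch for verifying a downloaded file against its pointer (the paths are placeholders):

```python
import hashlib
import os

def verify_lfs_pointer(pointer_path: str, file_path: str) -> bool:
    """Check a downloaded file against a git-lfs pointer (version/oid/size lines)."""
    with open(pointer_path) as f:
        fields = dict(line.split(" ", 1) for line in f.read().splitlines())
    expected_oid = fields["oid"].strip().removeprefix("sha256:")
    expected_size = int(fields["size"])
    # Hash in chunks so half-gigabyte feather files don't need to fit in memory.
    h = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_oid and os.path.getsize(file_path) == expected_size

# Hypothetical usage with placeholder paths:
# print(verify_lfs_pointer("tags.feather", "downloads/tags.feather"))
```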