Grio43 committed on
Commit
a7d780c
verified
1 Parent(s): 48792af

Release V1.1 (448x448 fine-tune, bf16 safetensors + ONNX)


Adds V1.1_safetensors/ and V1.1_onnx/ for the 448x448 fine-tune of V1 (epoch 7, step 85517 from experiments/run1_vit/checkpoints/last.pt). Updates top-level README.md and config.json to list both V1 and V1.1 variants.

README.md CHANGED
@@ -24,7 +24,7 @@ tags:
24
 
25
  A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
26
 
27
- **V1** (the from-scratch 320×320 model) is shipping now. **V1.1** (a 448×448 fine-tune of V1) is **planned for release the week of 2026-05-15** and is expected to outperform V1 on this evaluation set; final numbers will be filled in once training and eval are complete. Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
28
 
29
  A live demo is available on the companion Space: [Grio43/OppaiOracle](https://huggingface.co/spaces/Grio43/OppaiOracle).
30
 
@@ -34,19 +34,20 @@ A live demo is available on the companion Space: [Grio43/OppaiOracle](https://hu
34
 
35
  | Checkpoint | Native resolution | How it was produced | When to use |
36
  |---|---|---|---|
37
- | **V1** *(available now)* | 320×320 | Trained **from scratch** at 320×320. This is the model's native resolution. | Use this today. Also the right pick if you are running inference at 320×320 or if throughput matters. |
38
- | **V1.1** *(coming \~2026-05-15)* | 448×448 | A **fine-tune of V1** at 448×448. Position embeddings were interpolated from the 20×20 grid to 28×28, optimizer state was reset, and training continued at the new resolution following the FixRes / DeiT III progressive-resolution recipe. | Use when you specifically want 448×448 inference for finer spatial detail (small accessories, eye details). Internal expectation is that V1.1 will outperform V1 on the same eval set, but final numbers will only be published after V1.1 training completes — until then, treat V1 as the reference checkpoint. |
39
 
40
  Two practical notes:
41
 
42
- - **Match input resolution to the checkpoint.** Feeding 448×448 images to V1, or 320×320 images to V1.1 once it ships, will give worse results than matching them. The position-embedding grid is fixed at load time.
43
- - **V1 is not deprecated by V1.1.** They are siblings with different operating points, not generations of the same model.
44
- - **Until V1.1 ships, V1 is the only released checkpoint.** Numbers and recommendations in this card refer to V1 unless explicitly labeled V1.1.
45
 
46
  ### Files in this repo
47
 
48
  - `V1_safetensors/` — V1 in `safetensors` format with `config.json` and `preprocessing.json`. Use this for PyTorch / custom inference.
49
  - `V1_onnx/` — V1 exported to ONNX. Use this for ONNX Runtime inference (CPU, DirectML, CUDA EP).
50
  - Each variant directory ships `vocabulary.json`, `selected_tags.csv`, `pr_thresholds.json`, and a copy of this README.
51
 
52
  ---
@@ -162,9 +163,29 @@ On my evaluation set this model achieves:
162
 
163
  Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This is consistent with the missing-tag bias described above — high-frequency concepts are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
164
 
165
- ### V1.1 headline numbers (pending)
166
 
167
- V1.1 has not finished training yet. Final Macro F1 / Micro F1 / mAP and the per-bucket mAP breakdown will be added here once V1.1 is released (planned for the week of 2026-05-15). Internal expectation is that V1.1 will exceed V1 on this eval set at its native 448×448 input resolution, but no measured numbers are claimed in this document until they exist.
168
 
169
  I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.
170
 
@@ -174,7 +195,7 @@ I want to be honest about *why* I think it performs well: **it is almost certain
174
 
175
  For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (\~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
176
 
177
- | Augmentation | V1 (320×320, from scratch, 40 epochs planned) | V1.1 (448×448, fine-tune of V1, 15 epochs) |
178
  |---|---|---|
179
  | Horizontal flip | p = 0.5 | p = 0.5 |
180
  | Color jitter — brightness | 0.30 (p = 0.5) | 0.22 (p = 0.5) |
 
24
 
25
  A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
26
 
27
+ Two checkpoints are released here. **V1** is the from-scratch 320×320 model. **V1.1** is a 448×448 fine-tune of V1 and on this evaluation set posts a modest mAP gain over V1 (overall val/mAP 0.674 vs. 0.614, ~+6 points absolute, ~+10% relative). The fine-tune helps across every frequency bucket but does not transform results — both checkpoints inherit the same source-data label noise. Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
28
 
29
  A live demo is available on the companion Space: [Grio43/OppaiOracle](https://huggingface.co/spaces/Grio43/OppaiOracle).
30
 
 
34
 
35
  | Checkpoint | Native resolution | How it was produced | When to use |
36
  |---|---|---|---|
37
+ | **V1** | 320×320 | Trained **from scratch** at 320×320. This is the model's native resolution. | The right pick if you are running inference at 320×320 or if throughput matters. |
38
+ | **V1.1** | 448×448 | A **fine-tune of V1** at 448×448. Position embeddings were interpolated from the 20×20 grid to 28×28, optimizer state was reset, and training continued at the new resolution following the FixRes / DeiT III progressive-resolution recipe. Trained for 6 of 15 planned epochs and stopped early — see *Performance notes / V1.1 headline numbers* below for the rationale. | Use when you specifically want 448×448 inference for finer spatial detail (small accessories, eye details). V1.1 outperforms V1 on every frequency bucket of this eval set (numbers in the *Performance notes* section), but the gain is modest — V1 remains a fully reasonable choice if you are running at 320×320. |
39
 
40
  Two practical notes:
41
 
42
+ - **Match input resolution to the checkpoint.** Feeding 448×448 images to V1, or 320×320 images to V1.1, will give worse results than matching them. The position-embedding grid is fixed at load time.
43
+ - **V1 is not deprecated by V1.1.** They are siblings with different operating points, not generations of the same model. The V1.1 mAP gain over V1 is real but small (~+6 points overall) — pick on resolution, not on the assumption that V1.1 is strictly better.
 
44
 
45
  ### Files in this repo
46
 
47
  - `V1_safetensors/` — V1 in `safetensors` format with `config.json` and `preprocessing.json`. Use this for PyTorch / custom inference.
48
  - `V1_onnx/` — V1 exported to ONNX. Use this for ONNX Runtime inference (CPU, DirectML, CUDA EP).
49
+ - `V1.1_safetensors/` — V1.1 in `safetensors` format with `config.json` and `preprocessing.json`.
50
+ - `V1.1_onnx/` — V1.1 exported to ONNX.
51
  - Each variant directory ships `vocabulary.json`, `selected_tags.csv`, `pr_thresholds.json`, and a copy of this README.
52
 
53
  ---
 
163
 
164
  Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This is consistent with the missing-tag bias described above — high-frequency concepts are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
165
 
166
+ ### V1.1 headline numbers (e6/15, Phase 2, 448×448, 19,292 tags)
167
 
168
+ | Metric | Value |
169
+ |---|---|
170
+ | Overall val/mAP | 0.674 |
171
+
172
+ F1 / P=R numbers are intentionally not reported alongside this row — see the *Why F1 numbers are not reported for V1.1* paragraph below for the calibration reason.
173
+
174
+ **mAP broken out by tag frequency bucket — V1 vs. V1.1 on the same eval set:**
175
+
176
+ | Frequency bucket | V1 mAP | V1.1 mAP | Δ |
177
+ |---|---|---|---|
178
+ | 500–999 (rare) | 0.589 | 0.645 | +0.056 |
179
+ | 1K–5K (mid) | 0.598 | 0.656 | +0.058 |
180
+ | 5K–10K (head) | 0.535 | 0.595 | +0.060 |
181
+ | 10K+ (very common) | 0.542 | 0.606 | +0.064 |
182
+ | **Overall** | **0.614** | **0.674** | **+0.060** |
183
+
184
+ The same rare-vs-head inversion noted for V1 (rare/mid > head/very-common on mAP) is still present in V1.1, and for the same reason — high-frequency tags are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
185
+
186
+ **Why V1.1 stopped at 6 of 15 planned epochs.** Per-epoch mAP growth decelerated from ~+0.7%/epoch in early Phase 2 to ~+0.3%/epoch by epoch 5, while validation loss continued to fall and per-tag calibration shifted (mean activations per image dropped from ~4500 at epoch 0 to ~4200 at epoch 5, but the auto-stop F1 metric is calibration-floored at a fixed threshold of 0.2653 and therefore unreliable as a stop signal — see [TRAINING_HEALTH_TRACKER.md](../TRAINING_HEALTH_TRACKER.md)). At that growth rate, the remaining 9 epochs would have been operating in the regime where it is no longer cleanly distinguishable whether mAP gains are *real ranking improvement* or *memorization of the labeled subset of a noisy multi-label corpus* (the missing-positive bias documented earlier in this card sets a soft ceiling somewhere in this neighbourhood). Continuing was unlikely to buy enough real gain to justify the extra training time, so V1.1 ships at the epoch-5 / step-81822 checkpoint.
187
+
188
+ **Why F1 numbers are not reported for V1.1.** V1.1's loss configuration (`gamma_neg=7.0`, `clip=0.2`) shifts the logit distribution relative to V1's (`gamma_neg=4.0`, `clip=0.05`). The in-training F1 metric uses a fixed threshold (0.2653) calibrated against V1's distribution, so V1.1's in-training F1 values are calibration-floored and not comparable to V1's reported F1. Reporting them alongside V1's would invite the wrong comparison. mAP, on the other hand, is threshold-independent and the V1 vs. V1.1 mAP comparison above is apples-to-apples — that is the comparison this card stands behind.
189
 
190
  I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.
191
 
 
195
 
196
  For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (\~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
197
 
198
+ | Augmentation | V1 (320×320, from scratch, 40 epochs planned, 33 trained) | V1.1 (448×448, fine-tune of V1, 15 epochs planned, 6 trained) |
199
  |---|---|---|
200
  | Horizontal flip | p = 0.5 | p = 0.5 |
201
  | Color jitter — brightness | 0.30 (p = 0.5) | 0.22 (p = 0.5) |
V1.1_onnx/README.md ADDED
@@ -0,0 +1,259 @@
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: image-classification
4
+ language:
5
+ - en
6
+ tags:
7
+ - anime
8
+ - anime-tagger
9
+ - tagger
10
+ - image-tagging
11
+ - multi-label
12
+ - multi-label-classification
13
+ - vision-transformer
14
+ - vit
15
+ - illustration
16
+ - danbooru
17
+ - safetensors
18
+ - onnx
19
+ ---
20
+
21
+
22
+
23
+ ## TL;DR
24
+
25
+ A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
26
+
27
+ Two checkpoints are released here. **V1** is the from-scratch 320×320 model. **V1.1** is a 448×448 fine-tune of V1 and on this evaluation set posts a modest mAP gain over V1 (overall val/mAP 0.674 vs. 0.614, ~+6 points absolute, ~+10% relative). The fine-tune helps across every frequency bucket but does not transform results — both checkpoints inherit the same source-data label noise. Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
28
+
29
+ A live demo is available on the companion Space: [Grio43/OppaiOracle](https://huggingface.co/spaces/Grio43/OppaiOracle).
30
+
31
+ ---
32
+
33
+ ## Variants — which checkpoint should I use?
34
+
35
+ | Checkpoint | Native resolution | How it was produced | When to use |
36
+ |---|---|---|---|
37
+ | **V1** | 320×320 | Trained **from scratch** at 320×320. This is the model's native resolution. | The right pick if you are running inference at 320×320 or if throughput matters. |
38
+ | **V1.1** | 448×448 | A **fine-tune of V1** at 448×448. Position embeddings were interpolated from the 20×20 grid to 28×28, optimizer state was reset, and training continued at the new resolution following the FixRes / DeiT III progressive-resolution recipe. Trained for 6 of 15 planned epochs and stopped early — see *Performance notes / V1.1 headline numbers* below for the rationale. | Use when you specifically want 448×448 inference for finer spatial detail (small accessories, eye details). V1.1 outperforms V1 on every frequency bucket of this eval set (numbers in the *Performance notes* section), but the gain is modest — V1 remains a fully reasonable choice if you are running at 320×320. |
39
+
40
+ Two practical notes:
41
+
42
+ - **Match input resolution to the checkpoint.** Feeding 448×448 images to V1, or 320×320 images to V1.1, will give worse results than matching them. The position-embedding grid is fixed at load time.
43
+ - **V1 is not deprecated by V1.1.** They are siblings with different operating points, not generations of the same model. The V1.1 mAP gain over V1 is real but small (~+6 points overall) — pick on resolution, not on the assumption that V1.1 is strictly better.
44
+
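The V1.1 row in the table above describes interpolating V1's position embeddings from the 20×20 grid to 28×28 before fine-tuning at 448×448. A minimal sketch of that step, assuming a standard ViT layout with a single leading CLS token and bicubic resampling (the exact tensor layout in these checkpoints is an assumption, not documented here):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int = 20, new_grid: int = 28) -> torch.Tensor:
    """Resample ViT position embeddings of shape (1, 1 + old_grid**2, dim) to a larger patch grid."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so F.interpolate can resample spatially
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)
```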
45
+ ### Files in this repo
46
+
47
+ - `V1_safetensors/` — V1 in `safetensors` format with `config.json` and `preprocessing.json`. Use this for PyTorch / custom inference.
48
+ - `V1_onnx/` — V1 exported to ONNX. Use this for ONNX Runtime inference (CPU, DirectML, CUDA EP).
49
+ - `V1.1_safetensors/` — V1.1 in `safetensors` format with `config.json` and `preprocessing.json`.
50
+ - `V1.1_onnx/` — V1.1 exported to ONNX.
51
+ - Each variant directory ships `vocabulary.json`, `selected_tags.csv`, `pr_thresholds.json`, and a copy of this README.
52
+
53
+ ---
54
+
55
+ ## How this model came to be
56
+
57
+ I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
58
+
59
+ - **Removed \~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
60
+ - **Added \~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
61
+
62
+ That is \~1.3M corrections in total, which is only on the order of **\~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
63
+
64
+ I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by \~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
65
+
66
+ The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
67
+
68
+ ---
69
+
70
+ ## What "cleaned" actually means (and what it does not)
71
+
72
+ This is the most important section of this release. The cleaning was real work, but it was not omniscient, and the dataset still has structured, category-level label noise that you will see in the model's outputs. Most of these issues are inherited directly from the **public source datasets** — they are not new noise introduced during cleaning; they are pre-existing patterns that the cleaning pass touched but did not resolve at the category level.
73
+
74
+ The categories below are **illustrative, not exhaustive.** Many other tag families show similarly deep-rooted issues. Two failure modes show up across most of them, but they are not equal in size:
75
+
76
+ - **Missing tags (by far the dominant problem)** — concepts that are clearly present in an image but were never tagged at the source. This is the single biggest source of noise in the entire dataset. See the dedicated subsection below for the empirical scale.
77
+ - **Wrong tags (not uncommon, but secondary)** — visually similar concepts confused with each other in the source data (the bow / bowtie / ribbon / ascot / necktie cluster, color buckets, length and size buckets). These are real and plentiful, just not the dominant failure mode.
78
+
79
+ ### Missing tags (the dominant noise mode)
80
+
81
+ If you only remember one thing from this section, remember this: **the biggest single problem in the source data is not wrong tags, it is missing tags.** Wrong tags are not uncommon either, but they are dwarfed in volume by labels that should be present and simply aren't.
82
+
83
+ A rough empirical sense of the gap, from manual review:
84
+
85
+ - A typical image in this dataset arrives with roughly **\~28 tags** from the source.
86
+ - A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
87
+ - During spot-checks I have routinely taken images that arrived with **\~40 tags up past 60 tags** just by adding common, obviously-present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
88
+
89
+ So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added \~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
90
+
91
+ The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
92
+
93
+ ### Color tags
94
+
95
+ Color-named tags (eye color, hair color, general color tags) are **poorly tagged at the source**, and the noise that survived cleaning is dataset-wide. Every color tag in the vocabulary has some version of this problem; some are worse than others.
96
+
97
+ - **Obvious failures were cleanable.** A bright, unambiguous yellow mislabeled as `blue_eyes` is exactly the kind of disagreement the AI-assisted pass catches, and those got fixed. The residual noise is not the obvious-failure kind.
98
+ - **The deep-rooted issue is perceptual, not technical.** The category boundaries between color tags are drawn by *human viewers*, not by RGB codes. Different taggers carve up the spectrum differently, and any single color tag in this dataset covers a fairly wide perceptual band of that color. There is no clean RGB threshold I could have used to mechanically separate the categories, which is exactly why manual cleaning at the category level is intractable.
99
+ - **Adjacent / overlapping colors leak into each other in predictable patterns.** Some examples I have observed:
100
+ - `aqua_*` tags heavily pollute both **blue**- and **green**-based tags — aqua sits perceptually between them and gets sorted into all three buckets across the corpus.
101
+ - `yellow_*` tags overlap meaningfully with **red** and **orange** tags — warm-spectrum boundaries are inconsistent in the source data.
102
+ - Similar patterns exist for purple/blue/pink, brown/orange/red, and black/very-dark-anything.
103
+ - Color tags are also **high frequency**, so the noise is spread across millions of images rather than concentrated where it could be hand-fixed.
104
+ - When I sampled live in-the-wild images and compared the model's predictions to a careful human reading, the same source-data confusion patterns were still present in the predictions. The model is faithfully reproducing the source-data label distribution, which is itself noisy along the color axis.
105
+
106
+ ### Hair length
107
+
108
+ The hair length tags — `very_short_hair`, `short_hair`, `medium_hair`, `long_hair`, `very_long_hair` — all have major boundary issues. `long_hair` and `very_long_hair` are the worst offenders; the source labels routinely disagree with each other across visually similar images. The model inherits this confusion.
109
+
110
+ ### Other "objective size" body-part tags
111
+
112
+ The same problem applies to tags that sound objective but are really continuous and judgement-dependent: `flat_chest`, `small_breasts`, `medium_breasts`, `large_breasts`, `huge_breasts`. These are inherently noisy supervision targets for a classifier — adjacent buckets are not crisply separable in the source data, and the model cannot do better than the labels it was given.
113
+
114
+ ### Neckwear and small accessories (bows, bowties, ribbons, ascots, neckties)
115
+
116
+ This cluster of tags has systemic issues at the source. `bow`, `bowtie`, `ribbon`, `ascot`, and `necktie` are visually similar but distinct accessories, and the public source data routinely confuses them — the same physical object will be tagged differently across images, and adjacent categories leak into each other in both directions. The cleaning pass touched obvious mistakes here but did not normalize the category boundaries; the model learns the same fuzzy boundaries the source data has.
117
+
118
+ These five are the cluster I happened to look at closely. Many other small-accessory and clothing-detail tags show the same pattern — visually similar items, fuzzy source-data boundaries, residual confusion in the model. Treat any prediction in this category as a *suggestion* to inspect, not a final answer.
119
+
120
+ ### Character-vs-concept leakage
121
+
122
+ For some tags, the data is dominated by a small number of characters. When that happens, the model tends to learn **the character** rather than **the concept** the tag was meant to represent. Without a curated golden-standard set that deliberately decouples the concept from those characters, this is very hard to fix at training time.
123
+
124
+ ### My estimate of cleaning quality
125
+
126
+ The 300k removals and \~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves are **<5% error**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements and the AI shares the same color/length/size confusion as the source data does.
127
+
128
+ ---
129
+
130
+ ## How to use this model responsibly
131
+
132
+ - **Human review every output.** This applies most strongly to color, hair length, and size-bucket tags. The model is a fast first pass, not an authoritative labeler.
133
+ - **Treat sibling tags as a group, not a hard pick.** If the model emits `blue_eyes` with high confidence, also check the `purple_eyes` / `aqua_eyes` / `black_eyes` scores before you commit.
134
+ - **Do not use the raw output as ground-truth for downstream training** without manual review. The very confusion patterns that this model can't resolve will get baked into your downstream model.
135
+ - **For thresholding, prefer per-tag thresholds over a single global threshold.** Different tag families have very different precision/recall behavior on this dataset.
136
+
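As a concrete illustration of the second and fourth bullets, here is a sketch that applies the global micro P=R threshold shipped in `pr_thresholds.json` and then surfaces sibling color scores for human review. The `scores` dict and the `SIBLING_GROUPS` mapping are hypothetical illustrations, not files shipped with the model:

```python
import json

with open("V1.1_onnx/pr_thresholds.json") as f:
    global_thr = json.load(f)["micro"]["pr_breakeven"]["threshold"]

# Hypothetical sibling grouping for review; define your own per tag family.
SIBLING_GROUPS = [["blue_eyes", "purple_eyes", "aqua_eyes", "black_eyes"]]

def review(scores):
    """scores: {tag_name: probability}. Returns accepted tags, printing sibling clusters to inspect."""
    accepted = {t for t, s in scores.items() if s >= global_thr}
    for group in SIBLING_GROUPS:
        if accepted & set(group):
            print({t: round(scores.get(t, 0.0), 3) for t in group})
    return accepted
```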
137
+ ---
138
+
139
+ ## Performance notes
140
+
141
+ On my evaluation set this model achieves:
142
+
143
+ - The best **precision-equals-recall** point I have measured among comparable open anime taggers.
144
+ - A solid **mAP** relative to the same comparison set.
145
+
146
+ ### V1 headline numbers (e27/40, Phase 1, 320×320, 19,292 tags)
147
+
148
+ | Metric | Value |
149
+ |---|---|
150
+ | Macro F1 | 0.588 |
151
+ | Micro F1 | 0.659 |
152
+ | P=R threshold (macro / micro) | 0.614 / 0.670 |
153
+ | Overall val/mAP | 0.614 |
154
+
155
+ **mAP broken out by tag frequency bucket:**
156
+
157
+ | Frequency bucket | mAP |
158
+ |---|---|
159
+ | 500–999 (rare) | 0.589 |
160
+ | 1K–5K (mid) | 0.598 |
161
+ | 5K–10K (head) | 0.535 |
162
+ | 10K+ (very common) | 0.542 |
163
+
164
+ Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This is consistent with the missing-tag bias described above — high-frequency concepts are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
165
+
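For anyone reproducing the bucketed breakdowns in this section, a sketch of the computation, assuming mAP here means the mean of per-tag average precision, with `scores`, `labels`, and `train_freq` as hypothetical arrays (the evaluation harness itself is not part of this release):

```python
import numpy as np
from sklearn.metrics import average_precision_score

BUCKETS = {"500-999": (500, 1_000), "1K-5K": (1_000, 5_000),
           "5K-10K": (5_000, 10_000), "10K+": (10_000, np.inf)}

def bucketed_map(scores: np.ndarray, labels: np.ndarray, train_freq: np.ndarray) -> dict:
    """scores/labels: (num_images, num_tags); train_freq: (num_tags,) positive count per tag."""
    out = {}
    for name, (lo, hi) in BUCKETS.items():
        idx = np.where((train_freq >= lo) & (train_freq < hi))[0]
        aps = [average_precision_score(labels[:, t], scores[:, t])
               for t in idx if labels[:, t].any()]  # skip tags with no validation positives
        out[name] = float(np.mean(aps)) if aps else float("nan")
    return out
```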
166
+ ### V1.1 headline numbers (e6/15, Phase 2, 448×448, 19,292 tags)
167
+
168
+ | Metric | Value |
169
+ |---|---|
170
+ | Overall val/mAP | 0.674 |
171
+
172
+ F1 / P=R numbers are intentionally not reported alongside this row — see the *Why F1 numbers are not reported for V1.1* paragraph below for the calibration reason.
173
+
174
+ **mAP broken out by tag frequency bucket — V1 vs. V1.1 on the same eval set:**
175
+
176
+ | Frequency bucket | V1 mAP | V1.1 mAP | Δ |
177
+ |---|---|---|---|
178
+ | 500–999 (rare) | 0.589 | 0.645 | +0.056 |
179
+ | 1K–5K (mid) | 0.598 | 0.656 | +0.058 |
180
+ | 5K–10K (head) | 0.535 | 0.595 | +0.060 |
181
+ | 10K+ (very common) | 0.542 | 0.606 | +0.064 |
182
+ | **Overall** | **0.614** | **0.674** | **+0.060** |
183
+
184
+ The same rare-vs-head inversion noted for V1 (rare/mid > head/very-common on mAP) is still present in V1.1, and for the same reason — high-frequency tags are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
185
+
186
+ **Why V1.1 stopped at 6 of 15 planned epochs.** Per-epoch mAP growth decelerated from ~+0.7%/epoch in early Phase 2 to ~+0.3%/epoch by epoch 5, while validation loss continued to fall and per-tag calibration shifted (mean activations per image dropped from ~4500 at epoch 0 to ~4200 at epoch 5, but the auto-stop F1 metric is calibration-floored at a fixed threshold of 0.2653 and therefore unreliable as a stop signal — see [TRAINING_HEALTH_TRACKER.md](../TRAINING_HEALTH_TRACKER.md)). At that growth rate, the remaining 9 epochs would have been operating in the regime where it is no longer cleanly distinguishable whether mAP gains are *real ranking improvement* or *memorization of the labeled subset of a noisy multi-label corpus* (the missing-positive bias documented earlier in this card sets a soft ceiling somewhere in this neighbourhood). Continuing was unlikely to buy enough real gain to justify the extra training time, so V1.1 ships at the epoch-5 / step-81822 checkpoint.
187
+
188
+ **Why F1 numbers are not reported for V1.1.** V1.1's loss configuration (`gamma_neg=7.0`, `clip=0.2`) shifts the logit distribution relative to V1's (`gamma_neg=4.0`, `clip=0.05`). The in-training F1 metric uses a fixed threshold (0.2653) calibrated against V1's distribution, so V1.1's in-training F1 values are calibration-floored and not comparable to V1's reported F1. Reporting them alongside V1's would invite the wrong comparison. mAP, on the other hand, is threshold-independent and the V1 vs. V1.1 mAP comparison above is apples-to-apples — that is the comparison this card stands behind.
189
+
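For context on the `gamma_neg` / `clip` values quoted above: they are the asymmetric-focusing and probability-shifting parameters of the ASL loss cited in the training recipe. A minimal sketch with V1.1's reported values (`gamma_pos = 0` is an assumption borrowed from the original ASL paper; the actual training code may differ):

```python
import torch

def asymmetric_loss(logits, targets, gamma_neg=7.0, gamma_pos=0.0, clip=0.2, eps=1e-8):
    """ASL sketch: `clip` shifts negative probabilities down; `gamma_neg` down-weights easy negatives."""
    p = torch.sigmoid(logits)
    p_neg = (p - clip).clamp(min=0)  # probability shifting on the negative side
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))
    loss_neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=eps))
    return -(loss_pos + loss_neg).sum(dim=-1).mean()
```

Raising `gamma_neg` from 4.0 to 7.0 and `clip` from 0.05 to 0.2 suppresses easy negatives far more aggressively, which is exactly why the output distribution, and any fixed threshold calibrated against V1, shifts.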
190
+ I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.
191
+
192
+ ---
193
+
194
+ ## Image augmentation settings (V1 and V1.1)
195
+
196
+ For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (\~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
197
+
198
+ | Augmentation | V1 (320×320, from scratch, 40 epochs planned, 33 trained) | V1.1 (448×448, fine-tune of V1, 15 epochs planned, 6 trained) |
199
+ |---|---|---|
200
+ | Horizontal flip | p = 0.5 | p = 0.5 |
201
+ | Color jitter — brightness | 0.30 (p = 0.5) | 0.22 (p = 0.5) |
202
+ | Color jitter — contrast | 0.20 (p = 0.5) | 0.15 (p = 0.5) |
203
+ | Color jitter — saturation | 0.08 (p = 0.5) | 0.06 (p = 0.5) |
204
+ | Random rotation | p = 0.50, ±[2°, 8°], bicubic | p = 0.30, ±[2°, 5°], bicubic |
205
+ | Gaussian blur | p = 0.30, kernel = 3, σ ∈ [0.1, 1.5] | p = 0.15, kernel = 3, σ ∈ [0.1, 1.0] |
206
+ | Random erasing | disabled | disabled |
207
+ | Normalization (mean / std) | [0.5, 0.5, 0.5] / [0.5, 0.5, 0.5] | [0.5, 0.5, 0.5] / [0.5, 0.5, 0.5] |
208
+ | Letterbox pad color | [114, 114, 114] | [114, 114, 114] |
209
+
210
+ Notes on a few of these choices:
211
+
212
+ - **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (\~¼ of brightness) is taken from BYOL's asymmetric augmentation.
213
+ - **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
214
+ - **Gaussian blur is also kept on at V1.1** for the same reason (under-converged base + reduced color/rotation aug → strips too much input variability if blur is dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
215
+ - **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
216
+
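A hedged torchvision reconstruction of the V1.1 column of the table above. Letterboxing to 448×448 with the [114, 114, 114] pad color is assumed to happen before these ops, and torchvision's `RandomRotation` samples from a symmetric range, so the ±[2°, 5°] band is approximated by (-5°, 5°):

```python
from torchvision import transforms

v1_1_train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    # The table gives an independent p = 0.5 per color-jitter axis, so each axis gets its own RandomApply.
    transforms.RandomApply([transforms.ColorJitter(brightness=0.22)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.15)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.06)], p=0.5),
    transforms.RandomApply([transforms.RandomRotation(
        degrees=5, interpolation=transforms.InterpolationMode.BICUBIC)], p=0.30),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0))], p=0.15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```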
217
+ ---
218
+
219
+ ## Limitations summary
220
+
221
+ | Area | Severity | Notes |
222
+ |---|---|---|
223
+ | Color tags (eye/hair/general) | **High** | Source-data noise survives; sibling colors leak into each other |
224
+ | Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
225
+ | Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
226
+ | Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
227
+ | Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical \~28 tags/image vs. 50+ that should be present. \~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
228
+ | Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
229
+ | Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
230
+ | Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |
231
+
232
+ ---
233
+
234
+ ## What's next (V2)
235
+
236
+ Once a refreshed 2026-vintage source dataset becomes available, I plan to start work on V2. The biggest single change between V1 and V2 will not be the model — it will be **substantially more time spent on data cleaning before training begins**, with a particular focus on:
237
+
238
+ - Building a curated **golden-standard slice** for color tags, hair-length tags, and size-bucket tags so those categories can be supervised against deliberately disambiguated examples.
239
+ - Deeper character/concept decoupling so character-overwhelmed tags learn the actual concept.
240
+ - Better measurement of "true" performance on a hand-relabeled validation slice, so the headline metrics are not silently inflated by the same missing-positive bias that affects the training data.
241
+
242
+ V1 ships with the noise it ships with. V2 is where I plan to do something about it.
243
+
244
+ ---
245
+
246
+ ## Acknowledgments
247
+
248
+ - **SmilingWolf** for the ViT v3 tagger, which made the initial cleaning pass tractable. None of this would have been feasible without an existing strong tagger to use as a second opinion.
249
+ - The broader anime-tagger open-source community for the public tag corpora and prior model checkpoints I compared against.
250
+
251
+ ---
252
+
253
+ ## License / usage
254
+
255
+ Released under the **Apache License 2.0**. You may use, modify, and redistribute the model and accompanying files for personal, research, or commercial purposes, provided you retain the license notice and attribution.
256
+
257
+ **Intended use.** Research and downstream tooling for multi-label tagging of anime / illustration imagery.
258
+
259
+ **Out-of-scope use.** Decisions about real people; safety-critical pipelines that depend on label correctness without human review; training a downstream model on raw outputs without manual review (the missing-tag bias described above will propagate).
V1.1_onnx/config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "format": "onnx",
3
+ "architecture_type": "vit",
4
+ "num_labels": 19294,
5
+ "num_channels": 3,
6
+ "image_size": 448,
7
+ "patch_size": 16,
8
+ "hidden_size": 1024,
9
+ "num_hidden_layers": 18,
10
+ "num_attention_heads": 16,
11
+ "intermediate_size": 4096,
12
+ "num_groups": 20,
13
+ "tags_per_group": 10000,
14
+ "training_epoch": 7,
15
+ "training_step": 85517,
16
+ "vocab_format_version": 1,
17
+ "vocab_sha256": "ad3c33d3b760bd0d15bd4631f441d47fcb136c7a6e53473b5588d760907b0316",
18
+ "onnx_opset_version": 20,
19
+ "onnx_inputs": [
20
+ "pixel_values",
21
+ "padding_mask"
22
+ ],
23
+ "onnx_outputs": [
24
+ "probabilities"
25
+ ],
26
+ "dynamic_batch": true,
27
+ "preprocessing_file": "preprocessing.json",
28
+ "thresholds_file": "pr_thresholds.json",
29
+ "vocabulary_file": "vocabulary.json",
30
+ "checkpoint_source": "experiments/run1_vit/checkpoints/last.pt (epoch 7, step 85517), exported to ONNX"
31
+ }
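The `vocab_sha256` field above invites an integrity check before inference. A small sketch, assuming the hash is taken over the raw bytes of `vocabulary.json` (the hashing convention is not documented in this commit, so treat a mismatch as a prompt to investigate rather than proof of corruption):

```python
import hashlib
import json
from pathlib import Path

cfg = json.loads(Path("V1.1_onnx/config.json").read_text())
digest = hashlib.sha256(Path("V1.1_onnx/vocabulary.json").read_bytes()).hexdigest()
assert digest == cfg["vocab_sha256"], f"vocabulary.json hash mismatch: {digest}"
```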
V1.1_onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8567852deb135eccfe4b8445d48e4476ee8846436486679adc0642cfeda07d13
3
+ size 993246982
V1.1_onnx/pr_thresholds.json ADDED
@@ -0,0 +1,102 @@
1
+ {
2
+ "checkpoint": "L:\\Dab\\OppaiOracle\\experiments\\run1_vit\\checkpoints\\last.pt",
3
+ "checkpoint_epoch": 7,
4
+ "checkpoint_step": 85517,
5
+ "val_samples": 30000,
6
+ "num_tags_evaluated": 19292,
7
+ "skip_indices": [
8
+ 0,
9
+ 1
10
+ ],
11
+ "sweep": {
12
+ "min": 0.001,
13
+ "max": 0.999,
14
+ "step": 0.001,
15
+ "count": 999
16
+ },
17
+ "micro": {
18
+ "pr_breakeven": {
19
+ "threshold": 0.7926946841478348,
20
+ "precision": 0.6989787817001343,
21
+ "recall": 0.6989787817001343,
22
+ "f1": 0.6989787816996342,
23
+ "interpolated": true
24
+ },
25
+ "f1_optimal": {
26
+ "threshold": 0.805,
27
+ "precision": 0.7304752469062805,
28
+ "recall": 0.6735844016075134,
29
+ "f1": 0.7008772492408752
30
+ }
31
+ },
32
+ "macro_single_threshold": {
33
+ "support_ge_0": {
34
+ "tags_evaluated": 19292,
35
+ "tags_total": 19292,
36
+ "pr_breakeven": {
37
+ "threshold": 0.7596096462607383,
38
+ "precision": 0.5969547033309937,
39
+ "recall": 0.5969547033309937,
40
+ "f1": 0.5969547033304936,
41
+ "interpolated": true
42
+ },
43
+ "f1_optimal": {
44
+ "threshold": 0.757,
45
+ "precision": 0.5935518145561218,
46
+ "recall": 0.6006690859794617,
47
+ "f1": 0.5715920925140381
48
+ }
49
+ },
50
+ "support_ge_1": {
51
+ "tags_evaluated": 18334,
52
+ "tags_total": 19292,
53
+ "pr_breakeven": {
54
+ "threshold": 0.7596096495985984,
55
+ "precision": 0.6281471848487854,
56
+ "recall": 0.6281471848487854,
57
+ "f1": 0.6281471848482854,
58
+ "interpolated": true
59
+ },
60
+ "f1_optimal": {
61
+ "threshold": 0.757,
62
+ "precision": 0.6245664358139038,
63
+ "recall": 0.6320556402206421,
64
+ "f1": 0.601459264755249
65
+ }
66
+ },
67
+ "support_ge_5": {
68
+ "tags_evaluated": 10406,
69
+ "tags_total": 19292,
70
+ "pr_breakeven": {
71
+ "threshold": 0.7532626820802688,
72
+ "precision": 0.5863859057426453,
73
+ "recall": 0.5863859057426453,
74
+ "f1": 0.5863859057421452,
75
+ "interpolated": true
76
+ },
77
+ "f1_optimal": {
78
+ "threshold": 0.754,
79
+ "precision": 0.5886526703834534,
80
+ "recall": 0.5845012068748474,
81
+ "f1": 0.5600905418395996
82
+ }
83
+ }
84
+ },
85
+ "per_tag": {
86
+ "min_support": 5,
87
+ "pr_summary": {
88
+ "tags_with_support": 10406,
89
+ "mean_threshold": 0.7503859471750784,
90
+ "median_threshold": 0.751,
91
+ "p25_threshold": 0.714,
92
+ "p75_threshold": 0.7883596608266235,
93
+ "mean_precision": 0.5908225646441903,
94
+ "mean_recall": 0.5908225646241425
95
+ },
96
+ "f1_summary": {
97
+ "mean_threshold": 0.7558264462809917,
98
+ "median_threshold": 0.759,
99
+ "mean_f1": 0.6410075075736091
100
+ }
101
+ }
102
+ }
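A short sketch of pulling the most useful operating points out of this file, with key paths exactly as in the JSON above; treating `skip_indices` as placeholder (non-tag) entries is an assumption:

```python
import json
import numpy as np

with open("V1.1_onnx/pr_thresholds.json") as f:
    t = json.load(f)

thr = t["micro"]["f1_optimal"]["threshold"]  # 0.805, the best micro-F1 operating point
skip = t["skip_indices"]                     # indices excluded from evaluation (assumed placeholders)

probs = np.random.rand(1, 19294).astype(np.float32)  # stand-in for the model's output row
valid = np.ones(probs.shape[-1], dtype=bool)
valid[skip] = False
predicted_indices = np.where((probs[0] >= thr) & valid)[0]
```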
V1.1_onnx/preprocessing.json ADDED
@@ -0,0 +1,54 @@
1
+ {
2
+ "image_size": 448,
3
+ "patch_size": 16,
4
+ "num_channels": 3,
5
+ "color_order": "RGB",
6
+ "resize_mode": "letterbox",
7
+ "pad_color_rgb": [
8
+ 114,
9
+ 114,
10
+ 114
11
+ ],
12
+ "normalize_mean": [
13
+ 0.5,
14
+ 0.5,
15
+ 0.5
16
+ ],
17
+ "normalize_std": [
18
+ 0.5,
19
+ 0.5,
20
+ 0.5
21
+ ],
22
+ "input_dtype": "float32",
23
+ "input_layout": "BCHW",
24
+ "onnx_inputs": {
25
+ "pixel_values": {
26
+ "shape": "(batch_size, 3, 448, 448)",
27
+ "dtype": "float32",
28
+ "description": "Letterboxed and normalized image tensor. Preprocessing is NOT in the graph; do it externally."
29
+ },
30
+ "padding_mask": {
31
+ "shape": "(batch_size, 448, 448)",
32
+ "dtype": "bool",
33
+ "description": "True = padded pixel, False = valid pixel. Pass an all-False mask if your image fills the frame."
34
+ }
35
+ },
36
+ "onnx_outputs": {
37
+ "probabilities": {
38
+ "shape": "(batch_size, 19294)",
39
+ "dtype": "float32",
40
+ "activation": "sigmoid (already applied inside the graph)"
41
+ }
42
+ },
43
+ "opset_version": 20,
44
+ "dynamic_batch": true,
45
+ "embedded_metadata": {
46
+ "vocabulary": "Embedded as gzip+base64 in the ONNX metadata_props (key: vocab_b64_gzip).",
47
+ "tags_csv": "selected_tags.csv mirrors index_to_tag for SmilingWolf-style tagger UIs."
48
+ },
49
+ "notes": [
50
+ "Letterbox resize keeps aspect ratio; pad with the RGB color above to reach 448x448.",
51
+ "Normalize per-channel: (x/255 - mean) / std after letterboxing.",
52
+ "Recommended thresholds are in pr_thresholds.json (per-tag and global)."
53
+ ]
54
+ }
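Putting the contract above together, a minimal ONNX Runtime sketch; the file paths and `example.png` are placeholders, and this is an illustration consistent with `preprocessing.json`, not shipped inference code:

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

def letterbox(img: Image.Image, size: int = 448, pad=(114, 114, 114)):
    """Aspect-preserving resize, then pad to size x size; mask is True on padded pixels."""
    img = img.convert("RGB")
    scale = size / max(img.size)
    nw, nh = round(img.width * scale), round(img.height * scale)
    canvas = Image.new("RGB", (size, size), pad)
    ox, oy = (size - nw) // 2, (size - nh) // 2
    canvas.paste(img.resize((nw, nh), Image.BICUBIC), (ox, oy))
    mask = np.ones((size, size), dtype=bool)
    mask[oy:oy + nh, ox:ox + nw] = False
    return canvas, mask

sess = ort.InferenceSession("V1.1_onnx/model.onnx", providers=["CPUExecutionProvider"])
image, mask = letterbox(Image.open("example.png"))
x = (np.asarray(image, dtype=np.float32) / 255.0 - 0.5) / 0.5  # (x/255 - mean) / std
x = x.transpose(2, 0, 1)[None]                                 # BCHW, batch of 1
probs = sess.run(["probabilities"], {"pixel_values": x, "padding_mask": mask[None]})[0]
```

The embedded vocabulary can likewise be recovered from the graph metadata under the documented `vocab_b64_gzip` key (the decoded payload is assumed to be JSON mirroring `vocabulary.json`):

```python
import base64, gzip, json
import onnx

model = onnx.load("V1.1_onnx/model.onnx")
meta = {p.key: p.value for p in model.metadata_props}
vocab = json.loads(gzip.decompress(base64.b64decode(meta["vocab_b64_gzip"])))
```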
V1.1_onnx/selected_tags.csv ADDED
The diff for this file is too large to render. See raw diff
 
V1.1_onnx/vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
V1.1_safetensors/README.md ADDED
@@ -0,0 +1,259 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: image-classification
4
+ language:
5
+ - en
6
+ tags:
7
+ - anime
8
+ - anime-tagger
9
+ - tagger
10
+ - image-tagging
11
+ - multi-label
12
+ - multi-label-classification
13
+ - vision-transformer
14
+ - vit
15
+ - illustration
16
+ - danbooru
17
+ - safetensors
18
+ - onnx
19
+ ---
20
+
21
+
22
+
23
+ ## TL;DR
24
+
25
+ A multi-label anime tagger trained from scratch on a \~5.9M image dataset that received a targeted cleaning and vocabulary-expansion pass before training. The corrections touched roughly **1.3M tags** — large in absolute terms, but only on the order of **\~3% of all tags** in the corpus, so this is best described as a *targeted* cleaning rather than a heavy one. The pass was deliberately weighted toward **low-frequency tags**, which is where mislabels and missing labels hurt a tagger the most. On my evaluation set the model achieves the best precision-equals-recall point and a good mAP relative to comparable open tagger checkpoints, but the underlying training data still contains category-level noise that no amount of training would have erased. **All predictions should be human-reviewed before they are trusted.**
26
+
27
+ Two checkpoints are released here. **V1** is the from-scratch 320×320 model. **V1.1** is a 448×448 fine-tune of V1 and on this evaluation set posts a modest mAP gain over V1 (overall val/mAP 0.674 vs. 0.614, ~+6 points absolute, ~+10% relative). The fine-tune helps across every frequency bucket but does not transform results — both checkpoints inherit the same source-data label noise. Pick the checkpoint whose native resolution matches the resolution you intend to feed it (see *Variants* below).
28
+
29
+ A live demo is available on the companion Space: [Grio43/OppaiOracle](https://huggingface.co/spaces/Grio43/OppaiOracle).
30
+
31
+ ---
32
+
33
+ ## Variants — which checkpoint should I use?
34
+
35
+ | Checkpoint | Native resolution | How it was produced | When to use |
36
+ |---|---|---|---|
37
+ | **V1** | 320×320 | Trained **from scratch** at 320×320. This is the model's native resolution. | The right pick if you are running inference at 320×320 or if throughput matters. |
38
+ | **V1.1** | 448×448 | A **fine-tune of V1** at 448×448. Position embeddings were interpolated from the 20×20 grid to 28×28, optimizer state was reset, and training continued at the new resolution following the FixRes / DeiT III progressive-resolution recipe. Trained for 6 of 15 planned epochs and stopped early — see *Performance notes / V1.1 headline numbers* below for the rationale. | Use when you specifically want 448×448 inference for finer spatial detail (small accessories, eye details). V1.1 outperforms V1 on every frequency bucket of this eval set (numbers in the *Performance notes* section), but the gain is modest — V1 remains a fully reasonable choice if you are running at 320×320. |
39
+
40
+ Two practical notes:
41
+
42
+ - **Match input resolution to the checkpoint.** Feeding 448×448 images to V1, or 320×320 images to V1.1, will give worse results than matching them. The position-embedding grid is fixed at load time.
43
+ - **V1 is not deprecated by V1.1.** They are siblings with different operating points, not generations of the same model. The V1.1 mAP gain over V1 is real but small (~+6 points overall) — pick on resolution, not on the assumption that V1.1 is strictly better.
44
+
45
+ ### Files in this repo
46
+
47
+ - `V1_safetensors/` — V1 in `safetensors` format with `config.json` and `preprocessing.json`. Use this for PyTorch / custom inference.
48
+ - `V1_onnx/` — V1 exported to ONNX. Use this for ONNX Runtime inference (CPU, DirectML, CUDA EP).
49
+ - `V1.1_safetensors/` — V1.1 in `safetensors` format with `config.json` and `preprocessing.json`.
50
+ - `V1.1_onnx/` — V1.1 exported to ONNX.
51
+ - Each variant directory ships `vocabulary.json`, `selected_tags.csv`, `pr_thresholds.json`, and a copy of this README.
52
+
53
+ ---
54
+
55
+ ## How this model came to be
56
+
57
+ I started with a corpus of roughly **5.9 million images** with publicly-sourced tags. Before training anything of my own, I used **SmilingWolf's ViT v3 tagger** to help clean the dataset. With that pipeline I:
58
+
59
+ - **Removed \~300k incorrect tags** from images where the public labels disagreed with the AI tagger and a human spot-check confirmed the public labels were wrong.
60
+ - **Added \~1,000,000 missing tags** in the same fashion — places where the AI tagger surfaced a label the public tag set had simply omitted, and human review agreed.
61
+
62
+ That is \~1.3M corrections in total, which is only on the order of **\~3% of the tags in the corpus**. This was a *targeted* pass, not a top-to-bottom relabel. Effort was deliberately concentrated on **low-frequency tags**, on the assumption that mislabels and missing labels do disproportionate damage in the long tail — a missing label on a tag with 800 positives in the entire dataset matters far more than a missing label on a tag with 800k positives.
63
+
64
+ I then trained a small "light" model on this cleaned dataset, primarily as a vehicle to **expand the tag vocabulary by \~20,000 additional low-frequency tags** that the original tag set under-represented. That expanded vocabulary is what the released model was trained against.
65
+
66
+ The released checkpoint is the main training run on the cleaned dataset with the expanded vocabulary.
67
+
68
+ ---
69
+
70
+ ## What "cleaned" actually means (and what it does not)
71
+
72
+ This is the most important section of this release. The cleaning was real work, but it was not omniscient, and the dataset still has structured, category-level label noise that you will see in the model's outputs. Most of these issues are inherited directly from the **publicly-sourced source datasets** — they are not new noise introduced during cleaning; they are pre-existing patterns that the cleaning pass touched but did not resolve at the category level.
73
+
74
+ The categories below are **illustrative, not exhaustive.** Many other tag families show similarly deep-rooted issues. Two failure modes show up across most of them, but they are not equal in size:
75
+
76
+ - **Missing tags (by far the dominant problem)** — concepts that are clearly present in an image but were never tagged at the source. This is the single biggest source of noise in the entire dataset. See the dedicated subsection below for the empirical scale.
77
+ - **Wrong tags (not uncommon, but secondary)** — visually similar concepts confused with each other in the source data (the bow / bowtie / ribbon / ascot / necktie cluster, color buckets, length and size buckets). These are real and plentiful, just not the dominant failure mode.
78
+
79
+ ### Missing tags (the dominant noise mode)
80
+
81
+ If you only remember one thing from this section, remember this: **the biggest single problem in the source data is not wrong tags, it is missing tags.** Wrong tags are not uncommon either, but they are dwarfed in volume by labels that should be present and simply aren't.
82
+
83
+ A rough empirical sense of the gap, from manual review:
84
+
85
+ - A typical image in this dataset arrives with roughly **\~28 tags** from the source.
86
+ - A reasonably-tagged image — judged by what is actually visible, sticking to common in-vocabulary concepts and not reaching for rare tags — should have **50+ tags**, often more.
87
+ - During spot-checks I have routinely taken images that arrived with **\~40 tags up past 60 tags** just by adding common, obviously-present concepts. That is without making any effort to surface rare tags; including those would push the number higher still.
88
+
89
+ So the source tag count is on the order of **half** of what a careful tagger would emit on the same image, and the gap is concentrated in concepts that are not subjective — they are simply omissions. The cleaning pass added \~1M missing tags back, but with the gap this large there are many millions still missing across the corpus.
90
+
91
+ The training-time consequence is that for every missing-but-present tag, the model receives **no positive gradient at all** for that concept on that image — only an implicit negative through the loss. This systematically biases the model toward under-predicting any tag with a high source-data omission rate, and the effect is uneven across tags: some tag families are well-tagged at the source and some are very sparsely tagged. Practically, this means **low predicted scores are less informative than they look** — a tag scoring below threshold may be genuinely absent, or it may be a concept the model has learned is "usually unlabeled even when present."
92
+
93
+ ### Color tags
94
+
95
+ Color-named tags (eye color, hair color, general color tags) are **poorly tagged at the source**, and the noise that survived cleaning is dataset-wide. Every color tag in the vocabulary has some version of this problem; some are worse than others.
96
+
97
+ - **Obvious failures were cleanable.** A bright, unambiguous yellow mislabeled as `blue_eyes` is exactly the kind of disagreement the AI-assisted pass catches, and those got fixed. The residual noise is not the obvious-failure kind.
98
+ - **The deep-rooted issue is perceptual, not technical.** The category boundaries between color tags are drawn by *human viewers*, not by RGB codes. Different taggers carve up the spectrum differently, and any single color tag in this dataset covers a fairly wide perceptual band of that color. There is no clean RGB threshold I could have used to mechanically separate the categories, which is exactly why manual cleaning at the category level is intractable.
99
+ - **Adjacent / overlapping colors leak into each other in predictable patterns.** Some examples I have observed:
100
+ - `aqua_*` tags heavily pollute both **blue** and **green** based tags — aqua sits perceptually between them and gets sorted into all three buckets across the corpus.
101
+ - `yellow_*` tags overlap meaningfully with **red** and **orange** tags — warm-spectrum boundaries are inconsistent in the source data.
102
+ - Similar patterns exist for purple/blue/pink, brown/orange/red, and black/very-dark-anything.
103
+ - Color tags are also **high frequency**, so the noise is spread across millions of images rather than concentrated where it could be hand-fixed.
104
+ - When I sampled live in-the-wild images and compared the model's predictions to a careful human reading, the same source-data confusion patterns were still present in the predictions. The model is faithfully reproducing the source-data label distribution, which is itself noisy along the color axis.
105
+
+ ### Hair length
+
+ The hair-length tags — `very_short_hair`, `short_hair`, `medium_hair`, `long_hair`, `very_long_hair` — all have major boundary issues. `long_hair` and `very_long_hair` are the worst offenders; the source labels routinely disagree with each other across visually similar images. The model inherits this confusion.
+
+ ### Other "objective size" body-part tags
+
+ The same problem applies to tags that sound objective but are really continuous and judgment-dependent: `flat_chest`, `small_breasts`, `medium_breasts`, `large_breasts`, `huge_breasts`. These are inherently noisy supervision targets for a classifier — adjacent buckets are not crisply separable in the source data, and the model cannot do better than the labels it was given.
+
+ ### Neckwear and small accessories (bows, bowties, ribbons, ascots, neckties)
+
+ This cluster of tags has systemic issues at the source. `bow`, `bowtie`, `ribbon`, `ascot`, and `necktie` are visually similar but distinct accessories, and the public source data routinely confuses them — the same physical object will be tagged differently across images, and adjacent categories leak into each other in both directions. The cleaning pass touched obvious mistakes here but did not normalize the category boundaries; the model learns the same fuzzy boundaries the source data has.
+
+ These five are the cluster I happened to look at closely. Many other small-accessory and clothing-detail tags show the same pattern — visually similar items, fuzzy source-data boundaries, residual confusion in the model. Treat any prediction in this category as a *suggestion* to inspect, not a final answer.
+
+ ### Character-vs-concept leakage
+
+ For some tags, the data is dominated by a small number of characters. When that happens, the model tends to learn **the character** rather than **the concept** the tag was meant to represent. Without a curated golden-standard set that deliberately decouples the concept from those characters, this is very hard to fix at training time.
+
+ ### My estimate of cleaning quality
+
+ The 300k removals and \~1M additions were **AI-assisted and then human-reviewed by me**. My honest estimate is that the corrections themselves carry an error rate of **under 5%**. That is a statement about the *changes I made*, not about the *underlying dataset* — the underlying dataset still contains the structured noise described above, because cleaning was driven by AI-flagged disagreements, and the AI shares the same color/length/size confusions as the source data.
+
+ ---
+
+ ## How to use this model responsibly
+
+ - **Human-review every output.** This applies most strongly to color, hair-length, and size-bucket tags. The model is a fast first pass, not an authoritative labeler.
+ - **Treat sibling tags as a group, not a hard pick.** If the model emits `blue_eyes` with high confidence, also check the `purple_eyes` / `aqua_eyes` / `black_eyes` scores before you commit (see the sketch after this list).
+ - **Do not use the raw output as ground truth for downstream training** without manual review. The very confusion patterns this model cannot resolve will get baked into your downstream model.
+ - **For thresholding, prefer per-tag thresholds over a single global threshold.** Different tag families have very different precision/recall behavior on this dataset.
+
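+ A minimal sketch of the sibling-group check; the tag group, scores, and helper name here are illustrative, not part of the model's API:
+
+ ```python
+ EYE_COLORS = ["blue_eyes", "aqua_eyes", "green_eyes", "purple_eyes", "black_eyes"]
+
+ def pick_with_review(probs: dict, group: list, margin: float = 0.15):
+     """Return (best_tag, needs_human_review) for one sibling-tag group."""
+     ranked = sorted(group, key=lambda t: probs.get(t, 0.0), reverse=True)
+     best, runner_up = ranked[0], ranked[1]
+     # If two siblings score within `margin` of each other, the model is likely
+     # reproducing a source-data confusion pattern: flag it for a human.
+     return best, probs.get(best, 0.0) - probs.get(runner_up, 0.0) < margin
+
+ probs = {"blue_eyes": 0.81, "aqua_eyes": 0.74, "green_eyes": 0.12}
+ print(pick_with_review(probs, EYE_COLORS))  # ('blue_eyes', True) -> inspect by hand
+ ```
+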
+ ---
+
+ ## Performance notes
+
+ On my evaluation set this model achieves:
+
+ - The best **precision-equals-recall** point I have measured among comparable open anime taggers.
+ - A solid **mAP** relative to the same comparison set.
+
+ ### V1 headline numbers (e27/40, Phase 1, 320×320, 19,292 tags)
+
+ | Metric | Value |
+ |---|---|
+ | Macro F1 | 0.588 |
+ | Micro F1 | 0.659 |
+ | P=R threshold (macro / micro) | 0.614 / 0.670 |
+ | Overall val/mAP | 0.614 |
+
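+ For readers unfamiliar with the P=R row: it is the global threshold at which precision equals recall on the validation set, found by sweeping candidate thresholds (the shipped `pr_thresholds.json` records the same 0.001-step sweep). A sketch of that sweep, with placeholder names for the validation matrices:
+
+ ```python
+ import numpy as np
+
+ def pr_breakeven(y_true: np.ndarray, y_score: np.ndarray, step: float = 0.001):
+     """Sweep a global threshold over (num_samples, num_tags) arrays and return
+     (threshold, precision, recall) where micro P and R are closest."""
+     best = None
+     for t in np.arange(step, 1.0, step):
+         pred = y_score >= t
+         tp = np.logical_and(pred, y_true).sum()
+         precision = tp / max(pred.sum(), 1)   # precision rises as t grows...
+         recall = tp / max(y_true.sum(), 1)    # ...while recall falls
+         if best is None or abs(precision - recall) < best[0]:
+             best = (abs(precision - recall), t, precision, recall)
+     return best[1:]
+ ```
+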
+ **mAP broken out by tag frequency bucket:**
+
+ | Frequency bucket | mAP |
+ |---|---|
+ | 500–999 (rare) | 0.589 |
+ | 1K–5K (mid) | 0.598 |
+ | 5K–10K (head) | 0.535 |
+ | 10K+ (very common) | 0.542 |
+
+ Note the inversion: rare/mid tags out-score head/very-common tags on mAP. This is consistent with the missing-tag bias described above — high-frequency concepts are the ones most often present-but-unlabeled in the source data, which depresses their measured precision against a noisy reference.
+
+ ### V1.1 headline numbers (e6/15, Phase 2, 448×448, 19,292 tags)
+
+ | Metric | Value |
+ |---|---|
+ | Overall val/mAP | 0.674 |
+
+ F1 / P=R numbers are intentionally not reported alongside this row — see the *Why F1 numbers are not reported for V1.1* paragraph below for the calibration reason.
+
+ **mAP broken out by tag frequency bucket — V1 vs. V1.1 on the same eval set:**
+
+ | Frequency bucket | V1 mAP | V1.1 mAP | Δ |
+ |---|---|---|---|
+ | 500–999 (rare) | 0.589 | 0.645 | +0.056 |
+ | 1K–5K (mid) | 0.598 | 0.656 | +0.058 |
+ | 5K–10K (head) | 0.535 | 0.595 | +0.060 |
+ | 10K+ (very common) | 0.542 | 0.606 | +0.064 |
+ | **Overall** | **0.614** | **0.674** | **+0.060** |
+
+ The rare-vs-head inversion noted for V1 (rare/mid > head/very-common on mAP) is still present in V1.1, for the same missing-label reason.
+
+ **Why V1.1 stopped at 6 of 15 planned epochs.** Per-epoch mAP growth decelerated from \~+0.7%/epoch early in Phase 2 to \~+0.3%/epoch by epoch 5, while validation loss continued to fall. Per-tag calibration also shifted: mean activations per image dropped from \~4500 at epoch 0 to \~4200 at epoch 5, but the auto-stop F1 metric is calibration-floored at a fixed threshold of 0.2653 and therefore unreliable as a stop signal — see [TRAINING_HEALTH_TRACKER.md](../TRAINING_HEALTH_TRACKER.md). At that growth rate, the remaining 9 epochs would have been operating in a regime where it is no longer cleanly distinguishable whether mAP gains are *real ranking improvement* or *memorization of the labeled subset of a noisy multi-label corpus* (the missing-positive bias documented earlier in this card sets a soft ceiling somewhere in this neighborhood). Continuing was unlikely to buy enough real gain to justify the extra training time, so V1.1 ships at the epoch-5 / step-81822 checkpoint.
+
+ **Why F1 numbers are not reported for V1.1.** V1.1's loss configuration (`gamma_neg=7.0`, `clip=0.2`) shifts the logit distribution relative to V1's (`gamma_neg=4.0`, `clip=0.05`). The in-training F1 metric uses a fixed threshold (0.2653) calibrated against V1's distribution, so V1.1's in-training F1 values are calibration-floored and not comparable to V1's reported F1; reporting them alongside V1's would invite the wrong comparison. mAP, on the other hand, is threshold-independent, so the V1 vs. V1.1 mAP comparison above is apples-to-apples — that is the comparison this card stands behind.
+
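+ The threshold-(in)dependence claim is easy to demonstrate: a monotonic calibration shift leaves average precision untouched but moves F1 at a fixed threshold (toy data, scikit-learn for brevity):
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import average_precision_score, f1_score
+
+ y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
+ scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
+ shifted = scores * 0.5  # a monotonic "calibration shift": ranking unchanged
+
+ # AP depends only on the ranking, so it is identical for both score sets.
+ print(average_precision_score(y_true, scores) ==
+       average_precision_score(y_true, shifted))  # True
+
+ # F1 at a fixed threshold changes with calibration — the V1-vs-V1.1 problem.
+ t = 0.35
+ print(f1_score(y_true, scores >= t), f1_score(y_true, shifted >= t))  # differ
+ ```
+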
+ I want to be honest about *why* I think it performs well: **it is almost certainly not because of a special training regimen.** The training recipe is grounded in standard ViT-from-scratch literature (DeiT / DeiT III / FixRes / ASL / AugReg) without exotic tricks. The most likely explanation is simply that the **input dataset is cleaner** than what most comparable taggers were trained on. If you are trying to reproduce or beat this result, I would put your effort into data curation before you put it into training-recipe tuning.
+
+ ---
+
+ ## Image augmentation settings (V1 and V1.1)
+
+ For reproducibility, here are the exact augmentation pipelines used for each checkpoint. V1.1 is a fine-tune of V1, so its augmentation is a *reduced* version of V1's — narrower ranges and lower probabilities at the higher 448×448 resolution. The reductions follow EfficientNetV2 / FixRes guidance for progressive-resolution training, but only partially (\~¼ reduction rather than ½), because Phase 1 stopped at 33/40 epochs and the V1 base was under-converged when V1.1 began.
+
+ | Augmentation | V1 (320×320, from scratch, 40 epochs planned, 33 trained) | V1.1 (448×448, fine-tune of V1, 15 epochs planned, 6 trained) |
+ |---|---|---|
+ | Horizontal flip | p = 0.5 | p = 0.5 |
+ | Color jitter — brightness | 0.30 (p = 0.5) | 0.22 (p = 0.5) |
+ | Color jitter — contrast | 0.20 (p = 0.5) | 0.15 (p = 0.5) |
+ | Color jitter — saturation | 0.08 (p = 0.5) | 0.06 (p = 0.5) |
+ | Random rotation | p = 0.50, ±[2°, 8°], bicubic | p = 0.30, ±[2°, 5°], bicubic |
+ | Gaussian blur | p = 0.30, kernel = 3, σ ∈ [0.1, 1.5] | p = 0.15, kernel = 3, σ ∈ [0.1, 1.0] |
+ | Random erasing | disabled | disabled |
+ | Normalization (mean / std) | [0.5, 0.5, 0.5] / [0.5, 0.5, 0.5] | [0.5, 0.5, 0.5] / [0.5, 0.5, 0.5] |
+ | Letterbox pad color | [114, 114, 114] | [114, 114, 114] |
+
+ Notes on a few of these choices (a torchvision sketch of the V1.1 pipeline follows the list):
+
+ - **Saturation is held well below brightness/contrast** in both phases. Saturation is the only color-jitter axis that directly attacks color-named tag identity (`blue_eyes`, `pink_skin`, etc.); brightness and contrast are luminance-driven and largely chroma-safe. The ratio (\~¼ of brightness) is taken from BYOL's asymmetric augmentation.
+ - **Rotation is kept on at V1.1**, against the plain FixRes recommendation. The original plan was to disable it at 448 for spatial precision, but with V1 under-converged it was safer to keep a residual rotational-invariance signal. The compromise was a tighter angle band (±5° vs. ±8°) and a lower fire rate (0.30 vs. 0.50).
+ - **Gaussian blur is also kept on at V1.1** for the same reason (an under-converged base plus reduced color/rotation augmentation would strip too much input variability if blur were dropped entirely). Frequency was halved and the σ ceiling pulled in from 1.5 to 1.0.
+ - **No mixup, no cutmix, no RandAugment, no random erasing** in either phase. The recipe is intentionally close to DeiT III's "3-Augment" regime (flip + color jitter + blur) plus a small rotation, not a heavy AugReg/RandAugment stack.
+
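+ A sketch of the V1.1 column in torchvision terms. The letterbox helper is a reconstruction rather than the repo's code, and `RandomRotation(5)` draws uniformly from [−5°, +5°] instead of the sign-symmetric ±[2°, 5°] band above:
+
+ ```python
+ from PIL import Image
+ from torchvision import transforms
+ from torchvision.transforms import InterpolationMode
+
+ def letterbox(img: Image.Image, size: int = 448, pad=(114, 114, 114)) -> Image.Image:
+     """Aspect-preserving resize onto a gray canvas (pad color from the table)."""
+     scale = size / max(img.width, img.height)
+     w, h = round(img.width * scale), round(img.height * scale)
+     canvas = Image.new("RGB", (size, size), pad)
+     canvas.paste(img.resize((w, h), Image.BICUBIC), ((size - w) // 2, (size - h) // 2))
+     return canvas
+
+ v1_1_train = transforms.Compose([
+     transforms.Lambda(lambda im: letterbox(im, 448)),
+     transforms.RandomHorizontalFlip(p=0.5),
+     transforms.RandomApply(
+         [transforms.ColorJitter(brightness=0.22, contrast=0.15, saturation=0.06)], p=0.5),
+     transforms.RandomApply(
+         [transforms.RandomRotation(5, interpolation=InterpolationMode.BICUBIC, fill=114)], p=0.3),
+     transforms.RandomApply([transforms.GaussianBlur(3, sigma=(0.1, 1.0))], p=0.15),
+     transforms.ToTensor(),                       # uint8 HWC -> float32 CHW in [0, 1]
+     transforms.Normalize([0.5] * 3, [0.5] * 3),  # -> [-1, 1]
+ ])
+ ```
+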
+ ---
+
+ ## Limitations summary
+
+ | Area | Severity | Notes |
+ |---|---|---|
+ | Color tags (eye/hair/general) | **High** | Source-data noise survives; sibling colors leak into each other |
+ | Hair length (especially `long_hair`, `very_long_hair`) | **High** | Boundary tags inherently noisy in source |
+ | Size-bucket body-part tags | **High** | Continuous quantity discretized into noisy buckets |
+ | Neckwear (`bow`, `bowtie`, `ribbon`, `ascot`, `necktie`) | **High** | Visually similar accessories routinely confused at source; representative of a broader small-accessory pattern |
+ | Missing tags (concept present, no label) | **Dominant** | The single biggest source of noise in the corpus. Typical \~28 tags/image vs. 50+ that should be present. \~1M added back during cleaning; many millions remain. Hurts performance broadly and biases the model toward under-prediction. |
+ | Character-overwhelmed tags | **Medium** | Some tags are learned as proxies for specific characters |
+ | Rare / low-frequency tags | **Medium** | The +20k vocabulary expansion helps, but tail tags still see fewer examples |
+ | Anything not on the above list | Use with normal caution | The above are illustrative, not exhaustive — many tag families show similar source-data issues |
+
+ ---
+
+ ## What's next (V2)
+
+ Once a refreshed 2026-vintage source dataset becomes available, I plan to start work on V2. The biggest single change between V1 and V2 will not be the model — it will be **substantially more time spent on data cleaning before training begins**, with a particular focus on:
+
+ - Building a curated **golden-standard slice** for color tags, hair-length tags, and size-bucket tags, so those categories can be supervised against deliberately disambiguated examples.
+ - Deeper character/concept decoupling, so character-overwhelmed tags learn the actual concept.
+ - Better measurement of "true" performance on a hand-relabeled validation slice, so the headline metrics are not silently inflated by the same missing-positive bias that affects the training data.
+
+ V1 ships with the noise it ships with. V2 is where I plan to do something about it.
+
+ ---
+
+ ## Acknowledgments
+
+ - **SmilingWolf** for the ViT v3 tagger, which made the initial cleaning pass tractable. None of this would have been feasible without an existing strong tagger to use as a second opinion.
+ - The broader anime-tagger open-source community for the public tag corpora and prior model checkpoints I compared against.
+
+ ---
+
+ ## License / usage
+
+ Released under the **Apache License 2.0**. You may use, modify, and redistribute the model and accompanying files for personal, research, or commercial purposes, provided you retain the license notice and attribution.
+
+ **Intended use.** Research and downstream tooling for multi-label tagging of anime / illustration imagery.
+
+ **Out-of-scope use.** Decisions about real people; safety-critical pipelines that depend on label correctness without human review; training a downstream model on raw outputs without manual review (the missing-tag bias described above will propagate).
V1.1_safetensors/config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "architecture_type": "vit",
+   "num_labels": 19294,
+   "num_channels": 3,
+   "image_size": 448,
+   "patch_size": 16,
+   "hidden_size": 1024,
+   "num_hidden_layers": 18,
+   "num_attention_heads": 16,
+   "intermediate_size": 4096,
+   "hidden_dropout_prob": 0.1,
+   "pos_dropout": 0.0,
+   "attention_dropout": 0.05,
+   "drop_path_rate": 0.2,
+   "initializer_range": 0.02,
+   "layer_norm_eps": 1e-06,
+   "use_fp32_layernorm": false,
+   "attention_bias": true,
+   "num_groups": 20,
+   "tags_per_group": 10000,
+   "training_epoch": 7,
+   "training_step": 85517,
+   "vocab_format_version": 1,
+   "vocab_sha256": "ad3c33d3b760bd0d15bd4631f441d47fcb136c7a6e53473b5588d760907b0316",
+   "state_dict_keys_format": "plain (no _orig_mod./module. prefixes)",
+   "state_dict_dtype": "bfloat16",
+   "checkpoint_source": "experiments/run1_vit/checkpoints/last.pt (epoch 7, step 85517)"
+ }
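
A minimal load of these weights, assuming `safetensors` is installed; the model class itself is not published in this repo, so the final `load_state_dict` target is left abstract:

```python
import json
from safetensors.torch import load_file

cfg = json.load(open("V1.1_safetensors/config.json"))
state = load_file("V1.1_safetensors/model.safetensors")

print(next(iter(state.values())).dtype)  # torch.bfloat16, per "state_dict_dtype"
# Keys are plain (no "_orig_mod." / "module." prefixes), so once you have a
# model built from `cfg`, `model.load_state_dict(state)` should apply directly.
```
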
V1.1_safetensors/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ced687b2e866997c572987371d9733d9f0d7b7e1e5771d23e692e74788bf4226
+ size 496226548
V1.1_safetensors/pr_thresholds.json ADDED
@@ -0,0 +1,102 @@
+ {
+   "checkpoint": "L:\\Dab\\OppaiOracle\\experiments\\run1_vit\\checkpoints\\last.pt",
+   "checkpoint_epoch": 7,
+   "checkpoint_step": 85517,
+   "val_samples": 30000,
+   "num_tags_evaluated": 19292,
+   "skip_indices": [
+     0,
+     1
+   ],
+   "sweep": {
+     "min": 0.001,
+     "max": 0.999,
+     "step": 0.001,
+     "count": 999
+   },
+   "micro": {
+     "pr_breakeven": {
+       "threshold": 0.7926946841478348,
+       "precision": 0.6989787817001343,
+       "recall": 0.6989787817001343,
+       "f1": 0.6989787816996342,
+       "interpolated": true
+     },
+     "f1_optimal": {
+       "threshold": 0.805,
+       "precision": 0.7304752469062805,
+       "recall": 0.6735844016075134,
+       "f1": 0.7008772492408752
+     }
+   },
+   "macro_single_threshold": {
+     "support_ge_0": {
+       "tags_evaluated": 19292,
+       "tags_total": 19292,
+       "pr_breakeven": {
+         "threshold": 0.7596096462607383,
+         "precision": 0.5969547033309937,
+         "recall": 0.5969547033309937,
+         "f1": 0.5969547033304936,
+         "interpolated": true
+       },
+       "f1_optimal": {
+         "threshold": 0.757,
+         "precision": 0.5935518145561218,
+         "recall": 0.6006690859794617,
+         "f1": 0.5715920925140381
+       }
+     },
+     "support_ge_1": {
+       "tags_evaluated": 18334,
+       "tags_total": 19292,
+       "pr_breakeven": {
+         "threshold": 0.7596096495985984,
+         "precision": 0.6281471848487854,
+         "recall": 0.6281471848487854,
+         "f1": 0.6281471848482854,
+         "interpolated": true
+       },
+       "f1_optimal": {
+         "threshold": 0.757,
+         "precision": 0.6245664358139038,
+         "recall": 0.6320556402206421,
+         "f1": 0.601459264755249
+       }
+     },
+     "support_ge_5": {
+       "tags_evaluated": 10406,
+       "tags_total": 19292,
+       "pr_breakeven": {
+         "threshold": 0.7532626820802688,
+         "precision": 0.5863859057426453,
+         "recall": 0.5863859057426453,
+         "f1": 0.5863859057421452,
+         "interpolated": true
+       },
+       "f1_optimal": {
+         "threshold": 0.754,
+         "precision": 0.5886526703834534,
+         "recall": 0.5845012068748474,
+         "f1": 0.5600905418395996
+       }
+     }
+   },
+   "per_tag": {
+     "min_support": 5,
+     "pr_summary": {
+       "tags_with_support": 10406,
+       "mean_threshold": 0.7503859471750784,
+       "median_threshold": 0.751,
+       "p25_threshold": 0.714,
+       "p75_threshold": 0.7883596608266235,
+       "mean_precision": 0.5908225646441903,
+       "mean_recall": 0.5908225646241425
+     },
+     "f1_summary": {
+       "mean_threshold": 0.7558264462809917,
+       "median_threshold": 0.759,
+       "mean_f1": 0.6410075075736091
+     }
+   }
+ }
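
A quick reader for the operating points above; note that this file carries *global* thresholds plus per-tag summaries, not a threshold per individual tag:

```python
import json

with open("V1.1_safetensors/pr_thresholds.json") as f:
    t = json.load(f)

f1_opt = t["micro"]["f1_optimal"]       # best single global threshold by micro-F1
breakeven = t["micro"]["pr_breakeven"]  # threshold where micro P == micro R
print(f1_opt["threshold"], round(f1_opt["f1"], 3))  # 0.805, 0.701
print(round(breakeven["threshold"], 3))             # 0.793
```
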
V1.1_safetensors/preprocessing.json ADDED
@@ -0,0 +1,41 @@
+ {
+   "image_size": 448,
+   "patch_size": 16,
+   "num_channels": 3,
+   "color_order": "RGB",
+   "resize_mode": "letterbox",
+   "pad_color_rgb": [
+     114,
+     114,
+     114
+   ],
+   "normalize_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "normalize_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "input_dtype": "float32",
+   "input_layout": "BCHW",
+   "padding_mask": {
+     "required_for_pytorch_forward": true,
+     "shape": "(B, H, W)",
+     "dtype": "bool",
+     "convention": "True = padded pixel, False = valid pixel",
+     "all_false_equivalent_to": "no masking"
+   },
+   "output": {
+     "name": "tag_logits",
+     "shape": "(B, 19294)",
+     "activation": "apply sigmoid for probabilities"
+   },
+   "notes": [
+     "Letterbox resize keeps aspect ratio; pad with the RGB color above to reach 448x448.",
+     "Normalize per-channel: (x/255 - mean) / std after letterboxing.",
+     "Built-in recommended thresholds are in pr_thresholds.json (per-tag and global)."
+   ]
+ }
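
A sketch of inference-time preprocessing per this file, including the padding mask the PyTorch forward requires (True = padded). The helper is a reconstruction under the stated conventions; centering the letterboxed image on the canvas is an assumption:

```python
import numpy as np
from PIL import Image

def preprocess(path: str, size: int = 448):
    img = Image.open(path).convert("RGB")                     # color_order: RGB
    scale = size / max(img.width, img.height)                 # keep aspect ratio
    w, h = round(img.width * scale), round(img.height * scale)
    canvas = Image.new("RGB", (size, size), (114, 114, 114))  # pad_color_rgb
    ox, oy = (size - w) // 2, (size - h) // 2
    canvas.paste(img.resize((w, h), Image.BICUBIC), (ox, oy))

    x = np.asarray(canvas, dtype=np.float32) / 255.0
    x = (x - 0.5) / 0.5                                       # (x/255 - mean) / std
    x = x.transpose(2, 0, 1)[None]                            # HWC -> BCHW float32

    mask = np.ones((1, size, size), dtype=bool)               # True = padded pixel
    mask[:, oy:oy + h, ox:ox + w] = False                     # False = valid pixel
    return x, mask
```
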
V1.1_safetensors/selected_tags.csv ADDED
The diff for this file is too large to render. See raw diff
 
V1.1_safetensors/vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
config.json CHANGED
@@ -1,9 +1,9 @@
 {
   "model_name": "OppaiOracle",
-  "version": "V1",
+  "version": "V1.1",
+  "released_versions": ["V1", "V1.1"],
   "architecture_type": "vit",
   "num_labels": 19294,
-  "image_size": 320,
   "patch_size": 16,
   "hidden_size": 1024,
   "num_hidden_layers": 18,
@@ -12,16 +12,34 @@
   "variants": {
     "V1_safetensors": {
       "format": "safetensors",
+      "image_size": 320,
       "config": "V1_safetensors/config.json",
       "weights": "V1_safetensors/model.safetensors"
     },
     "V1_onnx": {
       "format": "onnx",
+      "image_size": 320,
       "config": "V1_onnx/config.json",
       "weights": "V1_onnx/model.onnx"
-    }
+    },
+    "V1.1_safetensors": {
+      "format": "safetensors",
+      "image_size": 448,
+      "dtype": "bfloat16",
+      "config": "V1.1_safetensors/config.json",
+      "weights": "V1.1_safetensors/model.safetensors"
+    },
+    "V1.1_onnx": {
+      "format": "onnx",
+      "image_size": 448,
+      "config": "V1.1_onnx/config.json",
+      "weights": "V1.1_onnx/model.onnx"
+    }
   },
   "vocab_format_version": 1,
-  "vocab_sha256": "b9f95e88fb7e30669077bb761e9a66642ec526c1e10d65336a2a2b628141199d",
-  "checkpoint_source": "experiments/run1_vit/checkpoints/last.pt (epoch 33)"
+  "vocab_sha256": "ad3c33d3b760bd0d15bd4631f441d47fcb136c7a6e53473b5588d760907b0316",
+  "checkpoint_sources": {
+    "V1": "experiments/run1_vit/checkpoints/last.pt (epoch 33)",
+    "V1.1": "experiments/run1_vit/checkpoints/last.pt (epoch 7, step 85517)"
+  }
 }
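
With the variant table above, ONNX inference is a path lookup plus a sigmoid. Input names and any extra inputs (e.g., a padding mask) are not documented in this card, so this sketch inspects the session before feeding it:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("V1.1_onnx/model.onnx", providers=["CPUExecutionProvider"])
print([(i.name, i.shape) for i in sess.get_inputs()])  # verify inputs first

x = np.zeros((1, 3, 448, 448), dtype=np.float32)  # letterboxed + normalized image
logits = sess.run(None, {sess.get_inputs()[0].name: x})[0]
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, per preprocessing.json
```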