Anime Classifiers

Training/inference code | Live Demo

These models predict whether a given concept is present in an image. Performance on high-resolution images isn't very good, especially for subtle image effects such as noise, because CLIP operates at a fairly low input resolution (336x336 or 224x224).

To combat this, tiling is used at inference time. The input image is first downscaled so that its shortest edge is 1536 pixels (see TF.functional.resize), then five separate 512x512 areas are selected (four corners plus center, see TF.functional.five_crop). This helps because the downscale factor is far less drastic than when the entire image is passed to CLIP. As a bonus, it avoids the cropping or letterboxing that odd aspect ratios would otherwise require.
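The tiling geometry can be sketched in plain Python. This is only an illustration of the arithmetic, not the repository's inference code: the function names `resized_size` and `five_crop_boxes` are hypothetical, and the rounding rules are an assumption based on how torchvision's resize and five_crop are understood to behave.

```python
def resized_size(width, height, short_edge=1536):
    """Scale (width, height) so the shortest edge equals short_edge.

    The long edge is truncated to an integer (assumed rounding rule).
    """
    if width <= height:
        return short_edge, int(short_edge * height / width)
    return int(short_edge * width / height), short_edge


def five_crop_boxes(width, height, tile=512):
    """Return five (left, top, right, bottom) boxes.

    Order: top-left, top-right, bottom-left, bottom-right, center.
    """
    if width < tile or height < tile:
        raise ValueError("image is smaller than the requested tile size")
    center_x = round((width - tile) / 2)
    center_y = round((height - tile) / 2)
    origins = [
        (0, 0),
        (width - tile, 0),
        (0, height - tile),
        (width - tile, height - tile),
        (center_x, center_y),
    ]
    return [(x, y, x + tile, y + tile) for x, y in origins]


# Example: a 3000x4000 portrait image.
w, h = resized_size(3000, 4000)
boxes = five_crop_boxes(w, h)
```

For the 3000x4000 example, the image is resized to 1536x2048 and each 512x512 tile then covers a third of the width, so a tile is shrunk far less on its way into CLIP than the full image would be.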

Training is detailed in the sections below for the individual classifiers. At first, specialized models will be trained to a relatively high accuracy, building up a high-quality but specific dataset in the process.

Then, these models will be used to split/sort each other's datasets. The code will need to be updated to support one image being part of more than one class, but the final result should be a clean dataset where each target aspect acts as a "tag" rather than a class.

Architecture

The base model itself is fairly simple. It takes embeddings from a CLIP model (in this case, openai/clip-vit-large-patch14) and expands them to 1024 dimensions. From there, a single block with residuals is followed by a few linear layers which converge down to the final output.

For the classifier models, the final output goes through nn.Softmax.
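A minimal PyTorch sketch of this layout is shown below. Only the overall shape (expansion to 1024, one residual block, linear layers converging to the output, nn.Softmax) comes from the description above. Everything else is an assumption made for illustration: the class names `ResBlock` and `ClassifierSketch`, the 768-wide input (assumed to be the projected image embedding of openai/clip-vit-large-patch14), the 256 and 64 intermediate widths, and the ReLU activations.

```python
import torch
from torch import nn


class ResBlock(nn.Module):
    """One residual block: the input is added back to the MLP output."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


class ClassifierSketch(nn.Module):
    """Hypothetical head operating on precomputed CLIP image embeddings."""

    def __init__(self, embed_dim: int = 768, hidden: int = 1024, num_classes: int = 2):
        super().__init__()
        self.up = nn.Linear(embed_dim, hidden)  # expand to 1024 dimensions
        self.block = ResBlock(hidden)           # single block with a residual
        self.down = nn.Sequential(              # converge to the final output
            nn.Linear(hidden, 256),             # 256 and 64 are assumed widths
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        x = self.block(self.up(emb))
        return self.softmax(self.down(x))


# One embedding per tile: a batch of 5 stands in for the five crops.
model = ClassifierSketch().eval()
with torch.no_grad():
    probs = model(torch.randn(5, 768))
```

Each row of `probs` sums to 1, giving one class distribution per tile. How the per-tile outputs are combined into a single score is not specified here.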

Models

Chromatic Aberration - Anime

Design goals

The goal was to detect chromatic aberration in images.

For some odd reason, this has become a popular post-processing effect to apply to images and drawings. While attempting to train an ESRGAN model, I noticed an odd halo around images and quickly figured out that this effect was the cause. This classifier aims to work as a base filter to remove such images from the dataset.

Issues

Seems to get confused by excessive HSV noise
Triggers even if the effect is only applied to the background
Sometimes triggers on rough linework/sketches (i.e. multiple semi-transparent lines overlapping)
Low accuracy on 3D/2.5D with possible false positives.

Training

The training settings can be found in the config/CCAnime-ChromaticAberration-v1.yaml file (7e-6 LR, cosine scheduler, 100K steps).
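To give a feel for what a cosine schedule does with these numbers, here is a small sketch of plain cosine annealing. It assumes no warmup and a decay all the way to zero. Neither detail is stated above, so the YAML config remains the authoritative source, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(step, total_steps=100_000, base_lr=7e-6, min_lr=0.0):
    """Learning rate at a given step under plain cosine annealing."""
    # Clamp so out-of-range steps map onto the ends of the schedule.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 25_000, 50_000, 75_000, 100_000):
    print(f"step {step:>7}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the rate starts at 7e-6, passes through half of that at the midpoint, and reaches zero at step 100K.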

Final dataset score distribution for v1.16:

3215 images in dataset.
0_reg       -  395 ||||
0_reg_booru - 1805 ||||||||||||||||||||||
1_chroma    -  515 ||||||
1_synthetic -  500 ||||||

Class ratios:
00 - 2200 |||||||||||||||||||||||||||
01 - 1015 ||||||||||||
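The class ratios are consistent with each folder's numeric prefix being its class label. The snippet below reproduces the totals from the per-folder counts listed above. The helper name `class_counts` is hypothetical, and the prefix-to-class mapping is inferred from the numbers rather than taken from the training code.

```python
from collections import Counter

# Per-folder image counts copied from the v1.16 distribution above.
folder_counts = {
    "0_reg": 395,
    "0_reg_booru": 1805,
    "1_chroma": 515,
    "1_synthetic": 500,
}


def class_counts(counts):
    """Sum folder sizes by the integer prefix before the first underscore."""
    totals = Counter()
    for name, n in counts.items():
        class_id = int(name.split("_", 1)[0])
        totals[class_id] += n
    return dict(sorted(totals.items()))


print(class_counts(folder_counts))  # {0: 2200, 1: 1015}
```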

Version history:

v1.0 - Initial test model, dataset is fully synthetic (500 images). Effect added by shifting red/blue channel by a random amount using chaiNNer.
v1.1 - Added 300 images tagged "chromatic_aberration" from gelbooru. Added the first 1000 images from danbooru2021 as reg images.
v1.2 - Used the newly trained predictor to filter the existing datasets - found ~70 positives in the reg set and ~30 false positives in the target set.
v1.3-v1.16 - Repeatedly ran predictor against various datasets, adding false positives/negatives back into the dataset, sometimes running against the training set to filter out misclassified images as the predictor got better. Added/removed images were manually checked (My eyes hurt).
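The synthetic effect from v1.0 (shifting the red/blue channels by a random amount) can be illustrated without any image library. This is not what chaiNNer does internally. It is a toy sketch under assumed details: the shift is horizontal only, red moves right while blue moves left, edge pixels are repeated, and the 1 to 4 pixel range is made up. Both function names are hypothetical.

```python
import random


def shift_channels(pixels, dx):
    """Shift red right and blue left by dx pixels; green stays in place.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    Where the shift runs off the row, the edge pixel is repeated.
    """
    out = []
    for row in pixels:
        width = len(row)
        new_row = []
        for x in range(width):
            r = row[min(max(x - dx, 0), width - 1)][0]
            g = row[x][1]
            b = row[min(max(x + dx, 0), width - 1)][2]
            new_row.append((r, g, b))
        out.append(new_row)
    return out


def random_aberration(pixels, max_shift=4, rng=random):
    """Apply the shift with a random offset in [1, max_shift]."""
    dx = rng.randint(1, max_shift)
    return shift_channels(pixels, dx), dx


# A single white pixel on black splits into separate color fringes.
row = [(0, 0, 0)] * 3 + [(255, 255, 255)] + [(0, 0, 0)] * 3
shifted = shift_channels([row], 1)[0]
```

In the example, the white pixel at index 3 keeps only its green component, while its red lands one pixel to the right and its blue one pixel to the left, which is the colored fringing the classifier is meant to detect.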

Image Compression - Anime

Design goals

The goal was to detect compression artifacts in images.

This seems like the next logical step in dataset filtering. The flagged images can either be cleaned up or tagged correctly so the resulting network won't inherit the image artifacts.

Issues

Low accuracy on 3D/2.5D with possible false positives.

Training

The training settings can be found in the config/CCAnime-Compression-v1.yaml file (2.7e-6 LR, cosine scheduler, 40K steps).

The eval loss only uses a single image for each target class, hence the questionable nature of the graph.

Final dataset score distribution for v1.5:

22736 images in dataset.
0_fpl      -  108
0_reg_aes  -  142
0_reg_gel  - 7445 |||||||||||||
1_aes_jpg  -  103
1_fpl      -    8
1_syn_gel  - 7445 |||||||||||||
1_syn_jpg  -   40
2_syn_gel  - 7445 |||||||||||||
2_syn_webp -    0

Class ratios:
00 - 7695 |||||||||||||
01 - 7596 |||||||||||||
02 - 7445 |||||||||||||

Version history:

v1.0 - Initial test model, dataset consists of 40 hand-picked images and their JPEG-compressed counterparts. Compression is done with chaiNNer, compression rate is randomized.
v1.1 - Added more images by re-filtering the input dataset using the v1 model, keeping only the top/bottom 10%.
v1.2 - Used the newly trained predictor to filter the existing datasets - found ~70 positives in the reg set and ~30 false positives in the target set.
v1.3 - Scraped ~7500 images from gelbooru, filtering for an image size of at least 3000 and a file size larger than 8MB. Compressed using chaiNNer as before.
v1.4 - Added WebP compression to the list, decided against adding GIF/dithering since it's rarely used nowadays.
v1.5 - Changed LR/step count to better match larger dataset. Added false positives/negatives from v1.4.
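The "top/bottom 10%" step from v1.1 amounts to ranking images by predicted score and keeping only the two confident ends. A minimal sketch, assuming the scores are available as a filename-to-score dict (the `keep_extremes` helper and the example filenames are hypothetical):

```python
def keep_extremes(scores, fraction=0.10):
    """Return (lowest, highest) filename lists, each holding `fraction` of the set.

    scores maps filename -> predicted score. At least one item is kept per end;
    with very small inputs the two lists can overlap.
    """
    ranked = sorted(scores, key=scores.get)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Twenty made-up predictions, evenly spread between 0.0 and 0.95.
scores = {f"img{i:02d}.png": i / 20 for i in range(20)}
clean, compressed = keep_extremes(scores)
print(clean)       # ['img00.png', 'img01.png']
print(compressed)  # ['img18.png', 'img19.png']
```

Dropping the uncertain middle keeps the early, weak model from polluting the dataset with images it cannot classify confidently. The extremes can then be checked by hand.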

city96/AnimeClassifiers
