---
license: apache-2.0
---

# Anime Classifiers

[Training/inference code](https://github.com/city96/CityClassifiers) | [Live Demo](https://huggingface.co/spaces/city96/AnimeClassifiers-demo)


These models predict whether a concept is present in an image. Performance on high resolution images isn't very good, especially when detecting subtle image effects such as noise, since CLIP operates at a fairly low input resolution (336x336 or 224x224).

To combat this, tiling is used at inference time. The input image is first downscaled so its shortest edge is 1536 pixels (see `TF.functional.resize`), then 5 separate 512x512 areas are selected (4 corners + center - see `TF.functional.five_crop`). This helps because the downscale factor is far less drastic than when passing the entire image to CLIP. As a bonus, it also avoids the issues with odd aspect ratios, which would otherwise require cropping or letterboxing.

![Tiling](https://github.com/city96/CityClassifiers/assets/125218114/66a30048-93ce-4c00-befc-0d986c84ec9f)
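
A minimal sketch of this tiling step, assuming `torchvision` and Pillow (how the five tile predictions are combined is not specified on this card, so the averaging comment below is an assumption - see the linked repo for the actual implementation):

```python
import torchvision.transforms.functional as TF
from PIL import Image

def tile_image(image: Image.Image, short_edge: int = 1536, tile: int = 512):
    """Downscale so the shortest edge is `short_edge` pixels, then take
    the four corner crops plus the center crop."""
    image = TF.resize(image, short_edge)
    return TF.five_crop(image, tile)  # tuple of five 512x512 PIL images

# Each tile is embedded with CLIP separately; the five per-tile
# predictions can then be combined (e.g. averaged - an assumption,
# not confirmed by this card).
```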

Training is detailed in the sections below for the individual classifiers. At first, specialized models will be trained to a relatively high accuracy, building up a high-quality but specific dataset in the process.

Then, these models will be used to split/sort each other's datasets. The code will need to be updated to support one image belonging to more than one class, but the final result should be a clean dataset where each target aspect acts as a "tag" rather than a class.

## Architecture

The base model itself is fairly simple. It takes embeddings from a CLIP model (in this case, `openai/clip-vit-large-patch14`) and expands them to 1024 dimensions. From there, a single block with residuals is followed by a few linear layers which converge down to the final output.

For the classifier models, the final output goes through `nn.Softmax`.
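
A rough PyTorch sketch of this architecture; the exact layer widths, activations, and block layout here are assumptions (see the linked training repo for the real definition):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Single block with residuals over the hidden dimension (sketch)."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.block(x)

class CCAnimeClassifier(nn.Module):
    """Hypothetical reconstruction: CLIP embedding -> 1024-dim hidden
    -> residual block -> converging linear layers -> softmax."""
    def __init__(self, clip_dim: int = 768, hidden: int = 1024, num_classes: int = 2):
        super().__init__()
        self.up = nn.Linear(clip_dim, hidden)   # expand CLIP embedding to 1024
        self.res = ResBlock(hidden)             # single block with residuals
        self.head = nn.Sequential(              # converge down to the output
            nn.Linear(hidden, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
            nn.Softmax(dim=-1),                 # classifier output
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(self.res(self.up(emb)))
```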

# Models

## Chromatic Aberration - Anime

### Design goals

The goal was to detect [chromatic aberration](https://en.wikipedia.org/wiki/Chromatic_aberration?useskin=vector) in images.

For some odd reason, this effect has become a popular post-processing effect to apply to images and drawings. While attempting to train an ESRGAN model, I noticed an odd halo around images and quickly figured out that this effect was the cause. This classifier aims to work as a base filter to remove such images from the dataset.

### Issues

- Seems to get confused by excessive HSV noise
- Triggers even if the effect is only applied to the background
- Sometimes triggers on rough linework/sketches (i.e. multiple semi-transparent lines overlapping)
- Low accuracy on 3D/2.5D with possible false positives.

### Training

The training settings can be found in the `config/CCAnime-ChromaticAberration-v1.yaml` file (7e-6 LR, cosine scheduler, 100K steps).

![loss](https://github.com/city96/CityClassifiers/assets/125218114/475f1241-2b4e-4fc9-bbcd-261b85b8b491)

![loss-eval](https://github.com/city96/CityClassifiers/assets/125218114/88d6f090-aa6f-42ad-9fd0-8c5d267fce5e)


Final dataset score distribution for v1.16:
```
3215 images in dataset.
0_reg       -  395 ||||
0_reg_booru - 1805 ||||||||||||||||||||||
1_chroma    -  515 ||||||
1_synthetic -  500 ||||||

Class ratios:
00 - 2200 |||||||||||||||||||||||||||
01 - 1015 ||||||||||||
```

Version history:

- v1.0 - Initial test model, dataset is fully synthetic (500 images). The effect was added by shifting the red/blue channels by a random amount using chaiNNer (see the sketch after this list).
- v1.1 - Added 300 images tagged "chromatic_aberration" from gelbooru. Added the first 1000 images from danbooru2021 as reg images.
- v1.2 - Used the newly trained predictor to filter the existing datasets - found ~70 positives in the reg set and ~30 false positives in the target set.
- v1.3-v1.16 - Repeatedly ran the predictor against various datasets, adding false positives/negatives back into the dataset, sometimes running against the training set to filter out misclassified images as the predictor got better. Added/removed images were manually checked (my eyes hurt).
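
A minimal sketch of this kind of synthetic channel-shift effect, assuming NumPy/Pillow (the actual set was generated with chaiNNer, so treat this as an approximation, not the exact pipeline):

```python
import numpy as np
from PIL import Image

def add_chromatic_aberration(img: Image.Image, max_shift: int = 4) -> Image.Image:
    """Shift the red and blue channels horizontally in opposite
    directions by a random offset, leaving green in place."""
    arr = np.array(img.convert("RGB"))
    dx = np.random.randint(1, max_shift + 1)
    arr[..., 0] = np.roll(arr[..., 0], dx, axis=1)   # red channel right
    arr[..., 2] = np.roll(arr[..., 2], -dx, axis=1)  # blue channel left
    return Image.fromarray(arr)
```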

## Image Compression - Anime

### Design goals

The goal was to detect [compression artifacts](https://en.wikipedia.org/wiki/Compression_artifact?useskin=vector) in images.

This seems like the next logical step in dataset filtering. The flagged images can either be cleaned up or tagged correctly so the resulting network won't inherit the image artifacts.

### Issues

- Low accuracy on 3D/2.5D with possible false positives.

### Training

The training settings can be found in the `config/CCAnime-Compression-v1.yaml` file (2.7e-6 LR, cosine scheduler, 40K steps).

![loss](https://github.com/city96/CityClassifiers/assets/125218114/9d0294bf-81ee-4b30-89ae-3b1aca27788e)

The eval loss only uses a single image for each target class, hence the questionable nature of the graph.

![loss-eval](https://github.com/city96/CityClassifiers/assets/125218114/77c9882f-6263-4926-b3ee-a032ef7784ea)


Final dataset score distribution for v1.5:
```
22736 images in dataset.
0_fpl      -  108
0_reg_aes  -  142
0_reg_gel  - 7445 |||||||||||||
1_aes_jpg  -  103
1_fpl      -    8
1_syn_gel  - 7445 |||||||||||||
1_syn_jpg  -   40
2_syn_gel  - 7445 |||||||||||||
2_syn_webp -    0

Class ratios:
00 - 7695 |||||||||||||
01 - 7596 |||||||||||||
02 - 7445 |||||||||||||
```

Version history:

- v1.0 - Initial test model, dataset consists of 40 hand-picked images and their JPEG-compressed counterparts. Compression is done with chaiNNer, with a randomized compression rate (see the sketch after this list).
- v1.1 - Added more images by re-filtering the input dataset using the v1 model, keeping only the top/bottom 10%.
- v1.2 - Used the newly trained predictor to filter the existing datasets - found ~70 positives in the reg set and ~30 false positives in the target set.
- v1.3 - Scraped ~7500 images from gelbooru, filtering for a minimum image dimension of 3000 pixels and a file size larger than 8MB. Compressed using chaiNNer as before.
- v1.4 - Added WebP compression to the list; decided against adding GIF/dithering since it's rarely used nowadays.
- v1.5 - Changed LR/step count to better match the larger dataset. Added false positives/negatives from v1.4.
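
A minimal sketch of the randomized re-encode step, assuming Pillow (the actual dataset used chaiNNer; the quality range here is an assumption):

```python
import io
import random
from PIL import Image

def random_jpeg_compress(img: Image.Image, q_min: int = 30, q_max: int = 95) -> Image.Image:
    """Re-encode at a random JPEG quality to synthesize compression
    artifacts; swap format="WEBP" for the WebP class."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(q_min, q_max))
    buf.seek(0)
    return Image.open(buf)
```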