city96 committed
Commit 2a67449
1 Parent(s): e86c4fb

Update README.md

Files changed (1): README.md +83 -0
---
license: apache-2.0
---

![Logo](https://github.com/city96/CityClassifiers/assets/125218114/0413003a-851d-42fc-b795-eae525b7b2e5)

Experimental, opinionated aesthetic score models.

## CityAesthetics - Anime

[Training/inference code](https://github.com/city96/CityClassifiers) | [Live Demo](https://huggingface.co/spaces/city96/CityAesthetics-demo)

### Design goals

The goal was to create an aesthetic predictor that works well on one specific type of image (in this case, anime) while filtering out everything else. To achieve this, the model was trained on a set of 3080 hand-scored images over multiple refinement steps, where false positives and negatives were added back to the training set with corrected scores after each test run.

This model focuses on producing as few false positives as possible. Restricting it to a single type of media seems to help with this, as predictors that attempt to handle both real-life and 2D images tend to produce false positives. For a mixed dataset containing both types of images, the simplest solution would be to use two separate aesthetic score models plus a classifier that picks the appropriate one for each image (see the sketch below).
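
Purely as an illustration of that routing idea (not code from this repo), it could be as simple as the following; `is_anime_image`, `score_anime` and `score_photo` are hypothetical stand-ins for a binary domain classifier and two separately trained aesthetic models.

```python
# Hypothetical routing between two domain-specific aesthetic models.
# is_anime_image(), score_anime() and score_photo() are placeholders for
# a binary anime/photo classifier and two separately trained predictors.
def aesthetic_score(image) -> float:
    if is_anime_image(image):       # decide the domain first
        return score_anime(image)   # anime-only model (like this one)
    return score_photo(image)       # separate model trained on photos
```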

#### Intentional biases

- Completely negative towards real-life photos (ideal score of 0%)
- Strongly negative towards text (subtitles, memes, etc.) and manga panels
- Fairly negative towards 3D and, to some extent, 2.5D images
- Negative towards western cartoons and stylized images (chibi, parody)

#### Issues

- Tends to filter out male characters, as they are underrepresented in the training set
- Requires at least one subject to be present in the image - doesn't work for scenery/landscapes
- Noticeable positive bias towards anime characters with animal ears
- Hit-or-miss with AI-generated images, since style and quality are not correlated for them

#### Out-of-scope

- This model is not meant for moderation/live filtering/etc.
- The demo code is not meant for large-scale datasets and is therefore single-threaded. If you're working on something that requires an optimized version operating on pre-computed CLIP embeddings for faster iteration (see the sketch below), feel free to [contact me](mailto:city@eruruu.net).
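
For reference, pre-computing the embeddings might look roughly like the sketch below. This is my own assumption rather than the repo's code; it just uses the `transformers` CLIP classes for the same `openai/clip-vit-large-patch14` model and stores the image embeddings so the small score head can be iterated on without re-encoding every image.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed_batches(paths, batch_size=32):
    # Yield CLIP image embeddings for a list of image paths, batch by batch.
    for i in range(0, len(paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = proc(images=images, return_tensors="pt").to(device)
        yield clip.get_image_features(**inputs).cpu()  # (batch, 768) for ViT-L/14

# Example: torch.save(torch.cat(list(embed_batches(paths))), "clip_embeds.pt")
```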

### Use cases

The main use case is to provide baseline filtering on large datasets (i.e. acting as a high-pass filter). For this, the score brackets were decided as follows:

- <10% - Real-life photos, noise, excessive text (subtitles, memes, etc.)
- 10-20% - Manga panels, images with no subject, non-human subjects
- 20-40% - Sketches, oekaki, rough lineart (score depends on quality)
- 40-50% - Flat shading, TV anime screenshots, average images
- \>50% - "High quality" images based on my personal style preferences

The \>60% score range is intended to help pick out the "best" images from a dataset. One could also filter by score here (i.e. use it as a band-pass filter), but the scores above 50% are a lot more vague. Instead, I'd recommend sorting the dataset by score and setting a limit on the total number of images to select, as in the sketch below.
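
As a rough illustration of that workflow (my own sketch, not code from this repo; `predict_score` is a placeholder for whatever runs the model on a single image):

```python
def select_top(paths, n=100, floor=0.5):
    # Score everything, optionally drop anything below a high-pass floor,
    # then keep only the N best-scoring images instead of trusting a hard cutoff.
    scored = [(predict_score(p), p) for p in paths]
    scored = [(s, p) for s, p in scored if s >= floor]
    scored.sort(reverse=True)  # best first
    return [p for _, p in scored[:n]]
```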

Top 100 images from a subset of danbooru2021:

![AesPredv17_T100C](https://github.com/city96/CityClassifiers/assets/125218114/b7d8a167-a53a-46bb-8737-6c6c2a04f50f)

### Training

The provided training script is initialized with the current model settings as its defaults (7e-6 LR, cosine scheduler, 100K steps).
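
In PyTorch terms, those defaults roughly correspond to something like the following. This is a sketch under my own assumptions (including the choice of AdamW); the actual script may wire things up differently.

```python
import torch

steps = 100_000  # total optimizer steps
optimizer = torch.optim.AdamW(model.parameters(), lr=7e-6)  # `model` is the score head
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)

for step in range(steps):
    loss = compute_loss(model, next(batches))  # placeholder data/loss step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```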

![loss](https://github.com/city96/CityClassifiers/assets/125218114/611ae144-1390-48d3-988d-59a03c4a2f26)

Final dataset score distribution for v1.8:
```
3080 images in dataset.
0 - 31 |
1 - 162 |||||
2 - 533 |||||||||||||||||
3 - 675 |||||||||||||||||||||
4 - 690 ||||||||||||||||||||||
5 - 576 ||||||||||||||||||
6 - 228 |||||||
7 - 95 |||
8 - 54 |
9 - 29
10 - 7
raw - 0
```

Version history:

- v1.0 - Initial test model with ~150 images to test viability
- v1.1 - Initialized the top 5 score brackets with ~250 hand-picked images
- v1.2 - Manually scored ~2500 danbooru images for the main training set
- v1.3-v1.7 - Repeatedly ran the model against various datasets, adding the false negatives/positives to the training set to correct for various edge cases
- v1.8 - Added 3D and 2.5D images to the negative brackets to filter these as well

### Architecture

The model itself is fairly simple. It takes embeddings from a CLIP model (in this case, `openai/clip-vit-large-patch14`) and expands them to 1024 dimensions. From there, a single block with residuals is followed by a few linear layers that converge down to the final output - a single float between 0.0 and 1.0. A rough sketch of this is shown below.
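
A minimal sketch of what such a head could look like in PyTorch follows. The 768-dim input matches ViT-L/14 image embeddings and the 1024 hidden size is stated above; the exact layer counts and activations are my guesses, not the repo's definition.

```python
import torch.nn as nn

class AestheticHead(nn.Module):
    """Rough sketch: CLIP embedding -> 1024 dims -> residual block -> scalar score."""
    def __init__(self, clip_dim=768, hidden=1024):
        super().__init__()
        self.up = nn.Linear(clip_dim, hidden)   # expand the CLIP embedding to 1024
        self.block = nn.Sequential(             # single block, used with a residual connection
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.down = nn.Sequential(              # converge down to the final output
            nn.Linear(hidden, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),    # single float between 0.0 and 1.0
        )

    def forward(self, emb):
        x = self.up(emb)
        x = x + self.block(x)                   # residual connection
        return self.down(x).squeeze(-1)
```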