gustproof committed
Commit ae87087
1 Parent(s): 4293edb

Create eval_basics.md

# Evaluating Anime Models Systematically - Basics

I was trying to refine my character models when I realized that the way I've been making models is really inefficient.
It typically goes like this: tweak some configs or data, try some random prompts, and see if the results look okay.
It would be helpful to establish a well-defined procedure.
It then becomes apparent that to evaluate fine-tuned models, knowing and quantifying how the base models perform as a baseline is essential.
So here I am, trying to evaluate base models.

I collected 1000 random prompts from Danbooru posts from 2021-2022 with the query `chartags:0 -is:child -rating:e,q order:random score:>=10 filetype:jpg,png,webp ratio:0.45..2.1`
and generated 1000 640x640 images with them for each of 3 widely-used anime models:
[animefull-latest](https://huggingface.co/deepghs/animefull-latest),
[Counterfeit-V3.0](https://civitai.com/models/4468?modelVersionId=57618), [MeinaMix_V11](https://huggingface.co/Meina/MeinaMix_V11).
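
For reference, here is a minimal sketch of what such a generation loop could look like with diffusers. The model paths and output layout are my assumptions (the Civitai checkpoint in particular would need converting to the diffusers format), not necessarily the exact setup used for these images.

```python
from pathlib import Path
from diffusers import StableDiffusionPipeline

# Assumed model identifiers/paths; checkpoints not already in diffusers
# format would need to be converted first.
models = {
    "animefull-latest": "deepghs/animefull-latest",
    "Counterfeit-V3.0": "./Counterfeit-V3.0-diffusers",
    "MeinaMix_V11": "Meina/MeinaMix_V11",
}

# The 1000 booru-tag prompts, one per line
prompts = [line.strip() for line in open("prompts.txt")]

for name, path in models.items():
    pipe = StableDiffusionPipeline.from_pretrained(path).to("cuda")
    pipe.safety_checker = None  # disabled, as noted in the miscellaneous notes
    out_dir = Path("outputs") / name
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, prompt in enumerate(prompts):
        image = pipe(prompt, width=640, height=640).images[0]
        image.save(out_dir / f"{i:04d}.png")
```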

A model can be evaluated over a number of aspects: fidelity, text-image alignment, aesthetics, diversity. Let's go through them one by one.

## Fidelity

Generated images should be indistinguishable from real ones. They should make sense and not contain obvious errors such as extra limbs, mutated fingers, glitches or random blobs.
In the literature, it's common to use metrics based on distribution distance, such as FID and IS. I calculated the KID score of the 3 sets of generated images against the 1000 real images.
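
A minimal sketch of such a KID computation using `torchmetrics` (my choice of library; the subset size and image loading are assumptions, not necessarily the exact setup used here):

```python
import torch
from pathlib import Path
from torchmetrics.image.kid import KernelInceptionDistance
from torchvision.io import ImageReadMode, read_image
from torchvision.transforms.functional import resize

def load_images(folder):
    # uint8 tensors of shape (N, 3, 299, 299), as KID expects by default
    imgs = [resize(read_image(str(p), mode=ImageReadMode.RGB), [299, 299], antialias=True)
            for p in sorted(Path(folder).glob("*.png"))]
    return torch.stack(imgs)

kid = KernelInceptionDistance(subset_size=100)
kid.update(load_images("real_images"), real=True)
kid.update(load_images("outputs/animefull-latest"), real=False)
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean:.5f} ± {kid_std:.5f}")
```

`torchmetrics` reports KID as a mean and standard deviation over random subsets, which is why `subset_size` matters.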

| model | KID (lower is better) |
|---|---|
| animefull-latest | 0.01192 |
| Counterfeit-V3.0 | 0.01807 |
| MeinaMix_V11 | 0.01345 |

It seems that KID does not align with human evaluation, which would generally rate animefull-latest as the worst one.
This is somewhat expected, since models with a strong style have a different image feature distribution than random real images.

I also tried multimodal LLMs, including GPT-4V and LLaVA, and unfortunately found them quite useless. GPT-4V is supposedly SOTA, but it clearly struggles to spot generation errors.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e3c06a99a032b1c95226a9/bAtPwnb0TMrKZrneQbxvi.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e3c06a99a032b1c95226a9/mPrbphtOxfNtK1ClhEgGV.png)

So currently I can't find a process that computes a fidelity score for anime models. For now, I'll have to wait for someone to train a specialized model.

## Text-Image Alignment

Generated images should not contradict the text prompts. A popular metric is the CLIP score, which is the cosine similarity of the projected CLIP embeddings.
There's also [PickScore_v1](https://huggingface.co/yuvalkirstain/PickScore_v1), which is fine-tuned on human preference data.
These are not well-suited for anime models because Booru-tag prompts are very different from the natural-language captions of regular images.
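
For completeness, a sketch of the plain CLIP score (cosine similarity between the projected image and text embeddings), here using the `openai/clip-vit-base-patch32` checkpoint as an example:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    # Cosine similarity of the normalized, projected image/text embeddings
    inputs = processor(text=[prompt], images=image, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```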

Models using booru-tag prompts can be evaluated with a tagger instead. Specifically, I used [wd-v1-4-moat-tagger-v2](https://huggingface.co/SmilingWolf/wd-v1-4-moat-tagger-v2) with a threshold of 0.35.
A tag accuracy score can be defined as `#{prompted tags correctly reproduced}/#{prompted tags}`. The accuracy is macro-averaged over all images. Here are the scores:

| model | tag accuracy (higher is better) |
|---|---|
| animefull-latest | 0.464328 |
| Counterfeit-V3.0 | 0.434574 |
| MeinaMix_V11 | 0.375389 |

It can be seen that fine-tunes or merges may produce nicer images, but at the cost of controllability.
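
A minimal sketch of the tag-accuracy metric as defined above. `predict_tags` is a hypothetical helper that wraps the tagger and returns the set of tags scoring above the 0.35 threshold:

```python
def tag_accuracy(prompted_tags: set[str], predicted_tags: set[str]) -> float:
    # Fraction of the prompted tags that the tagger recovers from the generated image
    return len(prompted_tags & predicted_tags) / len(prompted_tags)

def macro_tag_accuracy(prompt_tag_sets, images) -> float:
    # Macro-average: compute the per-image accuracy, then average over all images
    scores = [tag_accuracy(tags, predict_tags(image, threshold=0.35))  # predict_tags: hypothetical tagger wrapper
              for tags, image in zip(prompt_tag_sets, images)]
    return sum(scores) / len(scores)
```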

## Aesthetics

Images should be pretty. While this is generally subjective, there are models that give an aesthetic score, either averaged from many people's preferences or personalized.
There are CLIP-based models ([aesthetic-predictor](https://github.com/LAION-AI/aesthetic-predictor), [improved-aesthetic-predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor))
and some custom models ([anime-aesthetic](https://huggingface.co/spaces/skytnt/anime-aesthetic-predict), [cafe_aesthetic](https://huggingface.co/cafeai/cafe_aesthetic)).
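
The CLIP-based predictors follow a common pattern: embed the image with CLIP, then score the embedding with a small regression head trained on human ratings. The sketch below shows only that pattern; the head architecture and the `aesthetic_head.pth` weight file are placeholders, not the actual released checkpoints (see the linked repos for those).

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder regression head and weights; the real predictors use their own
# architectures and released checkpoints.
head = nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 1))
head.load_state_dict(torch.load("aesthetic_head.pth"))

def aesthetic_score(image: Image.Image) -> float:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return float(head(emb))
```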

I computed the average improved-aesthetic-predictor and anime-aesthetic scores:

| model | improved-aesthetic-predictor (higher is better) | anime-aesthetic (higher is better) |
|---|---|---|
| animefull-latest | 6.124954 | 0.639767 |
| Counterfeit-V3.0 | 6.359464 | 0.789190 |
| MeinaMix_V11 | 6.474662 | 0.829989 |

The two scores appear to agree.

Interestingly, GPT-4V does a reasonable job at this.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e3c06a99a032b1c95226a9/vhk-IS0rZl1urd5Mqlzi9.png)

## Diversity

Even with the same prompt, given different random seeds, generated images should not be repetitive.
There's the DIV score defined in the [DreamBooth paper](https://arxiv.org/pdf/2208.12242.pdf), which calculates pairwise image similarity with LPIPS.
For this particular set of images, this metric is not applicable (there is only one image per prompt), so I will leave it to a future update.
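
Once multiple seeds per prompt are generated, a DIV-style score (average pairwise LPIPS distance among images from the same prompt) could be sketched as follows, assuming the `lpips` package:

```python
import itertools
import lpips
import torch
from torchvision.io import ImageReadMode, read_image

loss_fn = lpips.LPIPS(net="alex")

def to_lpips_input(path: str) -> torch.Tensor:
    # lpips expects float tensors in [-1, 1] with shape (1, 3, H, W)
    img = read_image(path, mode=ImageReadMode.RGB).float() / 255.0
    return (img * 2 - 1).unsqueeze(0)

def div_score(image_paths: list[str]) -> float:
    # Average pairwise LPIPS distance among images generated from one prompt
    imgs = [to_lpips_input(p) for p in image_paths]
    dists = [loss_fn(a, b).item() for a, b in itertools.combinations(imgs, 2)]
    return sum(dists) / len(dists)
```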

## Conclusions

It's possible to programmatically generate a set of scores for a base model and use them as a proxy of the model's overall performance.

## Miscellaneous notes

I used diffusers, and 13 images from animefull-latest came out as solid black for unknown reasons, even with the safety checker disabled and a single-precision VAE.
These images and their counterparts were excluded from the metric calculations.

The images and prompts can be found [here](https://huggingface.co/datasets/gustproof/sd-data/tree/main/db1k).

It's possible that some models perform better with special configs, but for simplicity I kept the settings the same.

The code for image generation and metrics is quite messy, so I will not upload it right now, but feel free to ask questions or give suggestions.

I will probably create a fidelity model eventually if no one else does, but it will take a while.

Prompts with more tags have lower tag accuracy:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e3c06a99a032b1c95226a9/-CCWbnriWllqZKa5Le36M.png)

The effect of tag position is measurable, albeit less pronounced. The trend at positions 20-25 may be due to the 77-token limit wraparound.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e3c06a99a032b1c95226a9/b0t53PqMR2oNQX23ZTw5h.png)

The next post will be about evaluating character models.