ChrisGoringe committed · verified
Commit 44338a7 · 1 Parent(s): c2987a7

Update README.md

Files changed (1)
  1. README.md +26 -18
README.md CHANGED
@@ -24,6 +24,7 @@ where N_N is the average number of bits per parameter.
 
 ## Good choices to start with
 ```
+- 3_1 is the smallest yet - might work on 6 GB?
 - 3_8 might work on a 8 GB card
 - 6_9 should be good for a 12 GB card
 - 8_2 is a good choice for 16 GB cards if you want to add LoRAs etc
@@ -33,28 +34,35 @@ where N_N is the average number of bits per parameter.
 ## Speed?
 
 On an A40 (plenty of VRAM), everything except the model identical,
-the time taken to generate an image (30 steps, deis sampler) was about 65% longer than for the full model.
+the time taken to generate an image (30 steps, deis sampler) was about 65% longer than for the full model (45s v 27s).
 
 Quantised models will generally be slower because the weights have to be converted back into a native torch form when they are needed.
 
-## How is this optimised?
+## How are these 'optimised'?
 
-The process for optimisation is as follows:
+The optimization is based on a cost metric, representing the error introduced by quantizing a specified layer with a specified quant.
+The data can be found [here](https://github.com/chrisgoringe/mixed-gguf-converter/tree/main/costs), and details of the process are below.
 
+From this, any possible quantization can be given a cost and a benefit (bits saved). The possible quantizations are then sorted from
+best (benefit/cost) to worst, and applied in order, until the required number of bits have been removed.
+
+### Calculating costs
+
+I created a database of the hidden states at the start and end of the transformer stack as follows:
 - 240 prompts used for flux images popular at civit.ai were run through the full Flux.1-dev model with randomised resolution and step count.
 - For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
-- For each layer in turn, and for each quantization:
-- A single layer was quantized
-- The initial hidden states were processed by the modified layer stack
-- The error (MSE) in the final hidden state was calculated
-- This gives a 'cost' for each possible layer quantization - how much different it is to the full model
-- An optimised quantization is one that gives the desired reduction in size for the smallest total cost
-- A series of recipies for optimization have been created from the calculated costs
-- the various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32
-
-## Also note
-
-- Tests on using bitsandbytes quantizations showed they did not perform as well as the equivalent sized GGUF quants
-- Different quantizations of different parts of a layer gave significantly worse results
-- Leaving bias in 16 bit made no relevant difference (the 'patched' models generally do)
-- Costs were evaluated for the original Flux.1-dev model. They are probably essentially the same for finetunes
+
+To calculate the cost of quantizing a specific layer to a specific quant:
+- A single layer in the transformer stack was quantized
+- The 240 initial hidden states were run through the stack
+- The cost is defined as the mean square difference between the outputs of the modified stack and the unmodified stack
+
+The cost, therefore, is a measure of how much change is introduced into the output hidden states by the quantization.
+
+## Not quantized
+
+In all these models, the 'in' blocks, the final layer blocks, and all normalization scale parameters are not quantized.
+These represent 0.54% of all parameters in the model.
+
+In patch models (where the states were quantised using llama.cpp code), the biases are also not quantized.
+These represent 0.03% of all parameters in the model.
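
For a rough sense of what the N_N naming means for file size: the sketch below only applies the bits-per-parameter arithmetic, and the ~12B parameter count used for Flux.1-dev is an approximation, not a figure stated in this README.

```python
# Rough size estimate from the N_N naming (e.g. 6_9 = 6.9 bits per parameter
# on average). The 12e9 parameter count is approximate, and GGUF header and
# metadata overhead is ignored.

def approx_size_gb(n_params: float, bits_per_param: float) -> float:
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("3_1", 3.1), ("3_8", 3.8), ("6_9", 6.9), ("8_2", 8.2)]:
    print(f"{name}: ~{approx_size_gb(12e9, bits):.1f} GB")
```

These estimates line up with the card recommendations in the list above (roughly 4.7, 5.7, 10.4 and 12.3 GB respectively).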
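The selection rule described in the new "How are these 'optimised'?" section is a greedy pick over (layer, quant) candidates ranked by bits saved per unit of error. A minimal sketch, assuming a precomputed cost table and at most one quant per layer (hypothetical names, not the converter's actual code):

```python
def choose_quantizations(candidates, bits_to_remove):
    """Greedily pick (layer, quant) options until enough bits are saved.

    candidates: list of dicts with keys
        'layer'   - layer identifier
        'quant'   - e.g. 'Q4_1'
        'cost'    - error introduced by quantizing that layer with that quant
        'benefit' - bits saved relative to the full-precision layer
    """
    # Sort from best (most bits saved per unit of error) to worst.
    ranked = sorted(candidates,
                    key=lambda c: c['benefit'] / max(c['cost'], 1e-12),
                    reverse=True)

    chosen = {}        # layer -> quant
    bits_saved = 0
    for c in ranked:
        if bits_saved >= bits_to_remove:
            break
        if c['layer'] in chosen:
            continue   # simplification: each layer gets at most one quant
        chosen[c['layer']] = c['quant']
        bits_saved += c['benefit']
    return chosen
```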
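The "Calculating costs" procedure amounts to: swap one layer for its quantized version, replay the captured input hidden states through the stack, and compare against the captured outputs. A rough sketch, assuming the states were captured beforehand; the Flux block call signature is simplified to a single hidden-state input, and all names here are hypothetical:

```python
import torch

def quantization_cost(stack, layer_index, quantize_layer,
                      initial_states, reference_outputs):
    """Mean squared difference between the outputs of the stack with one
    quantized layer and the outputs of the unmodified stack."""
    original = stack[layer_index]
    stack[layer_index] = quantize_layer(original)   # quantize a single layer

    total = 0.0
    with torch.no_grad():
        for x, ref in zip(initial_states, reference_outputs):
            h = x
            for layer in stack:     # real Flux blocks take extra conditioning inputs
                h = layer(h)
            total += torch.nn.functional.mse_loss(h, ref).item()

    stack[layer_index] = original                   # restore the full-precision layer
    return total / len(initial_states)
```

Repeating this for every (layer, quant) pair produces the cost table that the greedy selection sketched above consumes.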