ChrisGoringe
committed on
Update README.md
README.md
CHANGED
@@ -24,6 +24,7 @@ where N_N is the average number of bits per parameter.
 
 ## Good choices to start with
 ```
+- 3_1 is the smallest yet - might work on 6 GB?
 - 3_8 might work on an 8 GB card
 - 6_9 should be good for a 12 GB card
 - 8_2 is a good choice for 16 GB cards if you want to add LoRAs etc
@@ -33,28 +34,35 @@
 ## Speed?
 
 On an A40 (plenty of VRAM), with everything except the model identical,
-the time taken to generate an image (30 steps, deis sampler) was about 65% longer than for the full model.
+the time taken to generate an image (30 steps, deis sampler) was about 65% longer than for the full model (45s v 27s).
 
 Quantised models will generally be slower because the weights have to be converted back into a native torch form when they are needed.
 
-## How
-
-The
+## How are these 'optimised'?
+
+The optimization is based on a cost metric representing the error introduced by quantizing a specified layer with a specified quant.
+The data can be found [here](https://github.com/chrisgoringe/mixed-gguf-converter/tree/main/costs), and details of the process are below.
+
+From this, any possible quantization can be given a cost and a benefit (bits saved). The possible quantizations are then sorted from
+best (benefit/cost) to worst, and applied in order, until the required number of bits have been removed.
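As a rough illustration of the selection step described above, here is a minimal sketch in Python. It assumes a cost table keyed by (layer, quant) such as could be built from the data in the linked repository; the function and argument names are illustrative, not the converter's actual API, and it uses the simple reading that each layer receives at most one quant.

```python
def choose_quantizations(costs, bits_per_quant, params_per_layer, full_bits, bits_to_remove):
    """Greedily pick (layer, quant) assignments, best benefit/cost first.

    costs            -- dict {(layer, quant): cost} measured as described below
    bits_per_quant   -- dict {quant: average bits per parameter}
    params_per_layer -- dict {layer: parameter count}
    full_bits        -- bits per parameter in the unquantized model (e.g. 16)
    bits_to_remove   -- total number of bits that must be saved
    """
    candidates = []
    for (layer, quant), cost in costs.items():
        # benefit = bits saved by storing this layer at the quant's width instead of full precision
        benefit = (full_bits - bits_per_quant[quant]) * params_per_layer[layer]
        candidates.append((benefit / cost, benefit, layer, quant))

    candidates.sort(key=lambda c: c[0], reverse=True)  # best benefit-per-unit-cost first

    plan, saved = {}, 0
    for _, benefit, layer, quant in candidates:
        if saved >= bits_to_remove:
            break                                      # enough bits removed
        if layer in plan:
            continue                                   # keep at most one quant per layer
        plan[layer] = quant
        saved += benefit
    return plan
```

In terms of the naming scheme above, bits_to_remove is whatever takes the average from the full 16 bits per parameter down to the N_N target.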
+
+### Calculating costs
+
+I created a database of the hidden states at the start and end of the transformer stack as follows:
 - 240 prompts used for flux images popular at civit.ai were run through the full Flux.1-dev model with randomised resolution and step count.
 - For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
+
+To calculate the cost of quantizing a specific layer to a specific quant:
+- A single layer in the transformer stack was quantized
+- The 240 initial hidden states were run through the stack
+- The cost is defined as the mean square difference between the outputs of the modified stack and the unmodified stack
+
+The cost, therefore, is a measure of how much change is introduced into the output hidden states by the quantization.
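The measurement itself reduces to an average mean-squared error over the captured pairs. A minimal sketch, assuming the captured hidden states are available as (input, reference output) tensor pairs and that the stack with one layer quantized can be called directly; none of this is the converter's actual code:

```python
import torch

def layer_quant_cost(reference_pairs, modified_stack):
    """Mean square difference between the modified stack's outputs and the
    reference outputs captured from the unmodified stack."""
    total = 0.0
    with torch.no_grad():
        for hidden_in, reference_out in reference_pairs:   # e.g. the 240 captured state pairs
            out = modified_stack(hidden_in)                # stack with one layer quantized
            total += torch.nn.functional.mse_loss(out, reference_out).item()
    return total / len(reference_pairs)
```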
+
+## Not quantized
+
+In all these models, the 'in' blocks, the final layer blocks, and all normalization scale parameters are not quantized.
+These represent 0.54% of all parameters in the model.
+
+In patch models (where the states were quantised using llama.cpp code), the biases are also not quantized.
+These represent 0.03% of all parameters in the model.