ChrisGoringe
committed on
Update README.md
README.md
CHANGED
@@ -24,6 +24,7 @@ where N_N is the average number of bits per parameter.
 
 ## Good choices to start with
 ```
+- 3_1 is the smallest yet - might work on 6 GB?
 - 3_8 might work on an 8 GB card
 - 6_9 should be good for a 12 GB card
 - 8_2 is a good choice for 16 GB cards if you want to add LoRAs etc
@@ -33,28 +34,35 @@
 ## Speed?
 
 On an A40 (plenty of VRAM), with everything except the model identical,
-the time taken to generate an image (30 steps, deis sampler) was about 65% longer than for the full model.
+the time taken to generate an image (30 steps, deis sampler) was about 65% longer than for the full model (45s v 27s).
 
 Quantised models will generally be slower because the weights have to be converted back into a native torch form when they are needed.
 
-## How
-
-The
+## How are these 'optimised'?
+
+The optimization is based on a cost metric representing the error introduced by quantizing a specified layer with a specified quant.
+The data can be found [here](https://github.com/chrisgoringe/mixed-gguf-converter/tree/main/costs), and details of the process are below.
+
+From this, any possible quantization can be given a cost and a benefit (bits saved). The possible quantizations are then sorted from
+best (benefit/cost) to worst, and applied in order, until the required number of bits have been removed.
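As a rough illustration of the selection step described above, here is a minimal sketch in Python. It assumes a cost table keyed by (layer, quant) such as could be built from the data in the linked repository; the function and argument names are illustrative, not the converter's actual API, and it uses the simple reading that each layer receives at most one quant.

```python
def choose_quantizations(costs, bits_per_quant, params_per_layer, full_bits, bits_to_remove):
    """Greedily pick (layer, quant) assignments, best benefit/cost first.

    costs            -- dict {(layer, quant): cost} measured as described below
    bits_per_quant   -- dict {quant: average bits per parameter}
    params_per_layer -- dict {layer: parameter count}
    full_bits        -- bits per parameter in the unquantized model (e.g. 16)
    bits_to_remove   -- total number of bits that must be saved
    """
    candidates = []
    for (layer, quant), cost in costs.items():
        # benefit = bits saved by storing this layer at the quant's width instead of full precision
        benefit = (full_bits - bits_per_quant[quant]) * params_per_layer[layer]
        candidates.append((benefit / cost, benefit, layer, quant))

    candidates.sort(key=lambda c: c[0], reverse=True)  # best benefit-per-unit-cost first

    plan, saved = {}, 0
    for _, benefit, layer, quant in candidates:
        if saved >= bits_to_remove:
            break                                      # enough bits removed
        if layer in plan:
            continue                                   # keep at most one quant per layer
        plan[layer] = quant
        saved += benefit
    return plan
```

In terms of the naming scheme above, bits_to_remove is whatever takes the average from the full 16 bits per parameter down to the N_N target.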
+
+### Calculating costs
+
+I created a database of the hidden states at the start and end of the transformer stack as follows:
 - 240 prompts used for flux images popular at civit.ai were run through the full Flux.1-dev model with randomised resolution and step count.
 - For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
+
+To calculate the cost of quantizing a specific layer to a specific quant:
+- A single layer in the transformer stack was quantized
+- The 240 initial hidden states were run through the stack
+- The cost is defined as the mean square difference between the outputs of the modified stack and the unmodified stack
+
+The cost, therefore, is a measure of how much change is introduced into the output hidden states by the quantization.
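The measurement itself reduces to an average mean-squared error over the captured pairs. A minimal sketch, assuming the captured hidden states are available as (input, reference output) tensor pairs and that the stack with one layer quantized can be called directly; none of this is the converter's actual code:

```python
import torch

def layer_quant_cost(reference_pairs, modified_stack):
    """Mean square difference between the modified stack's outputs and the
    reference outputs captured from the unmodified stack."""
    total = 0.0
    with torch.no_grad():
        for hidden_in, reference_out in reference_pairs:   # e.g. the 240 captured state pairs
            out = modified_stack(hidden_in)                # stack with one layer quantized
            total += torch.nn.functional.mse_loss(out, reference_out).item()
    return total / len(reference_pairs)
```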
+
+## Not quantized
+
+In all these models, the 'in' blocks, the final layer blocks, and all normalization scale parameters are not quantized.
+These represent 0.54% of all parameters in the model.
+
+In patch models (where the states were quantised using llama.cpp code), the biases are also not quantized.
+These represent 0.03% of all parameters in the model.