Why can't this model be quantized to GGUF?

#3 opened by huggingfacess

Hi, I'm very excited about this model. It's huge and has a lot of potential, but the delay in getting a GGUF version has left people confused about how to run it, and I think that's the main reason it isn't more popular.

Even when quantized, the current gate methodology chews through RAM. I'm not pleased with it and am in the process of creating a new gate method. I have a v1.1 coming out in the next couple of days, which I will release with a quantized version; it should solve the VRAM issues.

MoE: I think the most important point is that they are 32 layers, sharded into ~1B pieces, without pickle. Each expert model is treated individually, so memory is a serious problem, which also explains the loading problems later. The MoE needs to be built from 1B or 2B models to keep the size down, i.e. 3 experts plus a base is 4 models (1B shards each), so roughly 1 GB per shard. If all the models are sharded correctly (small), then when quantizing the final model the safetensors should be a decent size to quantize from FP16; Q8_0 runs very nicely. I always check that the layers are the same size across each model: if the layer sizes are different it cannot merge correctly, and the normalized layers end up as odd random numbers, making it imprecise. When you check my models you will see this. I also found that if a model chops off a layer or two, it responds badly; you need the full set of layers... maybe!
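
For what it's worth, here is a minimal sketch of that kind of layer-size check, using the safetensors library to compare tensor shapes between two checkpoints before merging. The file paths are placeholders, not the actual model files discussed here.

```python
# Minimal sketch: compare tensor shapes between two expert checkpoints
# before attempting a merge. Paths are hypothetical placeholders.
from safetensors import safe_open

def layer_shapes(path):
    """Return a dict of tensor name -> shape for one safetensors file."""
    shapes = {}
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            # get_slice reads only metadata, so large shards are not loaded into RAM
            shapes[name] = tuple(f.get_slice(name).get_shape())
    return shapes

base = layer_shapes("base-1b/model.safetensors")
expert = layer_shapes("expert-1b/model.safetensors")

# Every tensor present in both models should have the same shape;
# otherwise the merge/normalization step produces imprecise results.
for name in sorted(set(base) & set(expert)):
    if base[name] != expert[name]:
        print(f"MISMATCH {name}: {base[name]} vs {expert[name]}")
```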

Also, for loading, I think they need the Mixtral version of AutoTokenizer? My MoE loads fine in LlamaIndex, but it did not respond as a GGUF (it needed more than 9 GB of VRAM), and as safetensors it gave a blank response!
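
If the tokenizer really is the issue, a hedged sketch of pointing AutoTokenizer at a Mixtral-style tokenizer explicitly would look something like this; the repo ids are placeholders, not the actual model in this thread.

```python
# Rough sketch: load a Mixtral-style tokenizer explicitly instead of relying on
# the MoE repo's own tokenizer config. Repo ids below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("your-username/your-moe-model")

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```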

Also, I use llama.cpp to quantize.
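
For reference, the usual llama.cpp flow is to convert the safetensors checkpoint to an FP16 GGUF and then quantize it. Below is a rough sketch driving that from Python; the script and binary names vary between llama.cpp versions, and the paths are placeholders.

```python
# Rough sketch of the llama.cpp convert + quantize flow, driven from Python.
# Script/binary names differ across llama.cpp versions; paths are placeholders.
import subprocess

MODEL_DIR = "path/to/hf-moe-model"   # Hugging Face-format checkpoint
F16_GGUF = "moe-f16.gguf"
Q8_GGUF = "moe-q8_0.gguf"

# 1) Convert the HF safetensors model to an FP16 GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", MODEL_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# 2) Quantize the FP16 GGUF down to Q8_0, which runs nicely per the post above.
subprocess.run(
    ["./llama-quantize", F16_GGUF, Q8_GGUF, "Q8_0"],
    check=True,
)
```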

I have a model currently training with an entirely new method that should be far more VRAM efficient; I will then get some GGUF experts to convert it. I appreciate the thoughtful response.

Yes, I noticed that with Gork_X, despite it being split into however many chunks, all of the shards are small, so essentially it is loadable. The same goes for chopping it into a slice and fine-tuning the last of the layers, which is exactly what the super-LLMs do to make it accessible. Hence, in truth, if OpenAI released their full model it would also be unusable unless chopped and sliced. One of the merge methods gives you the opportunity to select the desired layers, i.e. 0-15, making a 3.5B model instead of 0-32 (a 7B model)!
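
As a concrete illustration of that layer-slicing idea, here is a hedged sketch that keeps only decoder layers 0-15 of a Llama/Mistral-style 7B model, roughly halving the parameter count. The model id is just an example, and the `model.model.layers` path assumes that architecture family.

```python
# Hedged sketch: keep only decoder layers 0-15 of a Llama/Mistral-style model,
# roughly halving its size. Model id and output path are examples only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example; swap in your own model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

keep = 16  # layers 0-15
model.model.layers = torch.nn.ModuleList(model.model.layers[:keep])
model.config.num_hidden_layers = keep

model.save_pretrained("sliced-3.5b")
AutoTokenizer.from_pretrained(model_id).save_pretrained("sliced-3.5b")
```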
