GPTQ

#1
by KnutJaegersberg - opened

@TheBloke I got NLLB-MoE to work with 4-bit bitsandbytes. Since we can now store the weights in 4 bit, it works; I'm uploading it now.
The weights load quickly, about 20 seconds versus 15 minutes (into 4 bit), but execution is slow.
I don't know much about this, but since you got the Mistral MoE into GPTQ, is it possible to quantize this model too? It's unique. I guess it outperforms Meta's recent Seamless translation models; it's just huge. I guess there is no support in GPTQ or GGUF.
But this is a unique resource for machine translation; its open-access weights are probably unsurpassed, even though they are already dated.

This quantization uses 37 GB of VRAM; with double quantization it is more like 35 GB. I have not even tried long translations yet, only the examples. It's slow, but it is probably the best.
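
For reference, a minimal sketch of what such a 4-bit bitsandbytes load can look like with transformers; the model id, language codes, and generation settings here are my assumptions, not taken from the actual upload:

```python
# Sketch: loading NLLB-MoE in 4 bit with bitsandbytes via transformers.
# Model id and language codes are assumptions for illustration.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/nllb-moe-54b"  # assumed checkpoint name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4 bit
    bnb_4bit_use_double_quant=True,        # double quantization saves a bit more VRAM
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),  # target language
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```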
