Best Way to Load a Model After Training w/o Requantizing

#37
by avstinpaxton - opened

Hi there,
I am new to transformers and to working with larger models, and I could use some help. I fine-tuned the model using adapters after converting it to 8-bit, using the same setup as the notebook on the model page. My model is saved in .bin format and is ~6 GB; is there a way to reload the model without performing 8-bit quantization again?
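To make the question concrete, here is roughly what my reload looks like right now, sketched with transformers + peft (the model name and adapter path are placeholders, and this may not match the notebook's exact setup):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

BASE_ID = "EleutherAI/gpt-j-6b"         # placeholder: the original base model
ADAPTER_DIR = "./my-finetuned-adapter"  # placeholder: where my fine-tuned weights live

# The slow step: the base weights get re-quantized to 8-bit on every load.
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    load_in_8bit=True,   # requires bitsandbytes and a CUDA GPU
    device_map="auto",
)

# Attach the fine-tuned adapter weights on top of the quantized base.
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
```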

I am developing my Flask application on my laptop and attempting to load the model on my CPU. However, the bitsandbytes library only works on GPU because it is a CUDA wrapper. Any ideas on how to run the model on a CPU while I am developing my Flask application?
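This is the kind of thing I'd like to be able to do locally; a minimal sketch, assuming the adapter can be attached to a non-quantized base model (again, names are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "EleutherAI/gpt-j-6b"         # placeholder
ADAPTER_DIR = "./my-finetuned-adapter"  # placeholder

# No load_in_8bit here, so bitsandbytes is not involved; the trade-off is that
# the full-precision weights need much more RAM than the 8-bit model did.
model = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    torch_dtype=torch.float32,  # CPU kernels generally expect fp32
)
model = PeftModel.from_pretrained(model, ADAPTER_DIR)
model.eval()

tok = AutoTokenizer.from_pretrained(BASE_ID)
out = model.generate(**tok("Hello", return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```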

Any help is greatly appreciated. (:

deleted

Did you manage to do that?

Seems the default way of loading and running these PyTorch models is awfully slow. I wrote up how I used ggml to run inference on a GPT-J-6B model here:
https://augchan42.github.io/2023/11/26/Its-Hard-to-find-an-Uncensored-Model.html
But basically, convert pytorch_model.bin to ggml format, then use the ggml gpt-j binary to run inference (rough sketch below). No need to quantize to 8-bit; I used a float16 version.
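Roughly, the two steps look like this. I'm going from memory here: the script and binary names come from the ggml repo's gpt-j example and may differ between versions, so check the post and the repo for details (paths below are placeholders):

```python
import subprocess

# Assumption: the ggml repo is checked out and built, and its gpt-j example
# provides the conversion script and the gpt-j binary used below.
MODEL_DIR = "./my-gptj-model"  # placeholder: directory with pytorch_model.bin + config

# 1. Convert the Hugging Face checkpoint to ggml; "1" selects float16 output.
subprocess.run(
    ["python3", "ggml/examples/gpt-j/convert-h5-to-ggml.py", MODEL_DIR, "1"],
    check=True,
)

# 2. Run inference with the compiled gpt-j example binary on the ggml file.
subprocess.run(
    ["./ggml/build/bin/gpt-j", "-m", f"{MODEL_DIR}/ggml-model-f16.bin", "-p", "Hello"],
    check=True,
)
```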
