Running finetuned inference on CPU - accelerate ImportError

#54
by saikrishna6491 - opened

I have successfully fine-tuned gemma-2b for text classification using LoRA and merged the adapter into the base model with merge_and_unload().
I then saved it to a local path using model.save_pretrained(f"{LOCAL_MODEL_PATH}", safe_serialization=False).
This was done on a GPU machine.
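
For context, the merge-and-save step looked roughly like this (a minimal sketch; the base checkpoint name and ADAPTER_PATH are placeholders, not my exact training code):

from transformers import AutoModelForSequenceClassification
from peft import PeftModel

# Load the base model and attach the trained LoRA adapter
base = AutoModelForSequenceClassification.from_pretrained("google/gemma-2b", num_labels=2)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)

# Fold the LoRA weights into the base weights, then save plain .bin shards
model = model.merge_and_unload()
model.save_pretrained(LOCAL_MODEL_PATH, safe_serialization=False)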

I am trying to load the above model on a CPU-only device for inference with the following script:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(LOCAL_MODEL_PATH, num_labels=2)

However, it fails with the following error:

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`
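
For anyone debugging the same thing: since the error mentions 8-bit quantization, the saved config.json can be checked for a leftover quantization_config from the GPU run (a minimal sketch, reusing LOCAL_MODEL_PATH from above):

import json, os

# Read the config that save_pretrained wrote next to the weights
with open(os.path.join(LOCAL_MODEL_PATH, "config.json")) as f:
    cfg = json.load(f)

# A non-None value here would explain why from_pretrained takes the bitsandbytes path
print(cfg.get("quantization_config"))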

I do have accelerate installed; these are the libraries I installed before loading the model:

tokenizers==0.15.2
transformers==4.39.3
torch==2.2.2
bitsandbytes==0.43.0
accelerate==0.28.0
peft==0.10.0
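
To rule out an environment mismatch, the versions the running interpreter actually sees can be printed like this (a small sketch using the standard-library importlib.metadata):

import importlib.metadata as md

for pkg in ["tokenizers", "transformers", "torch", "bitsandbytes", "accelerate", "peft"]:
    # Prints the version of each package as installed in the active environment
    print(pkg, md.version(pkg))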

I see the same error even when I run the pip install commands in a Linux terminal first.
I would appreciate help resolving this issue so I can run the model on a CPU-only machine for inference.

