Running finetuned inference on CPU - accelerate ImportError

#54
by saikrishna6491 - opened

I have successfully fine-tuned gemma-2b for text classification using LoRA and merged the adapter into the base model with merge_and_unload().
I then saved it to a local path using model.save_pretrained(f"{LOCAL_MODEL_PATH}", safe_serialization=False).
This was done on a GPU machine.
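
For context, the merge-and-save step looked roughly like this (a minimal sketch; the base checkpoint name and ADAPTER_PATH are placeholders, not my exact training code):

from transformers import AutoModelForSequenceClassification
from peft import PeftModel

# Load the base model and attach the trained LoRA adapter
base = AutoModelForSequenceClassification.from_pretrained("google/gemma-2b", num_labels=2)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)

# Fold the LoRA weights into the base weights, then save plain .bin shards
model = model.merge_and_unload()
model.save_pretrained(LOCAL_MODEL_PATH, safe_serialization=False)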

I am trying to load the above model on a CPU-only device for inference with the following script:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(LOCAL_MODEL_PATH, num_labels=2)

However, it fails with the following error:

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`
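
For anyone debugging the same thing: since the error mentions 8-bit quantization, the saved config.json can be checked for a leftover quantization_config from the GPU run (a minimal sketch, reusing LOCAL_MODEL_PATH from above):

import json, os

# Read the config that save_pretrained wrote next to the weights
with open(os.path.join(LOCAL_MODEL_PATH, "config.json")) as f:
    cfg = json.load(f)

# A non-None value here would explain why from_pretrained takes the bitsandbytes path
print(cfg.get("quantization_config"))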

I do have accelerate installed; these are the libraries I installed before loading the model:

tokenizers==0.15.2
transformers==4.39.3
torch==2.2.2
bitsandbytes==0.43.0
accelerate==0.28.0
peft==0.10.0
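
To rule out an environment mismatch, the versions the running interpreter actually sees can be printed like this (a small sketch using the standard-library importlib.metadata):

import importlib.metadata as md

for pkg in ["tokenizers", "transformers", "torch", "bitsandbytes", "accelerate", "peft"]:
    # Prints the version of each package as installed in the active environment
    print(pkg, md.version(pkg))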

I see the same error even when I run the pip install commands in a Linux terminal first.
I would appreciate help resolving this issue so I can run the model on a CPU-only machine for inference.

