Quantized version of Mistral 7B (4bit or 8 bit)

#18
by ianuvrat - opened

Can't run it on Colab (free tier). Can anyone guide me on how to run Mistral in 8-bit?

ianuvrat changed discussion title from Quantized version of Mistral 7B (4bit or * bit) to Quantized version of Mistral 7B (4bit or 8 bit)

Same issue.

After installing the requirements with these two lines:

!pip install git+https://github.com/huggingface/transformers
!pip install accelerate bitsandbytes

I always pass load_in_8bit=True (or load_in_4bit=True) and device_map='cuda' while loading the model:

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map='cuda',
)

But on Colab the CPU memory goes OOM. I don't know why this is loading into CPU memory instead of GPU memory!
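One quick way to check where the weights actually ended up (a small diagnostic sketch, assuming the model finishes loading at all) is:

# Diagnostic: where did the layers land? hf_device_map is set by accelerate
# when a device_map is passed to from_pretrained.
print(getattr(model, "hf_device_map", None))
print({p.device for p in model.parameters()})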

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
import torch

model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"
config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
config.max_position_embeddings = 8096
quantization_config = BitsAndBytesConfig(
    llm_int8_enable_fp32_cpu_offload=True,  # allow fp32 CPU offload for modules that don't fit on the GPU
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    load_in_4bit=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    config=config,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto",          # let accelerate decide GPU/CPU placement
    offload_folder="./offload"  # folder used if any weights get offloaded to disk
)

prompt = "[INST]your prompt[/INST]"
print("\n\n*** Generate:")
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
streamer = TextStreamer(tokenizer, skip_prompt=True)  # stream tokens to stdout as they are generated
output = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.3,
    top_k=20,
    top_p=0.4,
    repetition_penalty=1.1,
    do_sample=True
)
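The streamer prints tokens as they are generated; generate also returns the full sequence of ids, so the complete text can be recovered afterwards, e.g.:

# Decode the returned ids (prompt + completion) into a plain string
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)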

For anyone using Colab: remove device_map='cuda' and the model will load onto the GPU correctly.
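For example, a minimal sketch of an 8-bit load (using device_map="auto" so accelerate handles placement; the 8-bit weights are roughly 7-8 GB, so they should fit on a Colab T4 -- adjust the model id as needed):

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 8-bit quantization; device_map="auto" lets accelerate place the layers on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)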
