Quantized version of Mistral 7B (4bit or 8 bit)

#18
by ianuvrat - opened

Can't run it on Colab (free tier). Can anyone guide me on how to run Mistral in 8-bit?

ianuvrat changed discussion title from Quantized version of Mistral 7B (4bit or * bit) to Quantized version of Mistral 7B (4bit or 8 bit)

Same issue.

After installing the requirements with these two lines:

!pip install git+https://github.com/huggingface/transformers
!pip install accelerate bitsandbytes

I always pass load_in_8bit=True (or load_in_4bit=True) and device_map='cuda' while loading the model:

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map='cuda',
)

But on Colab the CPU memory goes OOM. I don't know why this is loading into CPU memory instead of GPU memory!
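One quick way to check where the weights actually ended up (a small diagnostic sketch, assuming the model finishes loading at all) is:

# Diagnostic: where did the layers land? hf_device_map is set by accelerate
# when a device_map is passed to from_pretrained.
print(getattr(model, "hf_device_map", None))
print({p.device for p in model.parameters()})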

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
import torch

model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"
config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
config.max_position_embeddings = 8096
quantization_config = BitsAndBytesConfig(
    llm_int8_enable_fp32_cpu_offload=True,  # allow fp32 CPU offload for modules that don't fit on the GPU
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    load_in_4bit=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    config=config,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto",          # let accelerate decide GPU/CPU placement
    offload_folder="./offload"  # folder used if any weights get offloaded to disk
)

prompt = "[INST]your prompt[/INST]"
print("\n\n*** Generate:")
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
streamer = TextStreamer(tokenizer, skip_prompt=True)  # stream tokens to stdout as they are generated
output = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.3,
    top_k=20,
    top_p=0.4,
    repetition_penalty=1.1,
    do_sample=True
)
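The streamer prints tokens as they are generated; generate also returns the full sequence of ids, so the complete text can be recovered afterwards, e.g.:

# Decode the returned ids (prompt + completion) into a plain string
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)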

For anyone using Colab: remove device_map='cuda' and the model will load onto the GPU correctly.
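For example, a minimal sketch of an 8-bit load (using device_map="auto" so accelerate handles placement; the 8-bit weights are roughly 7-8 GB, so they should fit on a Colab T4 -- adjust the model id as needed):

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 8-bit quantization; device_map="auto" lets accelerate place the layers on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)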
