Why is 4-bit quantised slower than fp16?

#63 opened by kapil1611

I am trying to wrap my head around why A is faster than B.

A.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer_large = AutoTokenizer.from_pretrained("google/flan-t5-large")
model_large = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large", torch_dtype=torch.float16, device_map="auto")

is faster than

B.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/flan-t5-large"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
)
model_large = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=quantization_config)
tokenizer_large = AutoTokenizer.from_pretrained(model_id)
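
To make the comparison measurable, here is a minimal timing sketch (not from the original post) that loads both variants and reports the average per-generation latency. It assumes a CUDA GPU with transformers, accelerate, and bitsandbytes installed; the prompt, max_new_tokens, and run count are illustrative choices only.

import time

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Translate to German: The house is wonderful.", return_tensors="pt").to("cuda")

def time_generate(model, n_runs=10):
    model.generate(**inputs, max_new_tokens=32)  # warm-up pass, not timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=32)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

# A: fp16 weights, matmuls run directly in half precision.
model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
print(f"fp16 : {time_generate(model_fp16):.3f} s/generation")
del model_fp16
torch.cuda.empty_cache()

# B: 4-bit weights; the quantized linear layers dequantize on the fly,
# which reduces memory usage but adds overhead to each forward pass.
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=False)
model_4bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=quantization_config)
print(f"4-bit: {time_generate(model_4bit):.3f} s/generation")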

ybelkada (Google org)

Hi @kapil1611,
Please have a look at my comment here, which explains everything in detail: https://huggingface.co/google/flan-t5-large/discussions/17#6524249ca9a710554b0d3723
Closing this as it is a duplicate of that issue.

ybelkada changed discussion status to closed
