Why is 4-bit quantised slower than fp16?

#63 opened by kapil1611

I am trying to wrap my head around why A is faster than B.

A.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer_large = AutoTokenizer.from_pretrained("google/flan-t5-large")
model_large = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large", torch_dtype=torch.float16, device_map="auto")

is faster than

B.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/flan-t5-large"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
)
model_large = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=quantization_config)
tokenizer_large = AutoTokenizer.from_pretrained(model_id)
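
To make the comparison measurable, here is a minimal timing sketch (not from the original post) that loads both variants and reports the average per-generation latency. It assumes a CUDA GPU with transformers, accelerate, and bitsandbytes installed; the prompt, max_new_tokens, and run count are illustrative choices only.

import time

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Translate to German: The house is wonderful.", return_tensors="pt").to("cuda")

def time_generate(model, n_runs=10):
    model.generate(**inputs, max_new_tokens=32)  # warm-up pass, not timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=32)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

# A: fp16 weights, matmuls run directly in half precision.
model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
print(f"fp16 : {time_generate(model_fp16):.3f} s/generation")
del model_fp16
torch.cuda.empty_cache()

# B: 4-bit weights; the quantized linear layers dequantize on the fly,
# which reduces memory usage but adds overhead to each forward pass.
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=False)
model_4bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=quantization_config)
print(f"4-bit: {time_generate(model_4bit):.3f} s/generation")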

ybelkada (Google org)

Hi @kapil1611,
Please have a look at my comment here, which explains everything in detail: https://huggingface.co/google/flan-t5-large/discussions/17#6524249ca9a710554b0d3723
Closing this as it is a duplicate of that issue.

ybelkada changed discussion status to closed
