Very Slow Inference

#7
by LucasGarvey - opened

I'm trying to run the 34B model with Hugging Face transformers, but it takes about 200 seconds to run inference on a single image, while the 13B model only takes about 5 seconds. I'm running on 6 L40S GPUs (48 GB VRAM each), so VRAM shouldn't be an issue. I also tried running at 4-bit on one GPU and it still took about 80 seconds, and I tried using liuhaotian/llava-v1.6-34b-tokenizer as the tokenizer, but it still took much longer than expected.

If I run it in Gradio locally there are no issues. I'll leave my model-loading code below:

import torch
from transformers import BitsAndBytesConfig, LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-34b-hf"

processor = LlavaNextProcessor.from_pretrained(model_id)

# quantization_config was not defined in the original snippet; assuming the
# 4-bit bitsandbytes setup mentioned above.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    device_map=1,  # place the whole model on GPU 1
    quantization_config=quantization_config,
)
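The inference call looks roughly like this (a minimal sketch: the image path and question are placeholders, and the prompt format follows the llava-v1.6-34b-hf model card):

from PIL import Image

# "image.jpg" is a placeholder; prompt uses the ChatML-style template of the 34B model.
image = Image.open("image.jpg")
prompt = "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))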

Found the issue: adding use_cache=True to model.generate() fixed it.
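For reference, the generate call with the flag (the other arguments are illustrative):

# Enable the KV cache explicitly; without it, attention keys/values are
# recomputed for the whole sequence at every decoding step, which is very slow.
output = model.generate(**inputs, max_new_tokens=100, use_cache=True)
print(processor.decode(output[0], skip_special_tokens=True))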

LucasGarvey changed discussion status to closed
