Very Slow Inference
I'm trying to run the 34B model with Hugging Face, but it takes about 200 seconds to run inference on a single image, while 13B only takes 5 seconds. I'm running on 6 L40S GPUs (48 GB VRAM each), so VRAM shouldn't be an issue. I also tried running at 4-bit on a single GPU and it still took about 80 seconds, and tried using liuhaotian/llava-v1.6-34b-tokenizer as the tokenizer, but it still took longer than expected.
If I run it in Gradio locally there are no issues. I'll leave the model loading code below:
import torch
from transformers import BitsAndBytesConfig, LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-34b-hf"

# quantization_config was not defined in the original snippet; a 4-bit config is assumed from the post
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    device_map=1,  # load onto a single GPU (cuda:1)
    quantization_config=quantization_config,
)
Found the issue: adding use_cache=True to model.generate() resolved it.
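For anyone else hitting this, here is a minimal sketch of the generation call with that fix. The prompt template follows the llava-v1.6-34b-hf model card and the example image is just a placeholder; adjust both to your own setup:

import requests
from PIL import Image

# placeholder image, replace with your own input
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# chat-style prompt format assumed from the llava-v1.6-34b-hf model card
prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# use_cache=True enables the KV cache so past tokens aren't recomputed at every decoding step
output = model.generate(**inputs, max_new_tokens=100, use_cache=True)
print(processor.decode(output[0], skip_special_tokens=True))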