Support for multiple images

#19
by wamozart - opened

I'm trying to pass multiple images in the prompt and ask the model to find the differences between the two images.
import requests
from PIL import Image

# url1 and url2 are the URLs of the two images to compare
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)
images = (image1, image2)
prompt = """
[INST] \nYou are giving two images of , determine if these are the same image or not and the reason. [/INST]
"""
It seems to ignore the second image. Any suggestions?

Llava Hugging Face org

Hey!

Yes, LLaVa-NeXT can accept multiple images as input, as shown here. However, since the model was not pre-trained with several images interleaved in one prompt, it might not perform well.

I recommend fine-tuning it for your use case if you want decent generation quality when conditioning on several images.
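
For reference, here is a minimal sketch of how two images could be passed in one prompt with the transformers LLaVa-NeXT classes. It assumes the checkpoint is llava-hf/llava-v1.6-mistral-7b-hf (the thread does not name one), that the prompt follows the Mistral-style [INST] ... [/INST] template, and that each image needs its own <image> placeholder in the text. The helper names build_multi_image_prompt and compare_images are made up for illustration, and the question text is only an example.

```python
def build_multi_image_prompt(question, num_images):
    """Build an [INST] prompt with one <image> placeholder per input image.

    Assumption: the checkpoint uses the Mistral-style LLaVa-NeXT template.
    """
    image_tokens = "<image>\n" * num_images
    return f"[INST] {image_tokens}{question} [/INST]"


def compare_images(image1, image2, model_id="llava-hf/llava-v1.6-mistral-7b-hf"):
    """Ask the model whether two PIL images are the same (hypothetical helper)."""
    # Imported lazily so the prompt helper above works without a GPU stack.
    import torch
    from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

    processor = LlavaNextProcessor.from_pretrained(model_id)
    # device_map="auto" requires the accelerate package.
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    prompt = build_multi_image_prompt(
        "Are these two images the same? Explain why.", num_images=2
    )
    # The number of images must match the number of <image> placeholders.
    inputs = processor(
        images=[image1, image2], text=prompt, return_tensors="pt"
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=200)
    return processor.decode(output[0], skip_special_tokens=True)
```

Calling compare_images(image1, image2) with two PIL images should return the decoded answer. As noted above, quality on multi-image prompts may be poor without fine-tuning.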

How should I use this model to generate captions for 3 million images? What resources should I use, and where should I run it? What would the compute cost be? What kind of parallelization should I use?

Llava Hugging Face org

@LBS-LENKA you can either use TGI to serve it, which comes with many optimizations under the hood: https://github.com/huggingface/text-generation-inference
Or you can look at this project I'm building to optimize vision/multimodal models, where you can find recipes depending on your hardware: https://github.com/merveenoyan/smol-vision
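
On the cost and parallelization part of the question, a rough back-of-envelope sketch may help with planning. The throughput figure below is a placeholder, not a measurement: benchmark a few hundred images on your own hardware first and plug in the real number. The function names are hypothetical, and the sharding shown is plain data parallelism (each worker or GPU captions its own slice of the file list).

```python
def estimate_gpu_hours(num_images, images_per_second_per_gpu):
    """Total GPU-hours needed, given a measured per-GPU throughput."""
    return num_images / images_per_second_per_gpu / 3600


def shard(items, num_workers):
    """Split items round-robin so each worker gets a near-equal share."""
    return [items[i::num_workers] for i in range(num_workers)]


if __name__ == "__main__":
    # Placeholder throughput: replace with a number measured on your setup.
    gpu_hours = estimate_gpu_hours(3_000_000, images_per_second_per_gpu=2.0)
    num_gpus = 8
    print(f"GPU-hours: {gpu_hours:.1f}")
    print(f"Wall-clock hours on {num_gpus} GPUs: {gpu_hours / num_gpus:.1f}")

    # Each worker processes its own shard of image paths independently.
    paths = [f"img_{i}.jpg" for i in range(10)]
    for worker_id, worker_paths in enumerate(shard(paths, 4)):
        print(worker_id, worker_paths)
```

Multiplying the GPU-hours by your provider's hourly GPU price gives a first cost estimate.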
