Llava or LlavaNextForConditionalGeneration?

#3
by mdoeir - opened

In the demo you are using LlavaModel, but shouldn't the llava-interleave model be a LlavaNextModel?

Llava Hugging Face org

@mdoeir yes, for the HF implementation it's a LlavaModel. That's because we don't support anything except multimodal_patch_merge_type == "spatial_unpad" for LLaVA-NeXT, and the architecture of the interleave models is the same as in LLaVA if we follow the flags set by the original implementation. So it's okay to use LlavaModel here.
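For anyone landing here later, a minimal loading sketch following the answer above. The repo id is an assumption (check the model card for the exact checkpoint name); the point is that interleave checkpoints use the plain LLaVA classes, not the LLaVA-NeXT ones:

```python
# Assumed repo id -- verify against the model card before using.
MODEL_ID = "llava-hf/llava-interleave-qwen-0.5b-hf"

def load_interleave(model_id=MODEL_ID):
    """Load an interleave checkpoint with the plain-LLaVA classes.

    Imported lazily so this sketch can be read without transformers installed.
    """
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    # LlavaForConditionalGeneration, not LlavaNextForConditionalGeneration:
    # the interleave models follow the original LLaVA layout on the HF side.
    model = LlavaForConditionalGeneration.from_pretrained(model_id)
    return processor, model
```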

@RaushanTurganbay Many thanks! Another question, though:
I tried to run the pure transformers demo with the 0.5b model, but I got the following error:

Traceback (most recent call last):
  File "/data/root/code/opensource/llava-next-interleave/llava_next_demo_hf.py", line 32, in <module>
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
  File "/data/root/miniconda3/envs/pl/lib/python3.10/site-packages/transformers/processing_utils.py", line 926, in apply_chat_template
    raise ValueError(
ValueError: No chat template is set for this processor. Please either set the chat_template attribute, or provide a chat template as an argument. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.

$ pip list | grep "transformers"
transformers 4.42.4
transformers-stream-generator 0.0.5

Did I miss something from the demo?

Llava Hugging Face org

Yes, you have to update your transformers version via !pip install --upgrade git+https://github.com/huggingface/transformers.git to use chat templates. Support was added just a few days ago and hasn't yet made it into a PyPI release :)
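For reference, once you're on a recent enough transformers, the conversation passed to apply_chat_template is a list of role/content dicts where content is itself a list of typed entries. A sketch of that structure (entry names follow the transformers chat templating docs; the question text is just a placeholder):

```python
# Sketch of the multimodal conversation format consumed by
# processor.apply_chat_template in recent transformers versions.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # image placeholder; the actual image is passed to the processor separately
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

# Then (requires the up-to-date install above; not run here):
# prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
```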
