Add support for AutoModelForCausalLM or LlavaForConditionalGeneration loading

#1
by RonanMcGovern - opened

Could you add support for this?

Right now the config.json file in these repos is just the Llama 3 config.

When trying to load with LlavaForConditionalGeneration.from_pretrained(), I get:

Some weights of LlavaForConditionalGeneration were not initialized from the model checkpoint at xtuner/llava-llama-3-8b-v1_1 and are newly initialized: ['model.language_model.lm_head.weight', 'model.language_model.model.embed_tokens.weight', 'model.language_model.model.layers.0.input_layernor
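
For reference, the loading call is just the stock from_pretrained (a minimal sketch):

from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("xtuner/llava-llama-3-8b-v1_1")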

When trying to run evaluation, even just running the processor with:

inputs = processor(prompt, raw_image, return_tensors='pt').to(DEVICE, DTYPE)

I get:

ValueError                                Traceback (most recent call last)
Cell In[64], line 15
     13 raw_image = sample['image']
---> 15 inputs = processor(prompt, raw_image, return_tensors='pt').to(DEVICE, DTYPE)
     16 output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
     17 print('model: ', processor.decode(output[0][2:], skip_special_tokens=True))

File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2858, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2856     if not self._in_target_context_manager:
   2857         self._switch_to_input_mode()
-> 2858     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2859 if text_target is not None:
   2860     self._switch_to_target_mode()

File /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:2922, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2916     raise ValueError(
   2917         "text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2918         "or `List[List[str]]` (batch of pretokenized examples)."
   2919     )
   2921 if text_pair is not None and not _is_valid_text_input(text_pair):
-> 2922     raise ValueError(
   2923         "text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2924         "or `List[List[str]]` (batch of pretokenized examples)."
   2925     )
   2927 if is_split_into_words:
   2928     is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

ValueError: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
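
Reading the traceback, the "processor" being called here is actually a plain tokenizer, and raw_image lands in the tokenizer's text_pair slot (that's the check that raises at line 2922), presumably because the repo only ships tokenizer files and no processor config. With a proper LlavaProcessor available, the call would look something like this (a sketch with explicit keywords):

inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(DEVICE, DTYPE)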
RonanMcGovern changed discussion title from Add support for AutoModelForCausalLM loading to Add support for AutoModelForCausalLM or LlavaForConditionalGeneration loading
xtuner org

@RonanMcGovern Thank you very much for your feedback. We will strive to provide support as soon as possible.


Hi,
Could you please help me with how to load the LLaVA Llama 3 vision model and run prediction with it?

xtuner org

Hi @darkshadow @RonanMcGovern

We have released weights matching the LLaVA v1.5/v1.6 architecture here. You can try this model with your workflow!
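
A minimal sketch of the standard LLaVA flow these weights should drop into (the repo id below is an assumption; substitute the one linked above, and check the model card for the exact prompt template):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"  # assumed id; use the linked repo

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

raw_image = Image.open("example.jpg")
prompt = "<image>\nWhat is shown in this image?"  # placeholder; follow the model card's template

inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))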

OK, great. I'll try loading that with a LLaVA flow.

xtuner org
edited Apr 26

Ok, that's great, thanks!

RonanMcGovern changed discussion status to closed
