model type llava_mistral is unrecognised

#1
by shshwtv - opened

ValueError: The checkpoint you are trying to load has model type llava_mistral but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

How can I add this model type to Transformers?
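For context, the error appears when the checkpoint is loaded through the generic Transformers auto classes, because the installed Transformers release has no llava_mistral architecture registered. A minimal reproduction sketch (the Hub repo id is my assumption):

from transformers import AutoConfig

# The installed Transformers release does not know the "llava_mistral" model type,
# so resolving the config already raises the ValueError quoted above.
config = AutoConfig.from_pretrained("microsoft/llava-med-v1.5-mistral-7b")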

Microsoft org

Hi - please check our repo (https://github.com/microsoft/LLaVA-Med?tab=readme-ov-file#contents) for instructions on using LLaVA-Med v1.5.

Hi, thanks for your response. I want to fine-tune your base model on my own data. Please let me know whether this will be possible in the near future. Thank you!

Hi, did you solve the problem? I ran into the same issue where Transformers does not recognize this architecture. I downloaded the model files and am running them offline.

No, it's not possible. We switched to the original LLaVA model.

Hello. I want to fine-tune LLaVA-Med on my own dataset. Is that possible? Did you find a solution?

I can load it successfully; the steps are as follows:

  1. Clone the repository from https://github.com/microsoft/LLaVA-Med and create a virtual environment.
  2. Download the weights from this repository.
  3. Use the following code to load the model:
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path='<path_to_downloaded_repository(this)>',
    model_base=None,
    model_name='llava-med-v1.5-mistral-7b',
)

Then I can use this model like any other model in the Hugging Face Transformers library.
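To sketch how the loaded pieces fit together, here is a rough inference example. The helper names (conv_templates, process_images, tokenizer_image_token) and the "mistral_instruct" template follow the upstream LLaVA code that LLaVA-Med forks, so treat them as assumptions and compare against the scripts in the LLaVA-Med repo:

import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path='<path_to_downloaded_repository(this)>',
    model_base=None,
    model_name='llava-med-v1.5-mistral-7b',
)

# Build a single-turn prompt; "mistral_instruct" is an assumption about the right template.
conv = conv_templates["mistral_instruct"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe the findings in this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image and splice the image token into the prompt ids.
image = Image.open("example.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=256, do_sample=False)

# Depending on the repo version, generate may also return the prompt tokens;
# strip them before decoding if the prompt shows up in the output.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())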

Thanks for the update @mizukiQ

Also, I wanted to ask: what is the maximum image resolution that can be used? ViT-L/14 supports 224 x 224.
And what are the various strategies for handling CT/MR images?

I am also happy to join a Discord or Zoom call to catch up and exchange notes with other builders in the space.

Best,
Shash

Thank you!
I also want to ask how I should prepare my dataset. I have images and captions. How should I convert them for fine-tuning LLaVA-Med?
Is there any tutorial?

I am not one of the official LLaVA-Med researchers, but here is some configuration from their code:
LLaVA-Med uses CLIPImageProcessor to handle images; its crop_size is (336, 336), and the image is split into a (24, 24) grid of patches, where each patch is 14 x 14.
LLaVA-Med handles images from different modalities (CT or MR) in the same way.
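If you want to check those numbers yourself, the processor config can be inspected directly; a small sketch, assuming the vision tower is the 336px CLIP ViT-L/14 checkpoint:

from transformers import CLIPImageProcessor

# Assumed vision tower checkpoint; adjust it if the vision tower entry in the
# model's config points elsewhere.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
print(processor.crop_size)  # {'height': 336, 'width': 336}
print(processor.size)       # shortest edge is resized to 336 before the centre crop
# 336 / 14 = 24, so the ViT sees a 24 x 24 grid of 14 x 14 pixel patches.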

I published some data-loading code here (the code is from my reproduction of LLaVA-Med, so it may not be polished).
You can replace the SlakeDataset class with your own dataset class (since image captioning and VQA have the same form, I + Q -> T); just keep the two interfaces the same and fine-tune the model.
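In case it helps with the conversion step, here is one way to turn plain image-caption pairs into the conversation-style JSON used by the LLaVA instruction-tuning scripts; the field names follow the upstream LLaVA format, so double-check them against the data format expected by the repo you fine-tune with:

import json
from pathlib import Path

# Hypothetical input: (image filename, caption) pairs from your own dataset.
pairs = [
    ("chest_xray_001.png", "Frontal chest radiograph with no acute findings."),
    ("brain_mri_002.png", "Axial T2-weighted MRI showing a small hyperintense lesion."),
]

records = []
for idx, (image_file, caption) in enumerate(pairs):
    records.append({
        "id": f"caption_{idx}",
        "image": image_file,  # path relative to the image folder passed to the training script
        "conversations": [
            # "<image>\n" marks where the image tokens are spliced into the prompt.
            {"from": "human", "value": "<image>\nDescribe the image."},
            {"from": "gpt", "value": caption},
        ],
    })

Path("my_caption_data.json").write_text(json.dumps(records, indent=2))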
