[Error] Error when executing the example code

#3
by StarCycle - opened

Hi,

If I run the model with the example code inside the repo folder, I get this error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-4-fd9293295145> in <cell line: 2>()
      1 import torch
----> 2 from modeling_siglip import SiglipVisionModel
      3 
      4 DEVICE = torch.device("cuda:0")
      5 PATCH_SIZE = 14

/content/siglip-so400m-14-980-flash-attn2-navit/modeling_siglip.py in <module>
     40     replace_return_docstrings,
     41 )
---> 42 from .configuration_siglip import SiglipConfig, SiglipTextConfig, SiglipVisionConfig
     43 
     44 

ImportError: attempted relative import with no known parent package
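For context, this error is a general Python rule rather than anything specific to this repo: `from .configuration_siglip import …` is a relative import, and relative imports are resolved against the module's parent package, which a script run or imported as a top-level module does not have. A minimal, self-contained reproduction of the same failure mode (file names here are illustrative):

```python
import pathlib
import subprocess
import sys
import tempfile

# Write a throwaway module that uses a relative import, then execute it as a
# top-level script: Python raises ImportError before even looking for the
# sibling module, exactly as with modeling_siglip.py above.
with tempfile.TemporaryDirectory() as d:
    mod = pathlib.Path(d) / "standalone.py"
    mod.write_text("from .sibling import X\n")
    proc = subprocess.run(
        [sys.executable, str(mod)], capture_output=True, text=True
    )

print("relative import" in proc.stderr)  # True
```

This is why loading the checkpoint through `AutoModel.from_pretrained(..., trust_remote_code=True)` works: transformers imports the remote files as a proper package, so the relative import resolves.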

If I replace the source files in transformers (e.g., modeling_siglip.py) with the ones from this repo, I get this error:

(screenshot of the error message)

Actually the argument does exist:

(screenshot showing the argument)

If I run the code with:

import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("HuggingFaceM4/siglip-so400m-14-384-flash-attn2", trust_remote_code=True)
model.eval().cuda().half()

pixel_values = torch.randn(1, 3, 384, 384).cuda().half()
output = model.vision_model(pixel_values)

It does work, but the model only accepts images at 384×384 resolution. If I send a 512×512 image, I get a dimension-mismatch error from the position embedding.
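The mismatch follows from standard ViT patch arithmetic: the patchifying convolution (kernel size = stride = patch size) produces a fixed grid of patches per side, and the learned position-embedding table has exactly as many entries as the training-resolution grid. A quick sketch of the numbers for this checkpoint:

```python
# Patch-grid size per side for a Conv2d with kernel = stride = PATCH:
# (side - PATCH) // PATCH + 1. The position-embedding table is sized for
# the training resolution, so any other grid size fails to broadcast.
PATCH = 14

def num_patches(side: int, patch: int = PATCH) -> int:
    """Total number of patches for a square image of the given side."""
    return ((side - patch) // patch + 1) ** 2

print(num_patches(384))  # 729 positions baked into the 384-px checkpoint
print(num_patches(512))  # 1296 positions requested -> shape mismatch
```

The NaViT-style checkpoint suggested below sidesteps this by handling variable resolutions.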

Could you please modify the example code so it can be executed? How can I run the model successfully on Google Colab?

HuggingFaceM4 org

Hi @StarCycle
Have you tried model = AutoModel.from_pretrained("HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit") and then model.vision_model?

Thanks @VictorSanh !

It works with:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit", trust_remote_code=True)
model.eval().cuda().half()

pixel_values = torch.randn(1, 3, 224, 384).cuda().half()  # any resolution here
output = model.vision_model(pixel_values)

Is it necessary to specify patch_attention_mask, as in the README example?
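For readers wondering what such a mask would contain: in NaViT-style batching, images of different sizes are padded into a common canvas, and a boolean per-patch mask tells attention which patches are real. The helper below is a hypothetical sketch (the name `patch_mask` and the padding scheme are illustrative, not this repo's actual API):

```python
# Hypothetical sketch: build a per-patch attention mask for one image padded
# into a larger canvas, NaViT-style. True = real patch, False = padding.
# Assumes top-left-aligned padding; this is an illustration, not the repo API.
PATCH = 14

def patch_mask(real_h: int, real_w: int, padded_h: int, padded_w: int):
    grid_h, grid_w = padded_h // PATCH, padded_w // PATCH   # full padded grid
    real_gh, real_gw = real_h // PATCH, real_w // PATCH     # grid of real patches
    return [
        [r < real_gh and c < real_gw for c in range(grid_w)]
        for r in range(grid_h)
    ]

# A 224x384 image padded into the checkpoint's 980x980 canvas:
mask = patch_mask(224, 384, 980, 980)
print(len(mask), len(mask[0]))         # 70 70  (the 980/14 patch grid)
print(sum(sum(row) for row in mask))   # 432   (16 * 27 valid patches)
```

If the mask is omitted, the natural fallback is to treat every patch as valid, which is fine when the batch contains a single unpadded image like the snippet above.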
