How can I save or change the vision tower?

#40
by plll123 - opened

I only want to fine-tune the vision tower while keeping the other parts frozen. So I want to save the vision tower to a local file and fine-tune it.
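
(For reference, a minimal sketch of this setup, not taken from the original post: load the full model, freeze every parameter, then unfreeze only the vision tower. It assumes the llava-hf/llava-1.5-7b-hf checkpoint used later in this thread.)

import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
)

# Freeze everything, then re-enable gradients only for the vision tower.
for param in model.parameters():
    param.requires_grad = False
for param in model.vision_tower.parameters():
    param.requires_grad = True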

So I use the following code to save the model weights:

vision_model = model.vision_tower
vision_model.save_pretrained('Vision_Encoder/CLIP_vison')

and load weight with

tmp_model.from_pretrained('Vision_Encoder/CLIP_vison')

but I got the following error:

Traceback (most recent call last):
  File "/home/zxm/.conda/envs/mma/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/zxm/.conda/envs/mma/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4785, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
    size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc1.weight: copying a param with shape torch.Size([2097152, 1]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc2.weight: copying a param with shape torch.Size([2097152, 1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
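
(A quick way to check what was actually written to disk, a hedged sketch that assumes save_pretrained produced a model.safetensors file in the output directory, is to open the file directly and print a weight shape; here it shows the flattened torch.Size([524288, 1]) reported in the error above rather than the expected torch.Size([1024, 1024]).)

from safetensors.torch import load_file

# Inspect the saved checkpoint directly (assumes the default safetensors filename).
state_dict = load_file('Vision_Encoder/CLIP_vison/model.safetensors')
print(state_dict['vision_model.encoder.layers.0.self_attn.k_proj.weight'].shape)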
Llava Hugging Face org

Pinging @RaushanTurganbay here

Llava Hugging Face org

@plll123 are you tuning only the vision part without the generative model, so the tuning goal is not CE loss with frozen LLM?

Can you share more details of how exactly the model is being loaded and what tmp_model is in the code snippet? If you saved the vision model only, you should be able to load it back with the vision model class (not the VLM class).

I'm trying to save the vision_tower from the model locally and then test whether it can be loaded back from the saved file. When I run the code snippet below, the vision_tower is saved successfully.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, CLIPVisionModel


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    # model_id, 
    "LLava model", # A model downloaded locally from Huggingface
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
    load_in_4bit=True
).to(device)

vision_encoder = model.vision_tower
vision_encoder.save_pretrained('Vision_Encoder/CLIP_vision')

new_vision_encoder = CLIPVisionModel.from_pretrained('Vision_Encoder/CLIP_vision')

But an error occurs when I try to load it:

Traceback (most recent call last):
  File "/home/zxm/.conda/envs/mma/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/zxm/.conda/envs/mma/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4785, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
    size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc1.weight: copying a param with shape torch.Size([2097152, 1]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for vision_model.encoder.layers.0.mlp.fc2.weight: copying a param with shape torch.Size([2097152, 1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
    size mismatch for vision_model.encoder.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.1.self_attn.q_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.1.self_attn.out_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    ...
    size mismatch for vision_model.encoder.layers.23.self_attn.q_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.23.self_attn.out_proj.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
    size mismatch for vision_model.encoder.layers.23.mlp.fc1.weight: copying a param with shape torch.Size([2097152, 1]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for vision_model.encoder.layers.23.mlp.fc2.weight: copying a param with shape torch.Size([2097152, 1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
Llava Hugging Face org

@plll123 I see now, it is because you load the LLaVA model in 4-bit and then try to save it. You need to load LLaVA in auto precision, which is fp16, and then save it. That way your saved weights will have the correct shapes and can be loaded back into CLIPVisionModel:

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, CLIPVisionModel

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
)

vision_encoder = model.vision_tower
vision_encoder.save_pretrained('CLIP_vision')

new_vision_encoder = CLIPVisionModel.from_pretrained('CLIP_vision')

print(new_vision_encoder)
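
(Once the standalone vision encoder has been fine-tuned, one way to put the updated weights back into the full LLaVA model, a sketch not shown in this thread that relies on model.vision_tower being the same CLIPVisionModel class, is to copy the state dict over:)

# Copy the fine-tuned vision encoder weights back into the full model's vision tower.
model.vision_tower.load_state_dict(new_vision_encoder.state_dict())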
