Running on macOS - issues with getting the projector built.

#2 opened by FiditeNemini (Cognitive Computations org)

Hey folks, I've been trying to run on macOS and am running into some difficulties.

Shards load OK and the chat template works fine, but I'm running into issues when the vision tower loads. To get initial loading to work, I had to tweak the code you supplied to get around the flash_attn issue, as follows:

import os
from unittest.mock import patch
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_imports
from PIL import Image
import warnings

# logging: debug verbosity enabled for troubleshooting; warning suppression left commented out
transformers.logging.set_verbosity_debug()
# transformers.logging.disable_progress_bar()
# warnings.filterwarnings('ignore')

def fixed_get_imports(model_name: str | os.PathLike) -> list[str]:
    """Workaround for running on macOS, where flash_attn is unavailable."""
    if not str(model_name).endswith("/modeling_llava_qwen2.py"):
        return get_imports(model_name)
    imports = get_imports(model_name)
    imports.remove("flash_attn")  # drop flash_attn so the remote modeling code imports cleanly
    return imports

# set device
torch.set_default_device('cpu')  # keep everything on CPU; no flash_attn/CUDA on macOS

model_name = 'cognitivecomputations/dolphin-vision-72b'

with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    # create model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

# split the prompt around the <image> tag and splice in the image placeholder token id (-200)
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# image, sample images can be found in images folder
image = Image.open('/Downloads/test.jpeg')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())

Everything works until the google/siglip vision tower loads (google/siglip-so400m-patch14-384/model.safetensors), at which point I get the following output while debugging.

- This IS expected if you are initializing SigLipVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing SigLipVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of SigLipVisionModel were initialized from the model checkpoint at /Users/willdee/Documents/Projects/llama.cpp/models/siglip-so400m-patch14-384.
If your task is similar to the task the model of the checkpoint was trained on, you can already use SigLipVisionModel for predictions without further training.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
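
For the last two warnings, my guess is that generate() wants an explicit attention_mask and pad_token_id. A minimal sketch of what I had in mind (my assumption, not from the sample code; the mask is all ones since there's only a single unpadded sequence in the batch):

attention_mask = torch.ones_like(input_ids)  # single unpadded sequence, so a mask of all ones

output_ids = model.generate(
    input_ids,
    attention_mask=attention_mask,
    images=image_tensor,
    pad_token_id=tokenizer.eos_token_id,  # make the pad token explicit to silence the warning
    max_new_tokens=2048,
    use_cache=True)[0]

That would only address the generation-time warnings, though, not the SigLipVisionModel messages during loading.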

I'd appreciate any advice on what I can try to get around this.
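
In case it helps narrow things down, one thing I'm planning to try is loading the SigLIP tower on its own with the stock transformers classes (SiglipVisionModel / SiglipImageProcessor, capitalised differently from the repo's SigLipVisionModel) to rule out the checkpoint itself. A rough sketch, assuming a transformers version with SigLIP support:

from transformers import SiglipVisionModel, SiglipImageProcessor
from PIL import Image

siglip_name = 'google/siglip-so400m-patch14-384'
vision_tower = SiglipVisionModel.from_pretrained(siglip_name)
processor = SiglipImageProcessor.from_pretrained(siglip_name)

image = Image.open('/Downloads/test.jpeg')
inputs = processor(images=image, return_tensors='pt')
outputs = vision_tower(**inputs)
print(outputs.last_hidden_state.shape)  # expect (batch, num_patches, hidden_size) if the tower loads cleanly

If that loads cleanly, then presumably the problem is in how the dolphin-vision remote code wires the tower in, rather than the SigLIP checkpoint itself.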

Thanks much!
Will
