Spaces: Runtime error
Error converting Salesforce/blip-image-captioning-base
Hello,
I am very new to Hugging Face and machine learning in general. I understand that the BLIP model is not supported for conversion to Core ML. Is there a way I can write my own conversion code?
Thanks
Conversion Settings:
Model: Salesforce/blip-image-captioning-base
Task: None
Framework: None
Compute Units: None
Precision: None
Tolerance: None
Push to: None
Error: "blip is not supported yet. Only ['bart', 'beit', 'bert', 'big_bird', 'bigbird_pegasus', 'blenderbot', 'blenderbot_small', 'bloom', 'convnext', 'ctrl', 'cvt', 'data2vec', 'distilbert', 'ernie', 'gpt2', 'gpt_neo', 'levit', 'm2m_100', 'marian', 'mobilebert', 'mobilevit', 'mvp', 'pegasus', 'plbart', 'roberta', 'roformer', 'segformer', 'splinter', 'squeezebert', 't5', 'vit', 'yolos'] are supported. If you want to support blip please propose a PR or open up an issue."
Hello @99s42m!
Thanks for reporting this! We'll take a look and see if we can add support for blip soon. Meanwhile, you could try to use coremltools directly. coremltools is a Python package created by Apple that can convert PyTorch and TensorFlow models to Core ML. This conversion Space is based on exporters, which in turn uses coremltools under the hood.
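In case it helps, here is a minimal sketch of what that workflow usually looks like. The torchvision model, input shape, and names below are placeholders to illustrate the flow, not anything BLIP-specific: trace the PyTorch model with an example input, then hand the traced graph to ct.convert.

import torch
import torchvision
import coremltools as ct

# Any PyTorch model in eval mode; MobileNetV2 is only a stand-in here,
# BLIP would need its own example inputs and wrapping.
torch_model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()

# torch.jit.trace records the operations executed for one example input
example_input = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(torch_model, example_input)

# coremltools converts the traced graph into a Core ML model package
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=example_input.shape)],
    convert_to="mlprogram",
)
mlmodel.save("MobileNetV2.mlpackage")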
@pcuenq Thank you so much for your response.
Here's where I have gotten so far:
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP processor and captioning model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Download an example image
img_url = 'https://images.nationalgeographic.org/image/upload/t_edhub_resource_key_image/v1638882947/EducationHub/photos/tourists-at-victoria-falls.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the model continues the given prompt
text = "The main geographical feature in this photo is a"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
import coremltools as ct
import torch
import torchvision

# Attempt to trace the model so it can be converted with coremltools
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, inputs['input_ids'])
out = traced_model(example_input)
The above code throws the following error:
RuntimeError: Input type (long int) and bias type (float) should be the same
I understand that you are busy and this might be a basic question, but any help would be greatly appreciated.
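From searching around, my guess (which may well be wrong) is that torch.jit.trace is feeding the integer input_ids tensor into the model where it expects the float pixel_values image tensor, so the token IDs end up in the vision encoder's convolution and cause the dtype mismatch. Below is a sketch of what I have been trying instead, tracing only the vision encoder with a float example input; the wrapper class, the 384x384 shape, and the choice to export just the vision side are my own assumptions, not something I found in the docs.

import torch
import coremltools as ct

# Wrapper so the traced module returns a plain tensor instead of the
# ModelOutput object that transformers models return by default
class VisionEncoder(torch.nn.Module):
    def __init__(self, vision_model):
        super().__init__()
        self.vision_model = vision_model

    def forward(self, pixel_values):
        # return_dict=False gives a tuple; [0] is the last hidden state tensor
        return self.vision_model(pixel_values, return_dict=False)[0]

model.eval()

# Assumption: the large BLIP checkpoint uses 384x384 images; pixel_values
# must be a float tensor, unlike the integer input_ids I traced with before
example_pixel_values = torch.rand(1, 3, 384, 384)

with torch.no_grad():
    traced_vision = torch.jit.trace(VisionEncoder(model.vision_model), example_pixel_values)

# Hand the traced graph to coremltools
vision_mlmodel = ct.convert(
    traced_vision,
    inputs=[ct.TensorType(name="pixel_values", shape=example_pixel_values.shape)],
    convert_to="mlprogram",
)

I realize this only covers the vision side and that exporting the text decoder and the generate loop is a separate problem, so any pointers on that part would still be very welcome.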