CoreML versions of laion/CLIP-ViT-H-14-laion2B-s32B-b79K.

On my baseline M1 they run about 4x faster than the equivalent PyTorch models on the mps device (~6 image embeddings per second vs ~1.5 images/sec for torch+mps). According to asitop profiling, they also draw about 3/4 of the power while doing so (6 W average vs 8 W for torch+mps), which works out to roughly 1 J per image versus about 5 J.

There are separate models for the image and text encoders. Sorry, I don't know how to put them both into one file.

Conversion code is in clip-to-coreml.ipynb.

Usage

You'll need to use the original CLIP preprocessor (or write your own preprocessing). For example:

from transformers import CLIPProcessor
import coremltools as ct
from PIL import Image

preprocessor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

model_coreml_image = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.image-encoder.mlprogram')
model_coreml_text = ct.models.MLModel('CLIP-ViT-H-14-laion2B-s32B-b79K.text-encoder.mlprogram')

image = Image.open("example.jpg")
# Images only: the processor resizes, crops and normalizes into pixel_values
preprocessed_image = preprocessor(text=None, images=image, return_tensors="pt", padding=True)
image_embedding = model_coreml_image.predict({'input_image_preprocessed': preprocessed_image.pixel_values})['output_embedding']

text = 'example text'
# Text only: the processor tokenizes the string into input_ids
preprocessed_text = preprocessor(text=text, images=None, return_tensors="pt", padding=True)
text_embedding = model_coreml_text.predict({'input_text_token_ids': preprocessed_text.input_ids})['output_embedding']
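
Once you have both embeddings, a common next step with CLIP is to compare them by cosine similarity. The snippet below is a minimal sketch, not part of the conversion notebook: cosine_similarity is a hypothetical helper name, and the random vectors are stand-ins (with an arbitrary length of 8) so that it runs without the models. In practice you would pass image_embedding and text_embedding from the code above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings (hypothetical helper).

    Inputs are flattened, so a (1, D) array from predict() is accepted.
    """
    a = np.asarray(a, dtype=np.float32).ravel()
    b = np.asarray(b, dtype=np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors so the sketch is self-contained; the length 8 is
# arbitrary and is NOT the real embedding size of the model.
rng = np.random.default_rng(0)
fake_image_embedding = rng.standard_normal((1, 8))
fake_text_embedding = rng.standard_normal((1, 8))

score = cosine_similarity(fake_image_embedding, fake_text_embedding)
print(f"similarity: {score:.3f}")
```

Higher scores mean the text is a closer match for the image. When ranking many captions against one image, only the relative order of the scores matters.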

Please credit me if you use this.


license: mit
