MobileCLIP CoreML Models

The models described here are CoreML conversions of Apple's original MobileCLIP models. For more details, refer to MobileCLIP on HuggingFace and MobileCLIP on GitHub.

Models are provided for four subarchitectures:

  • MobileCLIP-S0: Designed for lightweight, fast inference, suitable for edge devices with limited computational resources.
  • MobileCLIP-S1: Balances model complexity and performance, offering a good trade-off for a range of applications.
  • MobileCLIP-S2: Targets higher accuracy, ideal for applications that can trade some inference speed for better results.
  • MobileCLIP-B: Delivers the highest accuracy of the family, intended for environments with ample computational resources.

Each subarchitecture ships as a pair of CoreML models, one for the text encoder and one for the image encoder:

| Model         | CLIP Text              | CLIP Image              |
|---------------|------------------------|-------------------------|
| MobileCLIP-S0 | clip_text_s0.mlpackage | clip_image_s0.mlpackage |
| MobileCLIP-S1 | clip_text_s1.mlpackage | clip_image_s1.mlpackage |
| MobileCLIP-S2 | clip_text_s2.mlpackage | clip_image_s2.mlpackage |
| MobileCLIP-B  | clip_text_B.mlpackage  | clip_image_B.mlpackage  |

For detailed implementation and architecture specifics, refer to the MobileCLIP GitHub repository.

Example Usage

An example of using these CoreML models in a Swift application for iOS can be found in the CLIP-Finder project.

CoreML Parameters:

| Model     | Input Name | Input Shape | Input DataType | Output Name       | Output Shape | Output DataType |
|-----------|------------|-------------|----------------|-------------------|--------------|-----------------|
| CLIP Text | input_text | (1, 77)     | INT32          | output_embeddings | (1, 512)     | FLOAT16         |

| Model      | Input Name  | Input Width | Input Height | Input ColorSpace | Output Name       | Output Shape | Output DataType |
|------------|-------------|-------------|--------------|------------------|-------------------|--------------|-----------------|
| CLIP Image | input_image | 256         | 256          | RGB              | output_embeddings | (1, 512)     | FLOAT16         |
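For quick experimentation outside of Swift, the encoders can also be driven from Python with coremltools on macOS. The minimal sketch below assumes the S0 pair from the table above; the zero-filled token array is only a placeholder for a real CLIP BPE tokenizer (e.g. open_clip's tokenize), and example.jpg is a hypothetical image file.

```python
import numpy as np
import coremltools as ct
from PIL import Image

# Load the S0 encoders (filenames from the table above).
text_model = ct.models.MLModel("clip_text_s0.mlpackage")
image_model = ct.models.MLModel("clip_image_s0.mlpackage")

# Text input: (1, 77) INT32 token ids. A real CLIP BPE tokenizer is
# assumed here (e.g. open_clip.tokenize); zeros are only a placeholder.
tokens = np.zeros((1, 77), dtype=np.int32)
text_emb = text_model.predict({"input_text": tokens})["output_embeddings"]

# Image input: 256x256 RGB, passed as a PIL image.
img = Image.open("example.jpg").convert("RGB").resize((256, 256))
image_emb = image_model.predict({"input_image": img})["output_embeddings"]

# Cosine similarity between the two (1, 512) embeddings.
t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
v = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
print(float(t @ v.T))
```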

CoreML Profile (Benchmark) on Apple M1

| Prediction Times (Apple M1) | CPU + ANE | CPU + GPU | CPU Only |
|-----------------------------|-----------|-----------|----------|
| clip_image_s0               | 1.4 ms    | 7.4 ms    | 12.7 ms  |
| clip_image_s1               | 2.1 ms    | 13.3 ms   | 21.8 ms  |
| clip_image_s2               | 3.0 ms    | 19.0 ms   | 28.5 ms  |
| clip_image_b                | 12.4 ms   | 36.2 ms   | 38.1 ms  |
| clip_text_s0                | 1.1 ms    | 4.1 ms    | 4.8 ms   |
| clip_text_s1                | 2.0 ms    | 7.1 ms    | 9.5 ms   |
| clip_text_s2                | 2.0 ms    | 7.1 ms    | 10 ms    |
| clip_text_b                 | 2.0 ms    | 7.2 ms    | 9.8 ms   |

The profile was conducted using CoreMLProfiler.
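The three columns map directly to CoreML compute-unit settings, which coremltools can pin at load time. As a rough cross-check without the profiler, a minimal sketch is shown below, assuming the clip_text_s0.mlpackage from the table above; note that CPU_AND_NE requires macOS 13+ and a recent coremltools.

```python
import time
import numpy as np
import coremltools as ct

tokens = np.zeros((1, 77), dtype=np.int32)  # placeholder token ids

# The compute-unit options mirror the table columns above.
for units in (ct.ComputeUnit.CPU_AND_NE,
              ct.ComputeUnit.CPU_AND_GPU,
              ct.ComputeUnit.CPU_ONLY):
    model = ct.models.MLModel("clip_text_s0.mlpackage", compute_units=units)
    model.predict({"input_text": tokens})  # warm-up run
    start = time.perf_counter()
    for _ in range(100):
        model.predict({"input_text": tokens})
    ms = (time.perf_counter() - start) / 100 * 1000
    print(f"{units.name}: {ms:.2f} ms")
```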

The following example notebooks demonstrate the conversion to CoreML; a sketch of the general pattern follows the list:

  1. CLIPImageModel to CoreML Open In Colab

    • This notebook demonstrates the process of converting a CLIP image model to CoreML format.
  2. CLIPTextModel to CoreML Open In Colab

    • This notebook demonstrates the process of converting a CLIP text model to CoreML format.
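Roughly, both notebooks follow the standard coremltools pattern: trace the PyTorch encoder, then convert with the input/output names from the parameter table above. The sketch below uses a tiny stand-in module so it runs self-contained; the actual notebooks trace the real MobileCLIP encoders from the ml-mobileclip repository.

```python
import coremltools as ct
import torch

# Tiny stand-in for the image encoder so this sketch runs self-contained;
# the notebooks trace the actual MobileCLIP image encoder instead.
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 512, kernel_size=3, stride=2),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),  # -> (1, 512), matching the expected embedding shape
).eval()

example = torch.rand(1, 3, 256, 256)
traced = torch.jit.trace(encoder, example)

mlmodel = ct.convert(
    traced,
    # Names and sizes mirror the CoreML parameter table above.
    inputs=[ct.ImageType(name="input_image", shape=example.shape,
                         color_layout=ct.colorlayout.RGB)],
    outputs=[ct.TensorType(name="output_embeddings")],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("clip_image_s0.mlpackage")
```

For the text encoder, the image input is replaced by a tensor input such as ct.TensorType(name="input_text", shape=(1, 77), dtype=numpy.int32).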