MobileCLIP CoreML Models

The models described here are CoreML conversions of Apple's original MobileCLIP models. For more details, refer to MobileCLIP on HuggingFace and MobileCLIP on GitHub.

Models are provided for four subarchitectures:

  • MobileCLIP-S0: Designed for lightweight, fast inference, suitable for edge devices with limited computational resources.
  • MobileCLIP-S1: Balances model complexity and performance, offering a good trade-off for a range of applications.
  • MobileCLIP-S2: Targets higher accuracy, ideal for applications that can trade some inference speed for better results.
  • MobileCLIP-B: Delivers the highest accuracy of the family, intended for environments with ample computational resources.

Each subarchitecture ships as a pair of CoreML models, one for the text encoder and one for the image encoder:

| Model         | CLIP Text              | CLIP Image              |
|---------------|------------------------|-------------------------|
| MobileCLIP-S0 | clip_text_s0.mlpackage | clip_image_s0.mlpackage |
| MobileCLIP-S1 | clip_text_s1.mlpackage | clip_image_s1.mlpackage |
| MobileCLIP-S2 | clip_text_s2.mlpackage | clip_image_s2.mlpackage |
| MobileCLIP-B  | clip_text_B.mlpackage  | clip_image_B.mlpackage  |

For detailed implementation and architecture specifics, refer to the MobileCLIP GitHub repository.

Example Usage

An example of using these CoreML models in a Swift application for iOS can be found in the CLIP-Finder project.

CoreML Parameters:

| Model     | Input Name | Input Shape | Input DataType | Output Name       | Output Shape | Output DataType |
|-----------|------------|-------------|----------------|-------------------|--------------|-----------------|
| CLIP Text | input_text | (1, 77)     | INT32          | output_embeddings | (1, 512)     | FLOAT16         |

| Model      | Input Name  | Input Width | Input Height | Input ColorSpace | Output Name       | Output Shape | Output DataType |
|------------|-------------|-------------|--------------|------------------|-------------------|--------------|-----------------|
| CLIP Image | input_image | 256         | 256          | RGB              | output_embeddings | (1, 512)     | FLOAT16         |
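For quick experimentation outside of Swift, the encoders can also be driven from Python with coremltools on macOS. The minimal sketch below assumes the S0 pair from the table above; the zero-filled token array is only a placeholder for a real CLIP BPE tokenizer (e.g. open_clip's tokenize), and example.jpg is a hypothetical image file.

```python
import numpy as np
import coremltools as ct
from PIL import Image

# Load the S0 encoders (filenames from the table above).
text_model = ct.models.MLModel("clip_text_s0.mlpackage")
image_model = ct.models.MLModel("clip_image_s0.mlpackage")

# Text input: (1, 77) INT32 token ids. A real CLIP BPE tokenizer is
# assumed here (e.g. open_clip.tokenize); zeros are only a placeholder.
tokens = np.zeros((1, 77), dtype=np.int32)
text_emb = text_model.predict({"input_text": tokens})["output_embeddings"]

# Image input: 256x256 RGB, passed as a PIL image.
img = Image.open("example.jpg").convert("RGB").resize((256, 256))
image_emb = image_model.predict({"input_image": img})["output_embeddings"]

# Cosine similarity between the two (1, 512) embeddings.
t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
v = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
print(float(t @ v.T))
```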

CoreML Profile (Benchmark) on Apple M1

| Prediction Times (Apple M1) | CPU + ANE | CPU + GPU | CPU Only |
|-----------------------------|-----------|-----------|----------|
| clip_image_s0               | 1.4 ms    | 7.4 ms    | 12.7 ms  |
| clip_image_s1               | 2.1 ms    | 13.3 ms   | 21.8 ms  |
| clip_image_s2               | 3.0 ms    | 19.0 ms   | 28.5 ms  |
| clip_image_b                | 12.4 ms   | 36.2 ms   | 38.1 ms  |
| clip_text_s0                | 1.1 ms    | 4.1 ms    | 4.8 ms   |
| clip_text_s1                | 2.0 ms    | 7.1 ms    | 9.5 ms   |
| clip_text_s2                | 2.0 ms    | 7.1 ms    | 10 ms    |
| clip_text_b                 | 2.0 ms    | 7.2 ms    | 9.8 ms   |

The profile was conducted using CoreMLProfiler.
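The three columns map directly to CoreML compute-unit settings, which coremltools can pin at load time. As a rough cross-check without the profiler, a minimal sketch is shown below, assuming the clip_text_s0.mlpackage from the table above; note that CPU_AND_NE requires macOS 13+ and a recent coremltools.

```python
import time
import numpy as np
import coremltools as ct

tokens = np.zeros((1, 77), dtype=np.int32)  # placeholder token ids

# The compute-unit options mirror the table columns above.
for units in (ct.ComputeUnit.CPU_AND_NE,
              ct.ComputeUnit.CPU_AND_GPU,
              ct.ComputeUnit.CPU_ONLY):
    model = ct.models.MLModel("clip_text_s0.mlpackage", compute_units=units)
    model.predict({"input_text": tokens})  # warm-up run
    start = time.perf_counter()
    for _ in range(100):
        model.predict({"input_text": tokens})
    ms = (time.perf_counter() - start) / 100 * 1000
    print(f"{units.name}: {ms:.2f} ms")
```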

The following example notebooks demonstrate the conversion to CoreML; a sketch of the general pattern follows the list:

  1. CLIPImageModel to CoreML Open In Colab

    • This notebook demonstrates the process of converting a CLIP image model to CoreML format.
  2. CLIPTextModel to CoreML Open In Colab

    • This notebook demonstrates the process of converting a CLIP text model to CoreML format.
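Roughly, both notebooks follow the standard coremltools pattern: trace the PyTorch encoder, then convert with the input/output names from the parameter table above. The sketch below uses a tiny stand-in module so it runs self-contained; the actual notebooks trace the real MobileCLIP encoders from the ml-mobileclip repository.

```python
import coremltools as ct
import torch

# Tiny stand-in for the image encoder so this sketch runs self-contained;
# the notebooks trace the actual MobileCLIP image encoder instead.
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 512, kernel_size=3, stride=2),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),  # -> (1, 512), matching the expected embedding shape
).eval()

example = torch.rand(1, 3, 256, 256)
traced = torch.jit.trace(encoder, example)

mlmodel = ct.convert(
    traced,
    # Names and sizes mirror the CoreML parameter table above.
    inputs=[ct.ImageType(name="input_image", shape=example.shape,
                         color_layout=ct.colorlayout.RGB)],
    outputs=[ct.TensorType(name="output_embeddings")],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("clip_image_s0.mlpackage")
```

For the text encoder, the image input is replaced by a tensor input such as ct.TensorType(name="input_text", shape=(1, 77), dtype=numpy.int32).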