CLIP ViT-B/32: Core AI export (official recipe)
fp16 static export of openai/clip-vit-base-patch32
via apple/coreai-models' official recipe (models/clip/export.py), with one change: text
inputs are padded to the full 77-token context (padding="max_length") so free-text
queries work, instead of the recipe's 7-token example trace.
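As a rough illustration of what padding to the full context means for the `input_ids` and `attention_mask` inputs, here is a minimal sketch. It is not the recipe's code: `padToContext` is a hypothetical helper, the token IDs in the example are made up, and the pad ID 49407 assumes the Hugging Face CLIP tokenizer convention in which `<|endoftext|>` doubles as the pad token.

```swift
// Assumed constants (Hugging Face CLIP tokenizer convention, not taken from this repo).
let contextLength = 77
let padTokenID: Int32 = 49407   // assumption: <|endoftext|> is reused as the pad token

/// Pads token IDs to a fixed length and builds the matching attention mask.
/// Inputs longer than `length` are cut off with a plain prefix; a real tokenizer
/// would also keep the end-of-text token when truncating.
func padToContext(_ ids: [Int32],
                  length: Int = contextLength,
                  padID: Int32 = padTokenID) -> (inputIDs: [Int32], attentionMask: [Int32]) {
    let kept = Array(ids.prefix(length))
    let padCount = length - kept.count
    let inputIDs = kept + [Int32](repeating: padID, count: padCount)
    // 1 marks a real token, 0 marks padding.
    let mask = [Int32](repeating: 1, count: kept.count)
             + [Int32](repeating: 0, count: padCount)
    return (inputIDs, mask)
}

// Hypothetical already-tokenized query: start token, three word pieces, end token.
let query: [Int32] = [49406, 1000, 2000, 3000, 49407]
let padded = padToContext(query)
print(padded.inputIDs.count, padded.attentionMask.reduce(0, +)) // prints "77 5"
```

Because the graph is static, every query has to arrive at this fixed length, which is why a trace over a 7-token example would not accept arbitrary free text.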
Runs out of the box with CoreAIKit's
ImageTextEncoder:
let encoder = try await ImageTextEncoder() // downloads this repo
let imageVec = try await encoder.encode(image: cgImage)
let textVec = try await encoder.encode(text: "red bike at the beach")
let score = ImageTextEncoder.cosineSimilarity(imageVec, textVec)
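For readers who want to see what the score represents, the sketch below computes cosine similarity over plain `[Float]` arrays. It is an assumption-laden stand-in, not the implementation behind `ImageTextEncoder.cosineSimilarity`, and the element type of the real embedding vectors may differ.

```swift
/// Cosine similarity between two equal-length vectors.
/// Returns 0 when either vector has zero length in the Euclidean sense.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}

// Toy 2-D unit vectors; real embeddings from this model have 512 dimensions.
let u: [Float] = [0.6, 0.8]
let v: [Float] = [0.8, 0.6]
let similarity = cosineSimilarity(u, v)   // close to 0.96
print(similarity)
```

Since the model's embeddings are already L2-normalized, the denominator is close to 1 and the score reduces to a dot product.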
Bundle layout
model/
├── clip-vit-base-patch32_float16_static.aimodel
└── tokenizer.json
Graph contract
| kind | name | shape | dtype |
|---|---|---|---|
| input | pixel_values | [1, 3, 224, 224] | fp16 |
| input | input_ids | [3, 77] | int32 |
| input | attention_mask | [3, 77] | int32 |
| output | image_embeds | [1, 512] | fp16, L2-normalized |
| output | text_embeds | [3, 512] | fp16, L2-normalized |
| output | logits_per_image / logits_per_text | [1, 3] / [3, 1] | fp16 |
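One common way to read the `logits_per_image` row, following upstream CLIP usage, is to apply a softmax across the three text slots and pick the highest probability. The sketch below assumes the logits have already been copied into a `[Float]`; the values are invented for illustration and are not output from this model.

```swift
import Foundation

/// Numerically stable softmax over one row of logits.
func softmax(_ logits: [Float]) -> [Float] {
    guard let maxLogit = logits.max() else { return [] }
    // Subtracting the maximum keeps exp() from overflowing.
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

// Hypothetical logits_per_image row of shape [1, 3]: one image scored against three texts.
let logitsPerImage: [Float] = [24.5, 19.0, 18.2]
let probs = softmax(logitsPerImage)
let bestText = probs.indices.max(by: { probs[$0] < probs[$1] })
print(probs, bestText ?? -1)
```

The probabilities are only comparable within one call, because they are normalized across whichever three texts were passed in together.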
Preprocessing: 224×224 resize + CLIP mean/std normalization (handled by ImageTextEncoder).
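If you feed `pixel_values` yourself instead of going through ImageTextEncoder, the normalization step can be sketched as follows. The mean and standard deviation are the constants published with OpenAI CLIP's preprocessing; `normalizePixel` is a hypothetical helper, and resizing and the channel-first fp16 layout are left out.

```swift
// Per-channel RGB constants from the OpenAI CLIP preprocessing.
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd: [Float] = [0.26862954, 0.26130258, 0.27577711]

/// Maps one 8-bit RGB pixel to normalized channel values:
/// scale to [0, 1], subtract the channel mean, divide by the channel std.
func normalizePixel(r: UInt8, g: UInt8, b: UInt8) -> [Float] {
    let rgb: [Float] = [Float(r) / 255, Float(g) / 255, Float(b) / 255]
    var normalized: [Float] = []
    for c in 0..<3 {
        normalized.append((rgb[c] - clipMean[c]) / clipStd[c])
    }
    return normalized
}

let white = normalizePixel(r: 255, g: 255, b: 255)
let black = normalizePixel(r: 0, g: 0, b: 0)
print(white, black)
```

A full pipeline would apply this to every pixel of the resized 224×224 image and write the result into a [1, 3, 224, 224] buffer.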
Performance
M4 Max: ~3.7 ms per image on the Neural Engine (fp16). Requires macOS 27 beta or iOS 27 beta. On iOS it must run on a device, because the CoreAI framework is not in the iOS Simulator SDK.
License
Model weights: MIT (OpenAI CLIP); see the upstream repo. Export recipe: BSD-3-Clause (apple/coreai-models).