CLIP ViT-B/32: Core AI export (official recipe)

fp16 static export of openai/clip-vit-base-patch32 via the official recipe in apple/coreai-models (models/clip/export.py), with one change: text inputs are padded to the full 77-token context (padding="max_length") rather than traced at the recipe's 7-token example length, so free-text queries work.
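The padding step can be sketched in Swift as follows. This is a minimal illustration, not the recipe's code: `padToContext`, `contextLength`, and `padTokenID` are hypothetical names, the pad id 49407 assumes the Hugging Face CLIP tokenizer convention of padding with the end-of-text token, and the input ids are placeholders rather than real tokenizer output.

```swift
// Hypothetical helper, not part of CoreAIKit or the export recipe.
// Assumes CLIP's 77-token context and that the pad id equals the
// end-of-text id (49407), as in the Hugging Face CLIP tokenizer.
let contextLength = 77
let padTokenID: Int32 = 49407

/// Pads (or truncates) token ids to `contextLength` and builds the mask.
/// Truncation here simply cuts the tail; a real tokenizer would keep the
/// end-of-text token in the last position.
func padToContext(_ ids: [Int32]) -> (inputIDs: [Int32], attentionMask: [Int32]) {
    let kept = Array(ids.prefix(contextLength))
    let padCount = contextLength - kept.count
    let inputIDs = kept + [Int32](repeating: padTokenID, count: padCount)
    let attentionMask = [Int32](repeating: 1, count: kept.count)
        + [Int32](repeating: 0, count: padCount)
    return (inputIDs, attentionMask)
}

// Placeholder ids: start-of-text, three made-up word ids, end-of-text.
let example = padToContext([49406, 320, 1125, 539, 49407])
print(example.inputIDs.count, example.attentionMask.reduce(0, +))
```

Because the export is static, every query ends up at the same length; the attention mask is what tells the text encoder which of the 77 positions hold real tokens.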

Runs out of the box with CoreAIKit's ImageTextEncoder:

let encoder = try await ImageTextEncoder()   // downloads this repo
let imageVec = try await encoder.encode(image: cgImage)
let textVec  = try await encoder.encode(text: "red bike at the beach")
let score = ImageTextEncoder.cosineSimilarity(imageVec, textVec)
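For readers who want to see what the score means, here is a hedged sketch of cosine similarity on plain Float arrays. The function `cosine` and the two-dimensional vectors are illustrative stand-ins; the actual implementation and argument types of `ImageTextEncoder.cosineSimilarity` are not shown in this document.

```swift
// Illustrative stand-in; not the CoreAIKit implementation.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have equal length")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}

// Ranks made-up caption embeddings against one made-up image embedding.
let image: [Float] = [0.6, 0.8]
let captions: [(String, [Float])] = [
    ("blue car", [0.8, -0.6]),
    ("red bike", [0.6, 0.8]),
]
let ranked = captions
    .map { (label: $0.0, score: cosine(image, $0.1)) }
    .sorted { $0.score > $1.score }
print(ranked.map { $0.label })
```

When both vectors are already L2-normalized, the denominator is 1 and the score reduces to a plain dot product.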

Bundle layout

model/
├── clip-vit-base-patch32_float16_static.aimodel
└── tokenizer.json

Graph contract

kind    name                                  shape             dtype
input   pixel_values                          [1, 3, 224, 224]  fp16
input   input_ids                             [3, 77]           int32
input   attention_mask                        [3, 77]           int32
output  image_embeds                          [1, 512]          fp16, L2-normalized
output  text_embeds                           [3, 512]          fp16, L2-normalized
output  logits_per_image / logits_per_text    [1, 3] / [3, 1]   fp16
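One common way to read the `logits_per_image` row is to apply a softmax across the three text prompts, which yields a probability per prompt. The sketch below assumes the fp16 logits have already been converted to Float; `softmax` and the logit values are illustrative and not taken from this model.

```swift
import Foundation

/// Numerically stable softmax over one row of logits.
func softmax(_ logits: [Float]) -> [Float] {
    guard let maxLogit = logits.max() else { return [] }
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

// Made-up values standing in for one logits_per_image row of shape [1, 3].
let logitsPerImage: [Float] = [24.5, 19.0, 18.2]
let probabilities = softmax(logitsPerImage)

// Index of the best-matching text prompt.
let best = probabilities.indices.max { probabilities[$0] < probabilities[$1] }!
print(best, probabilities)
```

Subtracting the row maximum before exponentiating keeps the computation from overflowing, which matters because CLIP-style logits are scaled similarities and can be large.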

Preprocessing: 224×224 resize + CLIP mean/std normalization (handled by ImageTextEncoder).
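For anyone feeding the graph directly, the normalization step might look like the sketch below. It assumes the image has already been resized to 224×224 and is available as interleaved 8-bit RGB; `normalize(rgb:width:height:)` is a hypothetical helper, the constants are the standard CLIP channel statistics, and the final conversion to fp16 is left out. How ImageTextEncoder does this internally is not shown in this document.

```swift
// Standard CLIP per-channel mean and std, in RGB order.
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd: [Float] = [0.26862954, 0.26130258, 0.27577711]

/// Converts interleaved 8-bit RGB pixels into normalized, channel-first
/// floats laid out as [3, height, width].
func normalize(rgb: [UInt8], width: Int, height: Int) -> [Float] {
    precondition(rgb.count == width * height * 3, "expected interleaved RGB")
    let plane = width * height
    var out = [Float](repeating: 0, count: 3 * plane)
    for p in 0..<plane {
        for c in 0..<3 {
            let value = Float(rgb[p * 3 + c]) / 255   // scale to [0, 1]
            out[c * plane + p] = (value - clipMean[c]) / clipStd[c]
        }
    }
    return out
}

// An all-white test image, just to exercise the function.
let white = [UInt8](repeating: 255, count: 224 * 224 * 3)
let tensor = normalize(rgb: white, width: 224, height: 224)
print(tensor.count)  // 150528 = 3 * 224 * 224
```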

Performance

M4 Max: ~3.7 ms per image on the Neural Engine (fp16). Requires macOS 27 beta / iOS 27 beta (device only; the CoreAI framework is not in the iOS Simulator SDK).

License

Model weights: MIT (OpenAI CLIP); see the upstream repo. Export recipe: BSD-3-Clause (apple/coreai-models).
