CLIP ViT-B/32 — Core AI export (official recipe)

An fp16 static export of openai/clip-vit-base-patch32, built with the official recipe from apple/coreai-models (models/clip/export.py). It differs from the recipe in one respect: text inputs are padded to the full 77-token context (padding="max_length") instead of being traced at the recipe's 7-token example length, so free-text queries work.
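As a rough illustration of what padding to the full context means for the `input_ids` and `attention_mask` inputs, here is a minimal sketch in plain Swift. The helper name `padToContext` is hypothetical, and the pad ID 49407 (CLIP's `<|endoftext|>` token) is an assumption that should be checked against tokenizer.json. This is not the code CoreAIKit runs.

```swift
// Hypothetical helper, not part of CoreAIKit: pad token IDs to a fixed context length.
// Assumption: padID 49407 is CLIP's <|endoftext|> token; verify against tokenizer.json.
func padToContext(_ ids: [Int32], contextLength: Int = 77, padID: Int32 = 49407)
    -> (inputIDs: [Int32], attentionMask: [Int32]) {
    // Plain truncation for over-long inputs; this sketch does not try to
    // preserve the trailing end-of-text token when it truncates.
    let kept = Array(ids.prefix(contextLength))
    let padCount = contextLength - kept.count

    // Real tokens get mask 1, padding positions get mask 0.
    let inputIDs = kept + Array(repeating: padID, count: padCount)
    let attentionMask = Array(repeating: Int32(1), count: kept.count)
        + Array(repeating: Int32(0), count: padCount)
    return (inputIDs, attentionMask)
}

// Example: the two middle IDs are made-up placeholders for a short query.
let padded = padToContext([49406, 320, 1125, 49407])
```

Each of the three rows in the `[3, 77]` inputs would be padded this way, so the graph always sees the shape it was exported with.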

Runs out of the box with CoreAIKit's ImageTextEncoder:

let encoder = try await ImageTextEncoder()   // downloads this repo
let imageVec = try await encoder.encode(image: cgImage)
let textVec  = try await encoder.encode(text: "red bike at the beach")
let score = ImageTextEncoder.cosineSimilarity(imageVec, textVec)
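For readers who want to see what the similarity score computes, the following is a minimal sketch of cosine similarity over plain `[Float]` arrays. It is a standalone free function written for illustration, not the library's implementation, and it assumes the embeddings can be read out as `[Float]`.

```swift
// Illustrative sketch: cosine similarity between two equal-length vectors.
// Assumption: embeddings are available as [Float]; the real return type may differ.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    // Guard against zero-length vectors to avoid dividing by zero.
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}
```

When both vectors are already L2-normalized, the denominator is 1 and the score reduces to a plain dot product.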

Bundle layout

model/
├── clip-vit-base-patch32_float16_static.aimodel
└── tokenizer.json

Graph contract

role	name	shape	dtype
input	`pixel_values`	[1, 3, 224, 224]	fp16
input	`input_ids`	[3, 77]	int32
input	`attention_mask`	[3, 77]	int32
output	`image_embeds`	[1, 512]	fp16, L2-normalized
output	`text_embeds`	[3, 512]	fp16, L2-normalized
output	`logits_per_image` / `logits_per_text`	[1, 3] / [3, 1]	fp16
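The `[1, 3]` row of `logits_per_image` scores one image against three texts. A common way to turn such a row into probabilities is a softmax. The sketch below assumes the fp16 logits have already been converted to `[Float]`; the function is illustrative and not part of any API named above.

```swift
import Foundation

// Illustrative sketch: numerically stable softmax over one row of logits.
// Assumption: the fp16 logits were converted to [Float] beforehand.
func softmax(_ logits: [Float]) -> [Float] {
    guard let maxLogit = logits.max() else { return [] }
    // Subtracting the maximum keeps exp() from overflowing on large logits.
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}
```

The index of the largest probability identifies which of the three texts best matches the image.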

Preprocessing: 224×224 resize + CLIP mean/std normalization (handled by ImageTextEncoder).
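To make the normalization step concrete, here is a minimal sketch for a single RGB pixel. The mean and standard deviation are the per-channel constants published with OpenAI CLIP; the function name `normalizePixel` is hypothetical, and ImageTextEncoder's actual implementation may differ.

```swift
// CLIP per-channel normalization constants (RGB order).
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd: [Float] = [0.26862954, 0.26130258, 0.27577711]

// Hypothetical helper: scale an 8-bit RGB pixel to [0, 1],
// then subtract the mean and divide by the std for each channel.
func normalizePixel(r: UInt8, g: UInt8, b: UInt8) -> [Float] {
    let rgb = [Float(r), Float(g), Float(b)].map { $0 / 255 }
    return (0..<3).map { (rgb[$0] - clipMean[$0]) / clipStd[$0] }
}
```

Applying this to every pixel of the resized image, in channel-first layout, would produce values matching the `[1, 3, 224, 224]` `pixel_values` input.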

Performance

M4 Max: ~3.7 ms per image on the Neural Engine (fp16). Requires macOS 27 beta or iOS 27 beta on a physical device; the CoreAI framework is not in the iOS Simulator SDK.

License

Model weights: MIT (OpenAI CLIP); see the upstream repo. Export recipe: BSD-3-Clause (apple/coreai-models).
