CLIP ViT-B/16 Vision Encoder (GGUF)

GGUF conversion of openai/clip-vit-base-patch16 for use with CrispEmbed.

Architecture: CLIP ViT-B/16 vision encoder
Parameters: 86M
Output: 512-dimensional L2-normalized embeddings (768d internal, projected to 512d via visual_projection)
Input: 224x224 RGB image with CLIP normalization
Size: ~329 MB
Source: openai/clip-vit-base-patch16
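
For readers preparing images outside the CLI, "CLIP normalization" conventionally means scaling pixels to [0, 1] and standardizing each RGB channel with the mean and standard deviation from the original OpenAI CLIP preprocessing config. The following is a minimal sketch of that step using NumPy. The function names are hypothetical, and whether CrispEmbed performs exactly these steps internally is an assumption, not something this card states. The usual bicubic resize of the shorter side to 224 is omitted because it needs an image library.

```python
import numpy as np

# Per-channel RGB statistics from the original OpenAI CLIP preprocessing config.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def center_crop(image, size=224):
    """Crop the central size x size window from an H x W x 3 array.

    Hypothetical helper: assumes the image was already resized so that
    both sides are at least `size` pixels.
    """
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]


def clip_normalize(image_u8):
    """Convert a uint8 H x W x 3 RGB array to a float32 3 x H x W array.

    Pixels are scaled to [0, 1], then standardized per channel.
    """
    if image_u8.ndim != 3 or image_u8.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    x = image_u8.astype(np.float32) / 255.0
    x = (x - CLIP_MEAN) / CLIP_STD      # broadcasts over the channel axis
    return np.transpose(x, (2, 0, 1))   # channels-first layout


# Example with a synthetic all-black image (not real data).
black = np.zeros((256, 300, 3), dtype=np.uint8)
tensor = clip_normalize(center_crop(black))
```

The channels-first layout is a common convention for ViT inputs; a given runtime may expect a different layout.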

Usage

# Embed a single image
crispembed -m clip-vit-base-patch16 --image photo.jpg

# Batch processing
crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin

Cross-modal pairing

This model shares an embedding space with cstr/clip-text-base-GGUF. Use both to perform zero-shot image-text matching:

crispembed -m clip-vit-base-patch16 --image photo.jpg        # vision embedding
crispembed -m clip-text-base "a photo of a cat"               # text embedding

Cosine similarity between the two outputs measures image-text alignment.
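
The comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the two embeddings have already been loaded into Python as sequences of floats (how CrispEmbed serializes its output is not covered here), the helper names are hypothetical, and the 4-dimensional vectors are toy stand-ins for real 512-dimensional embeddings. When both inputs are L2-normalized, the cosine similarity reduces to a plain dot product; the explicit norm division below just keeps the function safe for unnormalized input.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cannot compare a zero vector")
    return float(np.dot(a, b) / denom)


def rank_captions(image_vec, caption_vecs):
    """Return (caption, score) pairs sorted from best to worst match.

    Hypothetical helper for zero-shot matching: `caption_vecs` maps each
    caption string to its text embedding.
    """
    scores = {
        caption: cosine_similarity(image_vec, vec)
        for caption, vec in caption_vecs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors, NOT real model output.
image_vec = [0.6, 0.8, 0.0, 0.0]
captions = {
    "a photo of a cat": [0.8, 0.6, 0.0, 0.0],
    "a photo of a dog": [0.0, 0.0, 1.0, 0.0],
}
ranking = rank_captions(image_vec, captions)
```

For zero-shot classification, the caption with the highest score is taken as the predicted label.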

Notes

All output embeddings are L2-normalized.
This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.
