CLIP ViT-B/16 Vision Encoder (GGUF)

GGUF conversion of openai/clip-vit-base-patch16 for use with CrispEmbed.

  • Architecture: CLIP ViT-B/16 vision encoder
  • Parameters: 86.2M
  • Output: 512-dimensional L2-normalized embeddings (768d internal, projected to 512d via visual_projection)
  • Input: 224x224 RGB image with CLIP normalization
  • Size: ~329 MB
  • Source: openai/clip-vit-base-patch16
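
As an illustration of what "CLIP normalization" in the Input line means, here is a minimal Python sketch of the per-channel step. The mean and standard deviation values are the ones published with the original OpenAI CLIP preprocessing config; the function name is hypothetical, and this card does not say how CrispEmbed implements preprocessing internally, so treat this as an explanation and not as CrispEmbed's code. Resizing and cropping to 224x224 are not shown.

```python
# Per-channel RGB statistics from the OpenAI CLIP preprocessing config.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map one 8-bit (R, G, B) pixel to CLIP-normalized floats.

    Hypothetical helper for illustration: scale each channel to [0, 1],
    then subtract the channel mean and divide by the channel std.
    """
    if len(rgb) != 3:
        raise ValueError("expected an (R, G, B) triple")
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


if __name__ == "__main__":
    # A mid-gray pixel lands close to zero in every channel.
    print(normalize_pixel((128, 128, 128)))
```

When using the crispembed commands under Usage, this step should not need to be done by hand; the sketch only matters if you prepare tensors yourself and want to compare against another CLIP implementation.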

Usage

# Embed a single image
crispembed -m clip-vit-base-patch16 --image photo.jpg

# Batch processing
crispembed -m clip-vit-base-patch16 --image-dir ./photos/ --output embeddings.bin

Cross-modal pairing

This model shares an embedding space with cstr/clip-text-base-GGUF. Use both to perform zero-shot image-text matching:

crispembed -m clip-vit-base-patch16 --image photo.jpg        # vision embedding
crispembed -m clip-text-base "a photo of a cat"               # text embedding

Cosine similarity between the two outputs measures image-text alignment.
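
The comparison can be sketched in a few lines of Python. This assumes you have already loaded one image embedding and one or more text embeddings as plain lists of 512 floats; how to parse CrispEmbed's output is not covered here, and the function names are hypothetical. Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, but the sketch re-normalizes defensively.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def rank_captions(image_emb, caption_embs):
    """Sort captions by similarity to the image, best match first.

    caption_embs maps each caption string to its text embedding.
    """
    scored = [
        (caption, cosine_similarity(image_emb, emb))
        for caption, emb in caption_embs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For zero-shot matching, embed each candidate caption with the text model, pass the results to rank_captions together with the image embedding, and take the first entry as the predicted label.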

Notes

  • All output embeddings are L2-normalized.
  • This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.
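
A quick back-of-the-envelope check, offered as a sketch: the card does not state the tensor type, so the 4 bytes per weight below is an assumption (32-bit floats), and the parameter count is taken as roughly 86.2 million. Under those assumptions the weights alone come to about 329 MiB, which matches the listed size and is consistent with an unquantized conversion.

```python
PARAMS = 86.2e6        # approximate parameter count
BYTES_PER_PARAM = 4    # assumption: weights stored as 32-bit floats

size_bytes = PARAMS * BYTES_PER_PARAM
size_mib = size_bytes / 2**20   # binary megabytes
size_mb = size_bytes / 1e6      # decimal megabytes

print(f"{size_mib:.1f} MiB, {size_mb:.1f} MB")
```

With 16-bit weights the same count would give roughly half that, so the listed ~329 MB is most naturally read as full-precision weights plus a small amount of metadata.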