CLIP ViT-L/14 Vision Encoder (GGUF)

GGUF conversion of openai/clip-vit-large-patch14 for use with CrispEmbed.

  • Architecture: CLIP ViT-L/14 vision encoder
  • Parameters: 304M
  • Output: 768-dimensional L2-normalized embeddings (1024d internal, projected to 768d)
  • Input: 224x224 RGB image with CLIP normalization
  • Size: ~1.2 GB
  • Source: openai/clip-vit-large-patch14
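
For readers preparing images themselves, the "CLIP normalization" named in the Input line can be sketched as follows. The mean and standard deviation values are the per-channel constants from the original OpenAI CLIP preprocessing; the helper names are hypothetical and are not part of CrispEmbed, which presumably applies this step internally. The second helper just works out the token count implied by a 224x224 input and 14x14 patches.

```python
# Per-channel statistics from the original OpenAI CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def clip_normalize_pixel(rgb):
    """Normalize one RGB pixel given as 0-255 values (hypothetical helper).

    Each channel is scaled to [0, 1], then standardized with the CLIP
    mean and standard deviation for that channel.
    """
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def patch_token_count(image_size=224, patch_size=14):
    """Tokens seen by the ViT: one per image patch plus one class token."""
    per_side = image_size // patch_size
    return per_side * per_side + 1


if __name__ == "__main__":
    print(clip_normalize_pixel((255, 255, 255)))
    print(patch_token_count())
```

This only illustrates the arithmetic; resizing and cropping to 224x224 are assumed to have happened before normalization.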

Usage

# Embed a single image
crispembed -m clip-vit-large-patch14 --image photo.jpg

# Batch processing
crispembed -m clip-vit-large-patch14 --image-dir ./photos/ --output embeddings.bin

Cross-modal pairing

This encoder shares an embedding space with cstr/clip-text-large-GGUF, so image embeddings can be compared directly against text embeddings for zero-shot image-text matching.
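
Because both encoders emit L2-normalized vectors, the cosine similarity between an image and a caption reduces to a dot product. The sketch below shows one way zero-shot matching could be scored once embeddings from both models are available as lists of floats. It uses toy 3-dimensional vectors in place of real 768-dimensional ones, and the `logit_scale` of 100 is an assumption based on the scaling used by the original CLIP models, not a value read from this GGUF file. All function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (the encoders already do this)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts.

    Inputs are assumed to be L2-normalized, so dot() is the cosine similarity.
    logit_scale is an assumed value, see the note above.
    """
    logits = [logit_scale * dot(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy stand-ins for real embeddings.
    image = l2_normalize([1.0, 2.0, 2.0])
    captions = [l2_normalize([1.0, 2.0, 2.0]), l2_normalize([2.0, -1.0, 0.0])]
    probs = zero_shot_probs(image, captions)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(best, probs)
```

The caption with the highest probability is the predicted match; for retrieval over many images, the raw dot products can be ranked directly without the softmax.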

Notes

  • All output embeddings are L2-normalized.
  • This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.