CLIP ViT-L/14 Vision Encoder (GGUF)

GGUF conversion of openai/clip-vit-large-patch14 for use with CrispEmbed.

Architecture: CLIP ViT-L/14 vision encoder
Parameters: 304M
Output: 768-dimensional L2-normalized embeddings (1024d internal, projected to 768d)
Input: 224x224 RGB image with CLIP normalization
Size: ~1.2 GB
Source: openai/clip-vit-large-patch14
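
For readers preparing inputs outside CrispEmbed, here is a minimal sketch of what "CLIP normalization" usually means. The per-channel mean and standard deviation are the values published with the original OpenAI CLIP preprocessing. The reference pipeline also resizes the shorter side to 224 with bicubic interpolation and center-crops to 224x224 before this step. Whether CrispEmbed applies all of this internally is not stated here, so treat the helper names below as hypothetical illustrations rather than part of the tool.

```python
# CLIP per-channel statistics in RGB order (from the original OpenAI CLIP preprocessing).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map one 8-bit (r, g, b) pixel to CLIP-normalized floats."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def normalize_image(rows):
    """Convert an H x W grid of (r, g, b) tuples to channel-first floats (3 x H x W).

    Hypothetical helper: assumes the image is already resized and cropped to 224x224.
    """
    channels = [[], [], []]
    for row in rows:
        normalized = [normalize_pixel(px) for px in row]
        for ch in range(3):
            channels[ch].append([px[ch] for px in normalized])
    return channels


# Tiny 2x2 example image (a real input would be 224x224).
image = [[(0, 0, 0), (255, 255, 255)],
         [(128, 64, 32), (10, 20, 30)]]
tensor = normalize_image(image)
```

The result is a nested list with one plane per colour channel, which is the layout ViT-style encoders typically expect.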

Usage

# Embed a single image
crispembed -m clip-vit-large-patch14 --image photo.jpg

# Batch processing
crispembed -m clip-vit-large-patch14 --image-dir ./photos/ --output embeddings.bin

Cross-modal pairing

This encoder shares an embedding space with cstr/clip-text-large-GGUF, so image and text embeddings can be compared directly for zero-shot image-text matching.
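
Zero-shot matching over a shared embedding space can be sketched as follows. Because the embeddings are L2-normalized, the dot product equals cosine similarity. The function name, the toy 3-dimensional vectors, and the `logit_scale` default of 100 (the temperature commonly associated with CLIP) are assumptions for illustration. Real embeddings from these models are 768-dimensional, and how you load them depends on your tooling.

```python
import math


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts.

    Assumes all vectors are already L2-normalized. logit_scale is an assumed
    temperature, not a value read from the model files.
    """
    logits = [logit_scale * sum(x * y for x, y in zip(image_emb, t)) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy unit vectors standing in for real embeddings.
image_emb = [1.0, 0.0, 0.0]
text_embs = [
    [0.8, 0.6, 0.0],  # e.g. "a photo of a cat" (hypothetical label)
    [0.0, 1.0, 0.0],  # e.g. "a photo of a dog" (hypothetical label)
]
probs = zero_shot_probs(image_emb, text_embs)
best = max(range(len(probs)), key=probs.__getitem__)
```

The index `best` picks the text whose embedding is closest to the image embedding.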

Notes

All output embeddings are L2-normalized.
This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.
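
As a small illustration of the L2 normalization mentioned above, the sketch below scales a vector to unit length. It is useful if you need to normalize embeddings from another source before comparing them with these outputs. The function name and the `eps` guard are illustrative choices, not part of CrispEmbed.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


unit = l2_normalize([3.0, 4.0])
length = math.sqrt(sum(x * x for x in unit))
```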
