CLIP Text Encoder Base (GGUF)

A GGUF conversion of the CLIP base text encoder, extracted from openai/clip-vit-base-patch16, for use with CrispEmbed.

  • Architecture: CLIP text transformer with causal attention
  • Parameters: 63.4M
  • Output: 512-dimensional L2-normalized embeddings
  • Tokenizer: BPE tokenizer (embedded in GGUF), max 77 tokens
  • Size: ~244 MB
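
As a quick sanity check on the output contract listed above (512 dimensions, unit L2 norm), the following minimal Python sketch may help. The helper names `l2_normalize` and `looks_like_clip_embedding` and the tolerance value are illustrative assumptions, not part of CrispEmbed.

```python
import math

EMBED_DIM = 512  # output size stated in this model card


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def looks_like_clip_embedding(vec, tol=1e-3):
    """Return True if vec has 512 entries and an L2 norm within tol of 1.

    The tolerance is an assumption; float32 round-off is usually far smaller.
    """
    if len(vec) != EMBED_DIM:
        return False
    norm = math.sqrt(sum(x * x for x in vec))
    return abs(norm - 1.0) <= tol
```

A vector that fails this check was probably read with the wrong dimension or data type.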

Usage

# Embed a single text
crispembed -m clip-text-base "a photo of a cat"

# Embed from file
crispembed -m clip-text-base --input queries.txt --output embeddings.bin
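
The layout of embeddings.bin is not described here. The sketch below assumes, purely for illustration, that the file is a headerless sequence of little-endian float32 values with 512 values per input line. Check the CrispEmbed documentation before relying on this; `read_embeddings` is a hypothetical helper.

```python
import struct

EMBED_DIM = 512  # output size stated in this model card


def read_embeddings(path, dim=EMBED_DIM):
    """Read a flat float32 file into a list of dim-length vectors.

    ASSUMPTION: headerless, little-endian float32, one vector per input line.
    The real embeddings.bin format may differ.
    """
    with open(path, "rb") as f:
        data = f.read()
    row_bytes = dim * 4  # 4 bytes per float32
    if len(data) % row_bytes != 0:
        raise ValueError("file size is not a multiple of one embedding row")
    count = len(data) // row_bytes
    return [
        list(struct.unpack_from(f"<{dim}f", data, i * row_bytes))
        for i in range(count)
    ]
```

The size check catches the most likely mismatch early: a header or a different element type would leave the byte count off by a non-multiple of 2048.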

Cross-modal pairing

Output embeddings live in the same vector space as the image embeddings from cstr/clip-vit-base-patch16-GGUF. Use the two models together for zero-shot image-text retrieval:

crispembed -m clip-text-base "a photo of a cat"               # text embedding
crispembed -m clip-vit-base-patch16 --image photo.jpg          # vision embedding
# cosine similarity measures image-text alignment
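
Because both encoders emit L2-normalized vectors, cosine similarity reduces to a plain dot product. The following Python sketch ranks images against one text query under that assumption. The function names and the toy 2-dimensional vectors are illustrative only; real embeddings have 512 dimensions.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both inputs have unit norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def rank_images(text_vec, image_vecs):
    """Return (name, score) pairs sorted from best to worst match.

    image_vecs maps an image name to its L2-normalized embedding.
    """
    scores = [(name, dot(text_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy unit vectors standing in for real embeddings.
query = [1.0, 0.0]
images = {"cat.jpg": [0.6, 0.8], "dog.jpg": [0.0, 1.0]}
ranked = rank_images(query, images)  # cat.jpg ranks first
```

If the vectors were not normalized, each dot product would need to be divided by the product of the two norms.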

Notes

  • All output embeddings are L2-normalized.
  • BPE tokenizer is bundled inside the GGUF file; no external vocab files needed.
  • This is a GGUF conversion; weights are numerically equivalent to the original Hugging Face model.