clip-text-msmarco-inversion

A vec2text inversion (hypothesizer) model that reconstructs text from the embeddings produced by the CLIP ViT-L/14 text encoder (the text encoder used by Stable Diffusion v1.5, identical to openai/clip-vit-large-patch14).

It is the first-stage model: given a CLIP text embedding, it produces an initial text hypothesis. Pair it with the corrector model Afrostnova/clip-text-msmarco-corrector to iteratively refine that hypothesis.

  • Base architecture: vec2text InversionModel (T5-based encoder–decoder)
  • Embedder: CLIPTextModelopenai/clip-vit-large-patch14
  • Training data: MS MARCO
  • Embedding transform: repeat (num_repeat_tokens=16)

⚠️ Requirements

This model cannot be loaded with the upstream vec2text package (pip install vec2text), because upstream does not support a CLIPTextModel embedder. You need the fork that adds CLIP support (the one this model was trained with). With upstream vec2text, from_pretrained fails because the CLIPTextModel embedder is unknown.

Usage

import vec2text

inv = vec2text.models.InversionModel.from_pretrained("Afrostnova/clip-text-msmarco-inversion")
cor = vec2text.models.CorrectorEncoderModel.from_pretrained("Afrostnova/clip-text-msmarco-corrector")
corrector = vec2text.load_corrector(inv, cor)

# `embeddings` must be supplied by you: the CLIP text encoder's last_hidden_state,
# pooled at the EOS position, one row per input text.
text = vec2text.invert_embeddings(
    embeddings=embeddings,   # (batch, hidden_dim) on the same device as the model
    corrector=corrector,
    num_steps=20,
    sequence_beam_width=1,
)
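
The snippet above leaves `embeddings` undefined. A minimal sketch of one way to compute it with the Hugging Face transformers library follows. The helper name `embed_texts` and its arguments are illustrative and not part of vec2text. It assumes the fork expects the same quantity that `CLIPTextModel` exposes as `pooler_output` (the final hidden state taken at each sequence's EOS token), which matches the description in the comment above but has not been checked against the fork's code.

```python
def embed_texts(texts, model_name="openai/clip-vit-large-patch14", device="cpu"):
    """Return CLIP text embeddings pooled at the EOS position, shape (batch, hidden_dim).

    Illustrative helper, not part of vec2text.
    """
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies or trigger any model download.
    import torch
    from transformers import CLIPTextModel, CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained(model_name)
    encoder = CLIPTextModel.from_pretrained(model_name).to(device).eval()

    # Pad to the longest text in the batch and truncate to the tokenizer's limit.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = encoder(**batch)

    # pooler_output is last_hidden_state selected at each sequence's EOS token.
    return outputs.pooler_output


# Example call (downloads the encoder on first use):
# embeddings = embed_texts(["a photo of a cat"], device="cuda")
```

Pass the same `device` that the inversion and corrector models live on, since `invert_embeddings` expects the embeddings on the model's device.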

The CLIP text encoder is loaded automatically when the inversion model is instantiated. On a clean machine it is fetched from openai/clip-vit-large-patch14; to use a local copy instead, set the CLIP_TEXT_ENCODER environment variable.
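
As a rough sketch, the override can be set from Python before the model is created. The path below is a placeholder. Whether the fork accepts a local directory, a Hub ID, or both, and exactly when it reads the variable, are assumptions here, so the sketch sets it before vec2text is imported to be safe.

```python
import os

# Placeholder path: replace with the location of your local copy of the encoder.
LOCAL_CLIP_DIR = "/path/to/clip-vit-large-patch14"

# Set the override before vec2text is imported and the inversion model is built.
os.environ["CLIP_TEXT_ENCODER"] = LOCAL_CLIP_DIR

# Afterwards, load the model as usual:
# import vec2text
# inv = vec2text.models.InversionModel.from_pretrained("Afrostnova/clip-text-msmarco-inversion")
```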

Safetensors checkpoint: model size 0.4B params, tensor type F32