Edit model card


Like it's image-only counterpart, CLIP, VideoCLIP enables you to compute a single embedding for videos that can be used to compute similarity with text.

VideoCLIP uses a Video Q-Former to aggregate frame-level embeddings temporally into a single embedding, maintaining relevance of the underlying content. The resulting embedding is then trained with contrastive loss + captioning loss to match it's corresponding text.


Link to github to run the model: link.

# Load model.
import video_clip
eval_config = 'eval_configs/video_clip.yaml'
model, vis_processor = video_clip.load_model(eval_config)

# Compute video embeddings.
# video_embs: float matrix of size [num_videos, clip_dim_size, query_tokens] containing VideoCLIP embeddings.
# In this model, clip_dim_size=1024 and query_tokens=32.
video_embs = video_clip.get_all_video_embeddings(videos, model, vis_processor)

# Compute Video-Text similarity.
# v2t_sim: float matrix of size [num_videos, num_texts] indicating similarity.
v2t_sim = video_clip.compute_sim(model, texts, video_embs)

# Compute Text-Video similarity.
# t2v_sim: float matrix of size [num_texts, num_videos] indicating similarity.
t2v_sim = v2t_sim.T

# Compute Video-Video distance.
# v2v_dists: float vector of size [1, num_videos] indicating distance to query video embedding.
v2v_dists = video_clip.compute_dist_videoq(model, video_embs[0], video_embs)

For a more detailed demo of how to use the model, see the colab.

Downloads last month


Downloads are not tracked for this model. How to track
Unable to determine this model's library. Check the docs .