
AskVideos-VideoCLIPv0.2

Like its image-only counterpart, CLIP, VideoCLIP enables you to compute a single embedding for a video that can be used to compute similarity with text.

VideoCLIP uses a Video Q-Former to temporally aggregate frame-level embeddings into a single embedding while maintaining the relevance of the underlying content. The resulting embedding is trained with a contrastive loss plus a captioning loss to match its corresponding text.
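
As a rough illustration of the aggregation step, the sketch below shows learnable query tokens cross-attending over per-frame embeddings to produce a fixed number of video-level embeddings, regardless of how many frames are sampled. This is a simplified stand-in rather than the model's actual Q-Former; the frame-encoder dimension, head count, and single attention layer are assumptions for illustration only.

# Simplified sketch of Q-Former-style temporal aggregation (not the actual model code).
import torch
import torch.nn as nn

num_frames, frame_dim = 16, 768       # hypothetical frame-encoder output size
num_queries, embed_dim = 32, 1024     # 32 query tokens, 1024-dim embeddings as in v0.2

frame_embs = torch.randn(1, num_frames, frame_dim)                   # [batch, frames, frame_dim]
frame_proj = nn.Linear(frame_dim, embed_dim)                         # project frames to embed_dim
query_tokens = nn.Parameter(torch.randn(1, num_queries, embed_dim))  # learnable queries

# Query tokens attend over the projected frame embeddings.
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
frames = frame_proj(frame_embs)
video_tokens, _ = cross_attn(query_tokens, frames, frames)
print(video_tokens.shape)  # torch.Size([1, 32, 1024])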

This is the latest version of the VideoCLIP model, trained on more diverse, higher-quality data. Compared to v0.1, this model performs better across a broader distribution of data and on long-range retrieval tasks.

In addition, the model incorporates a few architectural changes, including a larger Q-Former and a larger embedding dimension (256 -> 1024).

Usage

Link to the GitHub repository for running the model: link.

# Load model.
import video_clip
eval_config = 'eval_configs/video_clip.yaml'
model, vis_processor = video_clip.load_model(eval_config)

# Compute video embeddings.
# video_embs: float matrix of size [num_videos, clip_dim_size, query_tokens] containing VideoCLIP embeddings.
# In this model, clip_dim_size=1024 and query_tokens=32.
video_embs = video_clip.get_all_video_embeddings(videos, model, vis_processor)
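
# Sanity check on the embedding shape (assumes video_embs is a NumPy array or
# torch tensor with a .shape attribute; the exact return type depends on the repository).
print(video_embs.shape)  # expected: (num_videos, 1024, 32)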

# Compute Video-Text similarity.
# v2t_sim: float matrix of size [num_videos, num_texts] indicating similarity.
v2t_sim = video_clip.compute_sim(model, texts, video_embs)

# Compute Text-Video similarity.
# t2v_sim: float matrix of size [num_texts, num_videos] indicating similarity.
t2v_sim = v2t_sim.T
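
# Example: rank videos for each text query using the similarity matrix
# (assumes t2v_sim is a NumPy array; convert with .cpu().numpy() first if it is a torch tensor).
import numpy as np
top_k = 5
ranked_videos = np.argsort(-t2v_sim, axis=1)[:, :top_k]  # [num_texts, top_k], best match first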

# Compute Video-Video distance.
# v2v_dists: float vector of size [1, num_videos] indicating distance to query video embedding.
v2v_dists = video_clip.compute_dist_videoq(model, video_embs[0], video_embs)
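
# Example: nearest-neighbor video search from the distances above
# (smaller distance = more similar; assumes v2v_dists is a NumPy array).
neighbor_order = np.argsort(v2v_dists.ravel())  # video indices, closest first
print(neighbor_order[:5])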

For a more detailed demo of how to use the model, see the Colab notebook.
