---
tags:
- vision
- clip
- clip4clip
- video
- retrieval
pipeline_tag: text-to-video
---

# Model Card
## Details
This model was trained with CLIP4Clip, a video retrieval method built on the CLIP framework, as described in the [paper](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).

Training used 150,000 videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large collection of short videos with corresponding textual descriptions sourced from the web.

We then adapted the weights of the trained CLIP model to the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, with some modifications to the final layers.

### Use with Transformers

```python
import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection


search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)
# With return_dict=False the model returns a tuple; index 1 is the last hidden state.
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)

# Special projection and changing last layers:
# project every token with the raw projection matrix, then keep the end-of-text token
# (the position holding the highest token id in each sequence).
text_projection = model.state_dict()['text_projection.weight']
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]

# Normalizing the embeddings:
final_output = final_output / final_output.norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
# The rows are already unit length here, so this division leaves them essentially unchanged.
sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
print("sequence_output: ", sequence_output)
```
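
Once a text embedding is available, retrieval reduces to ranking precomputed video embeddings by cosine similarity. The following is a minimal sketch of that step, not the code used by this model's demo: `rank_videos` is a hypothetical helper, and the random `video_embeddings` array is a stand-in for real 512-dimensional video embeddings.

```python
import numpy as np


def rank_videos(text_embedding, video_embeddings, top_k=5):
    """Return indices and scores of the top_k videos by cosine similarity.

    Hypothetical helper for illustration; not part of any library.
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    text = text_embedding / np.linalg.norm(text_embedding, axis=-1, keepdims=True)
    videos = video_embeddings / np.linalg.norm(video_embeddings, axis=-1, keepdims=True)
    scores = videos @ text.reshape(-1)
    # Sort in descending order of similarity and keep the best top_k.
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Stand-in data (assumption): 1000 random "video" vectors of dimension 512.
rng = np.random.default_rng(0)
video_embeddings = rng.normal(size=(1000, 512)).astype(np.float32)
# Fake query: a noisy copy of video 42, so that video should rank first.
text_embedding = video_embeddings[42] + 0.1 * rng.normal(size=512).astype(np.float32)

top_indices, top_scores = rank_videos(text_embedding, video_embeddings, top_k=5)
print(top_indices, top_scores)
```

In practice, `text_embedding` would be the `sequence_output` computed above, and `video_embeddings` would come from a video encoder that shares the same embedding space.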

## Model Use

### Intended Use

This model is intended for video retrieval. For an example, see this [**space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).

### Extra Information

For video embedding, there is an extra notebook that describes how to embed videos.
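
The notebook itself is not reproduced here. As a rough illustration only, the sketch below assumes the parameter-free mean-pooling variant described in the CLIP4Clip paper: per-frame embeddings are normalized, averaged over time, and normalized again. Whether the notebook follows exactly these steps is an assumption; `pool_frame_embeddings` is a hypothetical helper, and the random array stands in for real per-frame image embeddings.

```python
import numpy as np


def pool_frame_embeddings(frame_embeddings):
    """Collapse (num_frames, dim) frame embeddings into one unit-length video vector.

    Hypothetical helper: mean pooling over frames, assumed for illustration.
    """
    frames = np.asarray(frame_embeddings, dtype=np.float32)
    # Normalize each frame so that no single frame dominates the average.
    frames = frames / np.linalg.norm(frames, axis=-1, keepdims=True)
    # Average over the time axis, then renormalize the pooled vector.
    video = frames.mean(axis=0)
    return video / np.linalg.norm(video)


# Stand-in data (assumption): 12 sampled frames, 512-dimensional embeddings.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(12, 512))
video_embedding = pool_frame_embeddings(frame_embeddings)
print(video_embedding.shape)
```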



## Performance and Limitations

### Performance

We evaluated different models on the last 10k video clips of the WebVid dataset.

| Model | R1 | R5 | R10 | MedianR | MeanR |
|-------|----|----|-----|---------|-------|
| Zero-shot clip weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
| CLIP4Clip weights trained on msr-vtt | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
| CLIP4Clip trained on 150k Webvid (**This model**) | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
| Binarized CLIP4Clip trained on 150k Webvid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |

For more information about the evaluation, see this [notebook].
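
For readers unfamiliar with the column names, the sketch below shows how recall at K (R1, R5, R10), median rank, and mean rank are commonly computed from a text-to-video similarity matrix. It is an illustrative sketch, not the evaluation code from the notebook: `retrieval_metrics` is a hypothetical helper, the 4x4 matrix is made-up toy data, and it assumes that text `i` is paired with video `i` and that ranks start at 1.

```python
import numpy as np


def retrieval_metrics(similarity):
    """Compute retrieval metrics from similarity[i, j] = score(text i, video j).

    Hypothetical helper; assumes the correct video for text i is video i.
    """
    similarity = np.asarray(similarity, dtype=np.float64)
    # Score of the correct video for each text query, as a column vector.
    correct = np.diag(similarity)[:, None]
    # 1-based rank: number of videos scoring strictly higher, plus one.
    ranks = (similarity > correct).sum(axis=1) + 1
    return {
        "R1": float(100.0 * np.mean(ranks <= 1)),
        "R5": float(100.0 * np.mean(ranks <= 5)),
        "R10": float(100.0 * np.mean(ranks <= 10)),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }


# Toy data (made up): the third query ranks its own video third, the rest first.
toy_similarity = [
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.8, 0.1, 0.2],
    [0.7, 0.6, 0.5, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
metrics = retrieval_metrics(toy_similarity)
print(metrics)
```

Higher is better for R1, R5, and R10 (percentage of queries whose correct video appears in the top K), while lower is better for MedianR and MeanR.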