---
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- cross-modal-video-summarization
- video-summarization
- video-captioning
pretty_name: VideoXum
size_categories:
- 10K<n<100K
---

# VTSUM-BLIP Model Card

## Model details

**Model type:**
VTSUM-BLIP is an end-to-end cross-modal video summarization model.

**Model description:**
- VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
- VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
- VT-CLIP for VT-CLIPScore metric: vt_clip.pth
- BLIP w/ ViT-B and CapFilt-L ([Download](https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth)): model_base_capfilt_large.pth

**The file structure of the model zoo looks like:**
```
outputs
β”œβ”€β”€ blip
β”‚   └── model_base_capfilt_large.pth
β”œβ”€β”€ vt_clipscore
β”‚   └── vt_clip.pth
β”œβ”€β”€ vtsum_tt
β”‚   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
```
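
The card does not prescribe a loading API, but as a minimal sketch, a downloaded checkpoint from the layout above could be inspected with PyTorch as follows. The path mirrors the model zoo tree; the assumption that weights sit under a `"model"` key is ours, not something this card guarantees.

```python
import torch

# Hypothetical sketch: load a VTSUM-BLIP checkpoint from the model zoo
# layout shown above and inspect it. The internal structure of the file
# (e.g., a "model" key wrapping the state dict) is an assumption.
ckpt_path = "outputs/vtsum_tt/vtsum_tt.pth"
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Many BLIP-style checkpoints wrap the weights in a dict under "model";
# fall back to the loaded object itself if that key is absent.
state_dict = checkpoint.get("model", checkpoint) if isinstance(checkpoint, dict) else checkpoint
print(f"Loaded {ckpt_path} with {len(state_dict)} top-level entries")
```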

**Paper or resources for more information:**
https://videoxum.github.io/



## Training dataset
- VideoXum *training* set: 8K long videos with 80K pairs of aligned video and text summaries.

## Evaluation dataset
- VideoXum *val* set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum *test* set: 4K long videos with 40K pairs of aligned video and text summaries.