jylins committed • Commit 363a181 • Parent(s): 048fdcf

add vt_clip

Files changed:
- README.md +6 -0
- vt_clip.pth +3 -0
README.md CHANGED

@@ -20,6 +20,12 @@ size_categories:
 **Model type:**
 VTSUM-BLIP is an end-to-end cross-modal video summarization model.
 
+**Model description:**
+- VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
+- VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
+- VT-CLIP for VT-CLIPScore metric: vt_clip.pth
+- BLIP w/ ViT-B and CapFilt-L ([Download](https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth)): model_base_capfilt_large.pth
+
 **Paper or resources for more information:**
 https://videoxum.github.io/
 
vt_clip.pth ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e17526e15b2f663b1d5a6c4451350ae5b9903d4c7b8dcbc58bac1910d6fe15a4
+size 1795741805
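Because vt_clip.pth is stored as a Git LFS pointer, the repository only holds the `oid` and `size` shown above; the actual checkpoint is fetched separately. A downloaded copy can be checked against those fields. This is a minimal sketch under that assumption; the `parse_lfs_pointer` and `verify_file` helpers are illustrative, not part of this repository:

```python
import hashlib

def parse_lfs_pointer(text: str) -> dict:
    """Parse a git-lfs spec v1 pointer file into key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

def verify_file(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a downloaded file's sha256 digest and byte count against the pointer."""
    algo, _, expected_hash = pointer["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    digest = hashlib.sha256()
    total = 0
    with open(path, "rb") as f:
        # Stream in chunks so a multi-GB checkpoint is not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            total += len(chunk)
    return total == int(pointer["size"]) and digest.hexdigest() == expected_hash

# Pointer contents copied verbatim from the vt_clip.pth file added in this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:e17526e15b2f663b1d5a6c4451350ae5b9903d4c7b8dcbc58bac1910d6fe15a4\n"
    "size 1795741805\n"
)
```

A mismatch in either field indicates a truncated or corrupted download rather than a usable checkpoint.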