jylins committed · Commit f7b4925 · 1 Parent(s): 363a181
update readme
README.md CHANGED
@@ -26,9 +26,24 @@ VTSUM-BLIP is an end-to-end cross-modal video summarization model.
|
 - VT-CLIP for VT-CLIPScore metric: vt_clip.pth
 - BLIP w/ ViT-B and CapFilt-L ([Download](https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth)): model_base_capfilt_large.pth
 
+**The file structure of the model zoo looks like:**
+```
+outputs
+├── blip
+│   └── model_base_capfilt_large.pth
+├── vt_clipscore
+│   └── vt_clip.pth
+├── vtsum_tt
+│   └── vtsum_tt.pth
+└── vtsum_tt_ca
+    └── vtsum_tt_ca.pth
+```
+
 **Paper or resources for more information:**
 https://videoxum.github.io/
 
+
+
 ## Training dataset
 - VideoXum *training* set: 8K long videos with 80K pairs of aligned video and text summaries.
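The layout added above can be scaffolded with a short script. This is a minimal sketch, not part of the repo: it creates the `outputs/` subfolders and fetches the one checkpoint the README links directly; `vt_clip.pth`, `vtsum_tt.pth`, and `vtsum_tt_ca.pth` have no download URLs in this section and must be placed manually.

```python
import os
import urllib.request

# BLIP w/ ViT-B and CapFilt-L -- the only checkpoint with a public
# URL in this README; the VT-CLIP and VTSUM-TT weights must be
# dropped into their folders by hand.
BLIP_URL = ("https://storage.googleapis.com/sfr-vision-language-research/"
            "BLIP/models/model_base_capfilt_large.pth")

# Create the model-zoo skeleton shown in the tree above.
for subdir in ("blip", "vt_clipscore", "vtsum_tt", "vtsum_tt_ca"):
    os.makedirs(os.path.join("outputs", subdir), exist_ok=True)

# Download the BLIP checkpoint into place if it is not already there.
dest = os.path.join("outputs", "blip", "model_base_capfilt_large.pth")
if not os.path.exists(dest):
    urllib.request.urlretrieve(BLIP_URL, dest)  # large download
```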