VTSUM-BLIP Model Card

Model details

Model type: VTSUM-BLIP is an end-to-end cross-modal video summarization model.

Model description:

  • VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
  • VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
  • VT-CLIP for VT-CLIPScore metric: vt_clip.pth
  • BLIP w/ ViT-B and CapFilt-L: model_base_capfilt_large.pth

The file structure of the model zoo looks like:

outputs
β”œβ”€β”€ blip
β”‚   └── model_base_capfilt_large.pth
β”œβ”€β”€ vt_clipscore
β”‚   └── vt_clip.pth
β”œβ”€β”€ vtsum_tt
β”‚   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
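
Each checkpoint is a standard PyTorch .pth file, so a quick way to verify a download against this layout is to load it on CPU and count its tensors. A minimal sketch, assuming PyTorch is installed and the files sit under outputs/ as shown above; whether the weights live at the top level or under a "model" key is an assumption, not something this card specifies:

import torch

# Paths follow the model-zoo layout above.
CKPT_PATHS = {
    "vtsum_tt": "outputs/vtsum_tt/vtsum_tt.pth",
    "vtsum_tt_ca": "outputs/vtsum_tt_ca/vtsum_tt_ca.pth",
    "vt_clip": "outputs/vt_clipscore/vt_clip.pth",
    "blip_base": "outputs/blip/model_base_capfilt_large.pth",
}

for name, path in CKPT_PATHS.items():
    # map_location="cpu" avoids needing a GPU just to inspect the file.
    ckpt = torch.load(path, map_location="cpu")
    # Weights may sit at the top level or under a "model" key (assumption).
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    print(f"{name}: {len(state_dict)} tensors")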

Paper or resources for more information: https://videoxum.github.io/

Training dataset

  • VideoXum training set: 8K long videos with 80K pairs of aligned video and text summaries.

Evaluation dataset

  • VideoXum val set: 2K long videos with 20K pairs of aligned video and text summaries.
  • VideoXum test set: 4K long videos with 40K pairs of aligned video and text summaries.
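
Each long video in VideoXum carries ten aligned video-text summary pairs, which is why the pair counts are ten times the video counts. A minimal Python sketch making that arithmetic explicit; the split sizes come from the lists above, and the ten-pairs-per-video ratio is inferred from the training-set numbers:

# Split sizes from the dataset description above.
splits = {"train": 8_000, "val": 2_000, "test": 4_000}
PAIRS_PER_VIDEO = 10  # inferred: 8K training videos -> 80K aligned pairs

for name, n_videos in splits.items():
    print(f"{name}: {n_videos:,} videos -> {n_videos * PAIRS_PER_VIDEO:,} summary pairs")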