eduardo-alvarez committed on
Commit
e9b8d28
1 Parent(s): 9351e9f

Update README.md

Files changed (1)
  1. README.md +4 -17
README.md CHANGED
@@ -13,6 +13,10 @@ library_name: transformers
 
  # TVP base model
 
+ The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
+ this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems. It was introduced in
+ [this paper](https://arxiv.org/pdf/2303.04995.pdf).
+
  | Model Detail | Description |
  | ----------- | ----------- |
  | Model Authors | Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding |
@@ -55,23 +59,6 @@ Unitary results: Refer to Table 2 in the provided paper for TVP's performance on
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63e1cfa7f9927d9455acdc72/WOeve3VDZU2WvoXfvoK5X.png)
 
 
- # TVP base model
-
- The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. The goal of
- this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems. It was introduced in
- [this paper](https://arxiv.org/pdf/2303.04995.pdf).
-
- TVP got accepted to [CVPR'23](https://cvpr2023.thecvf.com/) conference.
-
- ## Model description
-
- The abstract from the paper is the following:
- In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call ‘prompts’) into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.
-
- ## Intended uses & limitations(TODO)
-
- You can use the raw model for temporal video grounding.
-
  ### How to use
 
  Here is how to use this model to get the logits of a given video and text in PyTorch:
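The "How to use" snippet itself falls outside the diff context above. Below is a minimal sketch of that kind of usage with the TVP classes in `transformers`; the `Intel/tvp-base` checkpoint name, the 48-frame dummy clip, and the example query are assumptions rather than values taken from this card.

```python
import numpy as np
import torch
from transformers import AutoProcessor, TvpForVideoGrounding

# Assumed checkpoint name; substitute the checkpoint this card actually describes.
model_id = "Intel/tvp-base"

processor = AutoProcessor.from_pretrained(model_id)
model = TvpForVideoGrounding.from_pretrained(model_id)

# Dummy clip of 48 frames as (H, W, 3) uint8 arrays; in practice these would be
# frames sampled uniformly from a decoded video (e.g. via PyAV or decord).
frames = list(np.random.randint(0, 256, (48, 360, 640, 3), dtype=np.uint8))

# Example text query describing the moment to localize in the video.
text = "a person is sitting on a bed"

inputs = processor(text=[text], videos=frames, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# For video grounding, the logits hold the predicted start/end of the matching
# moment, expressed as fractions of the clip duration.
print(outputs.logits)
```

With a real video, the predicted start/end fractions in `outputs.logits` would be multiplied by the clip duration to recover timestamps in seconds.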