---
license: ecl-2.0
---
**VD-IT model**

This is the pre-trained checkpoint for our paper [**Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation**](https://arxiv.org/abs/2403.12042).

We use a video diffusion model ([ModelScopeT2V](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)) as our base model, applying prompt tuning to adapt it as a visual backbone for downstream video understanding tasks.

### Model training

We first pre-train our model on Ref-COCO and then fine-tune it on Ref-YouTube-VOS. Training runs on two NVIDIA A100 GPUs, processing 5 frames per clip over 9 epochs. The initial learning rate is 5e-5 and is reduced by a factor of 10 at the 6th and 8th epochs.
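As a rough illustration only, the step schedule described above (initial learning rate 5e-5, decayed 10x at epochs 6 and 8 over 9 epochs) maps onto PyTorch's `MultiStepLR`. The model and optimizer below are placeholders, not the actual VD-IT training code:

```python
import torch

# Placeholder model and optimizer; the real setup trains the
# prompt-tuned video diffusion backbone described above.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Reduce the LR by a factor of 10 at the 6th and 8th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[6, 8], gamma=0.1
)

for epoch in range(9):
    # ... one epoch over 5-frame clips would go here ...
    optimizer.step()
    scheduler.step()
```

After the final epoch the learning rate has passed both milestones, ending at roughly 5e-7.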