Add files using upload-large-folder tool
- README.md +1 -19
- SimToken_Setup_Upload_Download_Guide.md +182 -0
- TubeToken_Experiment_Plan_v4_Final.md +1634 -0
- TubeToken_Phase0_Experiment_Log.md +284 -0
- __pycache__/load_model.cpython-312.pyc +0 -0
- load_model.py +33 -31
- runs/tubetoken_phase0/eval_stride8_n64_bidir/report.md +12 -0
- runs/tubetoken_phase0/eval_stride8_n64_bidir/sample_metrics.csv +0 -0
- runs/tubetoken_phase0/eval_stride8_n64_bidir/summary.json +132 -0
- runs/tubetoken_phase0/proposals_stride8_n64_bidir/manifest.json +0 -0
- upload.log +0 -0
README.md
CHANGED
|
@@ -23,41 +23,30 @@ Download the official Ref-AVSBench dataset from [here](https://github.com/GeWu-L
|
|
| 23 |
|
| 24 |
### Pretrained Backbones
|
| 25 |
Download the sam_vit_h_4b8939.pth and put it in ```./models/segment_anything```
|
| 26 |
-
|
| 27 |
### Checkpoints
|
| 28 |
Download our pretrained **[Simtoken](https://drive.google.com/file/d/1pargYfFy93rymCANuWV0nt6Lx3Ri406l/view?usp=sharing)**.
|
| 29 |
-
|
| 30 |
### Core Requirements
|
| 31 |
This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.
|
| 32 |
- `numpy`, `pandas`, `matplotlib`, `opencv`
|
| 33 |
- `einops`, `timm`
|
| 34 |
- `sentencepiece`
|
| 35 |
- `transformers`, `peft`
|
| 36 |
-
|
| 37 |
Newer versions of `transformers` and `peft` may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., custom model/config registration).
|
| 38 |
To avoid such compatibility issues, we recommend **not using overly recent versions** and pinning the two packages to the versions used during our development:
|
| 39 |
-
|
| 40 |
- `transformers==4.30.2`
|
| 41 |
- `peft==0.2.0`
|
| 42 |
-
|
| 43 |
We also provide a complete requirements.txt for reference and easier reproduction:
|
| 44 |
```
|
| 45 |
pip install -r requirements.txt
|
| 46 |
```
|
| 47 |
-
|
| 48 |
-
|
| 49 |
---
|
| 50 |
## 📌 Getting Started
|
| 51 |
-
|
| 52 |
### Preparation
|
| 53 |
We recommend running the following code to pre-extract audio features and visual features compatible with SAM:
|
| 54 |
```
|
| 55 |
python save_audio_feats.py --data_dir 'path/to/data'
|
| 56 |
python save_sam_feats.py --data_dir 'path/to/data'
|
| 57 |
```
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
### Train
|
| 62 |
To train our model on Ref-AVS Bench:
|
| 63 |
```
|
|
@@ -68,7 +57,6 @@ python -W ignore train.py --name 'xxx' \
|
|
| 68 |
--data_dir 'path/to/data'\
|
| 69 |
--log_root 'path/to/log_root'\
|
| 70 |
--checkpoint_root 'path/to/checkpoints_root'
|
| 71 |
-
|
| 72 |
```
|
| 73 |
### Test
|
| 74 |
To test our pretrained simtoken:
|
|
@@ -79,10 +67,4 @@ python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \
|
|
| 79 |
--mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
|
| 80 |
--data_dir 'path/to/data' \
|
| 81 |
--visualization_root 'path/to/visualization_root'
|
| 82 |
-
|
| 83 |
-
```
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
|
|
|
| 23 |
|
| 24 |
### Pretrained Backbones
|
| 25 |
Download the sam_vit_h_4b8939.pth and put it in ```./models/segment_anything```
|
|
|
|
| 26 |
### Checkpoints
|
| 27 |
Download our pretrained **[Simtoken](https://drive.google.com/file/d/1pargYfFy93rymCANuWV0nt6Lx3Ri406l/view?usp=sharing)**.
|
|
|
|
| 28 |
### Core Requirements
|
| 29 |
This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.
|
| 30 |
- `numpy`, `pandas`, `matplotlib`, `opencv`
|
| 31 |
- `einops`, `timm`
|
| 32 |
- `sentencepiece`
|
| 33 |
- `transformers`, `peft`
|
|
|
|
| 34 |
Newer versions of `transformers` and `peft` may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., custom model/config registration).
|
| 35 |
To avoid such compatibility issues, we recommend **not using overly recent versions** and pinning the two packages to the versions used during our development:
|
|
|
|
| 36 |
- `transformers==4.30.2`
|
| 37 |
- `peft==0.2.0`
|
|
|
|
| 38 |
We also provide a complete requirements.txt for reference and easier reproduction:
|
| 39 |
```
|
| 40 |
pip install -r requirements.txt
|
| 41 |
```
|
|
|
|
|
|
|
| 42 |
---
|
| 43 |
## 📌 Getting Started
|
|
|
|
| 44 |
### Preparation
|
| 45 |
We recommend running the following code to pre-extract audio features and visual features compatible with SAM:
|
| 46 |
```
|
| 47 |
python save_audio_feats.py --data_dir 'path/to/data'
|
| 48 |
python save_sam_feats.py --data_dir 'path/to/data'
|
| 49 |
```
|
|
|
|
|
|
|
|
|
|
| 50 |
### Train
|
| 51 |
To train our model on Ref-AVS Bench:
|
| 52 |
```
|
|
|
|
| 57 |
--data_dir 'path/to/data'\
|
| 58 |
--log_root 'path/to/log_root'\
|
| 59 |
--checkpoint_root 'path/to/checkpoints_root'
|
|
|
|
| 60 |
```
|
| 61 |
### Test
|
| 62 |
To test our pretrained simtoken:
|
|
|
|
| 67 |
--mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
|
| 68 |
--data_dir 'path/to/data' \
|
| 69 |
--visualization_root 'path/to/visualization_root'
|
| 70 |
+
```
SimToken_Setup_Upload_Download_Guide.md
ADDED
|
@@ -0,0 +1,182 @@
|
| 1 |
+
# SimToken Setup, Data, Upload, and Download Guide
|
| 2 |
+
|
| 3 |
+
This guide is for moving the SimToken workspace between rented servers.
|
| 4 |
+
|
| 5 |
+
Assumed paths:
|
| 6 |
+
|
| 7 |
+
```bash
|
| 8 |
+
PROJECT_ROOT=/workspace/SimToken
|
| 9 |
+
SAM2_ROOT=/workspace/sam2
|
| 10 |
+
HF_REPO=yfan07/SimToken
|
| 11 |
+
```
|
| 12 |
+
|
| 13 |
+
## 1. Environment Setup
|
| 14 |
+
|
| 15 |
+
```bash
|
| 16 |
+
conda create -n simtoken python=3.10 -y
|
| 17 |
+
conda activate simtoken
|
| 18 |
+
|
| 19 |
+
conda install -c conda-forge ffmpeg libsndfile git git-lfs wget -y
|
| 20 |
+
git lfs install
|
| 21 |
+
|
| 22 |
+
pip install --upgrade pip setuptools wheel
|
| 23 |
+
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
|
| 24 |
+
```
|
| 25 |
+
|
| 26 |
+
If CUDA 12.6 wheels are unavailable, use CUDA 12.1 wheels:
|
| 27 |
+
|
| 28 |
+
```bash
|
| 29 |
+
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
Install SimToken dependencies:
|
| 33 |
+
|
| 34 |
+
```bash
|
| 35 |
+
pip install \
|
| 36 |
+
numpy pandas matplotlib opencv-python pillow tqdm einops timm sentencepiece \
|
| 37 |
+
transformers==4.30.2 peft==0.2.0 accelerate safetensors huggingface-hub \
|
| 38 |
+
packaging regex requests psutil gdown
|
| 39 |
+
```
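The two pins in the install command above can be checked after installation. The following is a minimal sketch, assuming only the Python standard library; the helper name `check_pins` is hypothetical and not part of the repository.

```python
from importlib import metadata

# Pins copied from the install command above.
PINNED = {"transformers": "4.30.2", "peft": "0.2.0"}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatched or missing package.

    `get_version` is injectable so the check can be exercised without the
    packages installed (hypothetical helper, not part of SimToken).
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means both packages match the pinned versions.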
|
| 40 |
+
|
| 41 |
+
Optional, only needed if regenerating audio features:
|
| 42 |
+
|
| 43 |
+
```bash
|
| 44 |
+
pip install towhee towhee.models
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
## 2. Repository Download
|
| 48 |
+
|
| 49 |
+
```bash
|
| 50 |
+
cd /workspace
|
| 51 |
+
huggingface-cli login
|
| 52 |
+
|
| 53 |
+
huggingface-cli download yfan07/SimToken \
|
| 54 |
+
--repo-type model \
|
| 55 |
+
--local-dir /workspace/SimToken \
|
| 56 |
+
--local-dir-use-symlinks False
|
| 57 |
+
```
|
| 58 |
+
|
| 59 |
+
## 3. Model Preparation
|
| 60 |
+
|
| 61 |
+
### SAM for SimToken
|
| 62 |
+
|
| 63 |
+
```bash
|
| 64 |
+
mkdir -p /workspace/SimToken/models/segment_anything
|
| 65 |
+
cd /workspace/SimToken/models/segment_anything
|
| 66 |
+
|
| 67 |
+
wget -O sam_vit_h_4b8939.pth \
|
| 68 |
+
https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
### SimToken Checkpoint
|
| 72 |
+
|
| 73 |
+
```bash
|
| 74 |
+
mkdir -p /workspace/SimToken/checkpoints
|
| 75 |
+
|
| 76 |
+
gdown 'https://drive.google.com/uc?id=1pargYfFy93rymCANuWV0nt6Lx3Ri406l' \
|
| 77 |
+
-O /workspace/SimToken/checkpoints/simtoken_pretrained.pth
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
### Hugging Face Models
|
| 81 |
+
|
| 82 |
+
```bash
|
| 83 |
+
mkdir -p /workspace/hf_models
|
| 84 |
+
|
| 85 |
+
huggingface-cli download openai/clip-vit-large-patch14 \
|
| 86 |
+
--local-dir /workspace/hf_models/clip-vit-large-patch14 \
|
| 87 |
+
--local-dir-use-symlinks False
|
| 88 |
+
|
| 89 |
+
huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5 \
|
| 90 |
+
--local-dir /workspace/hf_models/Chat-UniVi-7B-v1.5 \
|
| 91 |
+
--local-dir-use-symlinks False
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
### SAM2 for TubeToken Proposals
|
| 95 |
+
|
| 96 |
+
Put SAM2 under `/workspace/sam2`:
|
| 97 |
+
|
| 98 |
+
```bash
|
| 99 |
+
cd /workspace
|
| 100 |
+
git clone https://github.com/facebookresearch/sam2.git
|
| 101 |
+
cd /workspace/sam2
|
| 102 |
+
|
| 103 |
+
pip install -e .
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
Download SAM2.1 checkpoints:
|
| 107 |
+
|
| 108 |
+
```bash
|
| 109 |
+
cd /workspace/sam2/checkpoints
|
| 110 |
+
bash download_ckpts.sh
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
The TubeToken Phase 0 commands use:
|
| 114 |
+
|
| 115 |
+
```text
|
| 116 |
+
/workspace/sam2/checkpoints/sam2.1_hiera_large.pt
|
| 117 |
+
/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
## 4. Dataset Preparation
|
| 121 |
+
|
| 122 |
+
Runtime layout:
|
| 123 |
+
|
| 124 |
+
```text
|
| 125 |
+
/workspace/SimToken/data
|
| 126 |
+
metadata.csv
|
| 127 |
+
media/
|
| 128 |
+
gt_mask/
|
| 129 |
+
audio_embed/
|
| 130 |
+
image_embed/
|
| 131 |
+
```
|
| 132 |
+
|
| 133 |
+
Package the four data directories:
|
| 134 |
+
|
| 135 |
+
```bash
|
| 136 |
+
cd /workspace/SimToken/data
|
| 137 |
+
|
| 138 |
+
tar -cf media.tar media
|
| 139 |
+
tar -czf gt_mask.tar.gz gt_mask
|
| 140 |
+
tar -czf audio_embed.tar.gz audio_embed
|
| 141 |
+
tar -cf image_embed.tar image_embed
|
| 142 |
+
```
|
| 143 |
+
|
| 144 |
+
Restore the four data directories:
|
| 145 |
+
|
| 146 |
+
```bash
|
| 147 |
+
cd /workspace/SimToken/data
|
| 148 |
+
|
| 149 |
+
tar -xf media.tar
|
| 150 |
+
tar -xzf gt_mask.tar.gz
|
| 151 |
+
tar -xzf audio_embed.tar.gz
|
| 152 |
+
tar -xf image_embed.tar
|
| 153 |
+
```
|
| 154 |
+
|
| 155 |
+
## 5. Upload Repository
|
| 156 |
+
|
| 157 |
+
Use one full-directory upload command:
|
| 158 |
+
|
| 159 |
+
```bash
|
| 160 |
+
cd /workspace/SimToken
|
| 161 |
+
huggingface-cli login
|
| 162 |
+
|
| 163 |
+
huggingface-cli upload yfan07/SimToken . . \
|
| 164 |
+
--repo-type model \
|
| 165 |
+
2>&1 | tee upload.log
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
This uploads the entire `/workspace/SimToken` directory as it currently exists on disk.
|
| 169 |
+
|
| 170 |
+
## 6. Current Experiment Files to Preserve
|
| 171 |
+
|
| 172 |
+
Keep these files and directories for continuing TubeToken experiments:
|
| 173 |
+
|
| 174 |
+
```text
|
| 175 |
+
runs/tubetoken_phase_minus1/audit_full
|
| 176 |
+
runs/tubetoken_phase_minus1/simtoken_eval
|
| 177 |
+
runs/tubetoken_phase0/proposals_stride8_n64_bidir
|
| 178 |
+
runs/tubetoken_phase0/eval_stride8_n64_bidir
|
| 179 |
+
runs/tubetoken_phase0/miss_videos_r64.txt
|
| 180 |
+
TubeToken_Phase0_Experiment_Log.md
|
| 181 |
+
TubeToken_Experiment_Plan_v4_Final.md
|
| 182 |
+
```
|
TubeToken_Experiment_Plan_v4_Final.md
ADDED
|
@@ -0,0 +1,1634 @@
|
| 1 |
+
# TubeToken Experiment Plan v4 (Final / Experiment-Ready)
|
| 2 |
+
|
| 3 |
+
> Main line: **TubeToken** is the core framework. **Existence / Null modeling** and **Text-Audio Conditional Compression** are natural components of TubeToken, not external patches bolted onto SimToken.
|
| 4 |
+
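> v4 goal: finalize the plan before experiments start, building on v3 Reviewer-Revised. This means fixing the implementation of the matched-compute baseline, correcting the Phase 0 red-light conditions, making the H3 CosSim baseline precise, adding the gradient-conflict risk of multi-expression training, restructuring the main table and the fairness analysis table, and stating the proposal amortization efficiency in multi-expression scenarios.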
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## 0. Summary of Final v4 Changes
|
| 9 |
+
|
| 10 |
+
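This version is the final plan before experiments start. v3 already provided a complete framework for launching experiments; v4 only makes finalization-level refinements, focused on removing ambiguities that could later draw reviewer objections or force experimental rework.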
|
| 11 |
+
|
| 12 |
+
Compared with v3, v4 makes the following final changes:
|
| 13 |
+
|
| 14 |
+
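1. **Fix a single implementation of SimToken + matched compute**: the four candidate schemes are no longer kept; the plan commits to **SimToken + multiple keyframe prompting with the same number of keyframes as TubeToken-Fast**. Conceptually, this control is closest to the source of TubeToken-Fast's extra computation, and it avoids any suspicion of picking a favorable baseline after the experiments finish.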
|
| 15 |
+
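2. **Correct the third red-light condition of Milestone 1**: remove judgments that cannot be observed in Phase 0, such as "TubeToken-Minimal is not expected to obtain a selection gain", and base the condition entirely on quantities measurable in Phase 0: Recall@32, Oracle Tube J/F, and Oracle Refined J/F.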
|
| 16 |
+
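3. **Make the H3 CosSim baseline for the Fixed Q-Former precise**: the Fixed Q-Former outputs exactly the same representation for different expressions on the same tube, so the cross-expression CosSim is **identically 1.0**, not "close to 1". Whether the Conditioned Q-Former falls significantly below 1.0 is the direct evidence for H3.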
|
| 17 |
+
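4. **Add the gradient-conflict risk of multi-expression training and its mitigations**: if different expressions demand contradictory temporal / audio / spatial attention on the same tube, either accumulate their gradients separately via gradient accumulation, or first sample expression pairs with small semantic differences.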
|
| 18 |
+
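5. **Restructure the main table into a top-venue-friendly format**: the main table is trimmed to 8 rows, keeping only the main public baselines and the main TubeToken configuration; SimToken + SAM2 proposals, learned reranker, matched compute, TubeToken-Minimal, and TubeToken-Fast move to a separate Fairness Analysis Table.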
|
| 19 |
+
6. **Separate per-video and per-expression cost in the efficiency analysis**: SAM2 proposals are a one-time per-video cost; when a video has K expressions, the proposal cost is amortized across them, and only the CondQFormer and the selector are per-expression costs.
|
| 20 |
+
7. **Clarify how Selection Acc@3 handles the null tube**: for positive samples, the object-level Top-3 is computed with the null tube excluded; "GT tube Top-3 but Null Top-1" is computed over the full ranking as a separate null calibration metric.
|
| 21 |
+
8. **Fix a mutually exclusive priority for error decomposition**: each failed sample is assigned to exactly one error category, decided in the priority order Proposal miss → Null FN with GT Top-3 → Null FN without GT Top-3 → Selection error → Refinement error → Null FP.
|
| 22 |
+
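9. **Update the Phase -1 Go/No-Go criteria**: SimToken reproduction and the multi-expression audit can start in parallel; if the SimToken reproduction gap exceeds 1.5 J&F, pause the subsequent experiments; if multi-expression data is insufficient, downgrade H3 direct validation from P0 to P2 and adopt the fallback narrative.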
|
| 23 |
+
10. **Update the Appendix checklist**: fold all final reviewer refinement suggestions into the implementation status, forming a pre-experiment checklist.
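Item 6 in the list above separates a one-time per-video proposal cost from per-expression costs. A minimal sketch of the resulting amortized cost per expression follows, assuming costs are measured in any consistent unit (for example seconds or GFLOPs); the function name and arguments are hypothetical and the numbers in the usage note are illustrative only.

```python
def amortized_cost_per_expression(proposal_cost, per_expression_cost, num_expressions):
    """Cost attributed to one expression when K expressions share one video.

    proposal_cost: one-time per-video cost (e.g. SAM2 proposals).
    per_expression_cost: cost paid for every expression (e.g. CondQFormer + selector).
    num_expressions: K, the number of expressions on the same video.
    """
    if num_expressions < 1:
        raise ValueError("num_expressions must be at least 1")
    return proposal_cost / num_expressions + per_expression_cost
```

With an illustrative proposal cost of 8.0 and a per-expression cost of 0.5, four expressions on one video give 8.0 / 4 + 0.5 = 2.5 per expression, compared with 8.5 for a single expression.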
|
| 24 |
+
|
| 25 |
+
## 1. Core Research Hypotheses
|
| 26 |
+
|
| 27 |
+
### 1.1 Task Restatement
|
| 28 |
+
|
| 29 |
+
Referring Audio-Visual Segmentation (Ref-AVS) should not be modeled only as:
|
| 30 |
+
|
| 31 |
+
\[
|
| 32 |
+
\text{MLLM} \rightarrow \langle SEG \rangle \rightarrow \text{SAM}
|
| 33 |
+
\]
|
| 34 |
+
|
| 35 |
+
but rather as:
|
| 36 |
+
|
| 37 |
+
\[
|
| 38 |
+
\text{Candidate Object Tubes}
|
| 39 |
+
\rightarrow
|
| 40 |
+
\text{Text-Audio Conditioned Tube Selection}
|
| 41 |
+
\rightarrow
|
| 42 |
+
\text{SAM Refinement}
|
| 43 |
+
\]
|
| 44 |
+
|
| 45 |
+
In other words, Ref-AVS is in essence closer to **object-level retrieval + mask refinement**:
|
| 46 |
+
|
| 47 |
+
1. Which candidate object instances are present in the video?
|
| 48 |
+
2. Which object instance is jointly referred to by the text and the audio?
|
| 49 |
+
3. If no object satisfies the conditions, can the model explicitly select Null?
|
| 50 |
+
4. Can the selected object tube be further refined into a high-quality mask?
|
| 51 |
+
|
| 52 |
+
### 1.2 Main Hypotheses
|
| 53 |
+
|
| 54 |
+
**H1: An object tube is a better intermediate representation for Ref-AVS than a global `<SEG>` token.**
|
| 55 |
+
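Tubes explicitly preserve cross-frame identity consistency, reducing the risk of identity switches when there are multiple instances of the same class, occlusion, or objects entering and leaving the frame.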
|
| 56 |
+
|
| 57 |
+
**H2: Null / Existence should be handled through explicit candidate modeling.**
|
| 58 |
+
TubeToken introduces a learnable null tube, turning the Null decision into a candidate selection problem rather than relying on the SAM decoder to passively output an empty mask.
|
| 59 |
+
|
| 60 |
+
**H3: The same candidate tube should expose different temporal evidence under different referring expressions, so the tube representation must be dynamically modulated by the text/audio condition.**
|
| 61 |
+
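In TubeToken, conditional compression is not a substitute for whole-video token pooling but **tube-level evidence summarization**. For different expressions, the same object tube may need to attend to different frames, different actions, different audio segments, or different spatial relations.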
|
| 62 |
+
|
| 63 |
+
**Preconditions and validation requirements for H3:**
|
| 64 |
+
|
| 65 |
+
1. At the data level, first confirm whether Ref-AVSBench contains multiple expressions that point to the same video or the same target.
|
| 66 |
+
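2. If a multi-expression structure exists, training must exploit it explicitly: run forward passes with at least two different expressions for the same video / same tube, sharing the proposals but using different conditional queries.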
|
| 67 |
+
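3. Validating H3 cannot rely on reporting AC alone. AC only shows whether the model attends to the correct region / correct tube; it cannot show that the same tube yields differentiated evidence summaries under different expressions.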
|
| 68 |
+
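4. The direct validation metric for H3 is the cosine similarity of \(\tilde{z}_i\) for the same video and the same matched GT tube under different expressions. Because the Fixed Q-Former does not depend on the expression, it outputs exactly the same representation for different expressions on the same tube, so CosSim \(\equiv 1.0\); the similarity of the conditioned Q-Former should be significantly below 1.0, without a drop in selection performance.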
|
| 69 |
+
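5. If the data audit finds that each video has only one expression on average, H3 is not claimed as a main contribution, and the paper's main line falls back to "proposal-conditioned instance grounding + explicit null reasoning".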
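The H3 metric in item 4 can be sketched as the mean pairwise cosine similarity over the embeddings of one tube under K different expressions. This is an illustrative NumPy sketch, not the plan's evaluation code; the function name and the choice to average over all distinct pairs are assumptions, and embeddings are assumed to be non-zero.

```python
import numpy as np


def cross_expression_cossim(z):
    """Mean pairwise cosine similarity across expressions for one tube.

    z: array of shape (K, D), one tube embedding per expression (K >= 2).
    Assumes every row has non-zero norm.
    """
    z = np.asarray(z, dtype=np.float64)
    if z.ndim != 2 or z.shape[0] < 2:
        raise ValueError("need at least two expression embeddings of shape (K, D)")
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = unit @ unit.T                      # (K, K) cosine similarity matrix
    rows, cols = np.triu_indices(z.shape[0], k=1)
    return float(sim[rows, cols].mean())     # average over distinct pairs only
```

A fixed, expression-independent encoder yields identical rows and therefore a value of 1.0 (up to floating-point error), which is the baseline the conditioned variant is compared against.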
|
| 70 |
+
|
| 71 |
+
**H4: TubeToken's gains must be explained separately through proposal recall, oracle upper bound, selection accuracy, refinement quality, and efficiency breakdown.**
|
| 72 |
+
Reporting only the final J/F/S is not enough: it cannot show where the improvement comes from, nor whether the bottleneck lies in proposal, selection, or refinement.
|
| 73 |
+
|
| 74 |
+
**H5: TubeToken's improvement must still hold under fair compute and fair proposal conditions.**
|
| 75 |
+
Controls such as SimToken + SAM2 proposals, SimToken + matched compute, and SAM2 proposals + learned reranker (no null tube) are required to rule out the explanations "SAM2 proposals are simply stronger" or "it simply uses more compute".
|
| 76 |
+
|
| 77 |
+
## 2. Method Version Definitions
|
| 78 |
+
|
| 79 |
+
### 2.1 TubeToken-Full
|
| 80 |
+
|
| 81 |
+
The full method consists of four stages.
|
| 82 |
+
|
| 83 |
+
---
|
| 84 |
+
|
| 85 |
+
### Stage 1: Candidate tube generation
|
| 86 |
+
|
| 87 |
+
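Run SAM2 automatic mask generation on keyframes to produce candidate masks, then propagate them to the preceding and following frames with the SAM2 tracking / memory mechanism to obtain candidate object tubes: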
|
| 88 |
+
|
| 89 |
+
\[
|
| 90 |
+
\mathcal{O} = \{o_1, o_2, \dots, o_N\}
|
| 91 |
+
\]
|
| 92 |
+
|
| 93 |
+
Each tube:
|
| 94 |
+
|
| 95 |
+
\[
|
| 96 |
+
o_i = \{m_{i,t}, b_{i,t}, f_{i,t}\}_{t=1}^{T}
|
| 97 |
+
\]
|
| 98 |
+
|
| 99 |
+
where:
|
| 100 |
+
|
| 101 |
+
- \(m_{i,t}\): the mask at frame \(t\);
|
| 102 |
+
- \(b_{i,t}\): the bbox at frame \(t\);
|
| 103 |
+
- \(f_{i,t}\): the mask-pooled visual feature.
|
| 104 |
+
|
| 105 |
+
**Implementation convention:**
|
| 106 |
+
By default, run SAM2 AMG on keyframes and use SAM2 propagation on non-keyframes, rather than rerunning AMG on every frame. This keeps the proposal stage from becoming too expensive.
|
| 107 |
+
|
| 108 |
+
---
|
| 109 |
+
|
| 110 |
+
### Stage 2: Text-audio conditioned tube representation
|
| 111 |
+
|
| 112 |
+
The text expression is encoded as \(e_{text}\) and the audio as \(e_{audio}\). Construct the conditioned query:
|
| 113 |
+
|
| 114 |
+
\[
|
| 115 |
+
Q = Q_0 + W_t e_{text} + W_a e_{audio} + W_{ta}(e_{text} \odot e_{audio})
|
| 116 |
+
\]
|
| 117 |
+
|
| 118 |
+
Apply conditional compression to each tube's temporal features \(\{f_{i,t}\}_{t=1}^{T}\):
|
| 119 |
+
|
| 120 |
+
\[
|
| 121 |
+
\tilde{z}_i = \text{CondQFormer}(Q, \{f_{i,t}\}_{t=1}^{T})
|
| 122 |
+
\]
|
| 123 |
+
|
| 124 |
+
The goal of this module is not merely to reduce the number of tokens, but to let the same tube form different evidence summaries under different expressions.
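The conditioned query \(Q = Q_0 + W_t e_{text} + W_a e_{audio} + W_{ta}(e_{text} \odot e_{audio})\) can be sketched in NumPy as follows. The shapes are assumptions not stated in the plan: \(Q_0\) is taken to be \(M\) learnable queries of width \(D\), both embeddings share one dimension \(D_c\) so the element-wise product is defined, and the single condition vector is broadcast to all \(M\) queries.

```python
import numpy as np


def conditioned_queries(q0, e_text, e_audio, w_t, w_a, w_ta):
    """Build Q = Q0 + W_t e_text + W_a e_audio + W_ta (e_text * e_audio).

    q0:      (M, D) base queries (assumed shape).
    e_text:  (Dc,) text embedding.
    e_audio: (Dc,) audio embedding, same size as e_text.
    w_t, w_a, w_ta: (D, Dc) projection matrices.
    """
    cond = w_t @ e_text + w_a @ e_audio + w_ta @ (e_text * e_audio)  # (D,)
    return q0 + cond  # broadcast: the same condition is added to every query
```

Because the condition enters additively, changing the expression changes every query, which is what allows the same tube features to be summarized differently per expression.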
|
| 125 |
+
|
| 126 |
+
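**v3 constraint:** if the dataset contains multi-expression samples, Stage 2 training must explicitly include, within a batch, forward passes of different expressions for the same video / same tube. Otherwise H3 can only serve as a reasoning hypothesis, not as strong experimental evidence.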
|
| 127 |
+
|
| 128 |
+
#### 2.2 Feature Source
|
| 129 |
+
|
| 130 |
+
Default setting:
|
| 131 |
+
|
| 132 |
+
\[
|
| 133 |
+
f_{i,t} = \text{MaskPool}(\text{SAM2ImageEncoder}(I_t), m_{i,t})
|
| 134 |
+
\]
|
| 135 |
+
|
| 136 |
+
That is, Stage 2 reuses SAM2 image encoder features and does not introduce a separate ViT or CLIP visual encoder. This has three benefits:
|
| 137 |
+
|
| 138 |
+
1. proposal generation and tube representation use consistent visual features;
|
| 139 |
+
2. it avoids the compute overhead and fairness disputes that an extra visual encoder would bring;
|
| 140 |
+
3. the efficiency table is clearer, which simplifies comparison with SimToken and SAM2-based baselines.
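The default feature \(f_{i,t} = \text{MaskPool}(\text{SAM2ImageEncoder}(I_t), m_{i,t})\) can be illustrated with a plain masked average. This sketch assumes average pooling, a feature map of shape \((C, H, W)\), and a mask already resized to the feature resolution; none of these details are fixed by the plan, and returning zeros for an empty mask is likewise an assumption.

```python
import numpy as np


def mask_pool(features, mask):
    """Average the feature vectors at the positions covered by the mask.

    features: (C, H, W) feature map for one frame.
    mask:     (H, W) boolean (or 0/1) mask at the same spatial resolution.
    Returns a (C,) vector; zeros if the mask is empty (assumed convention).
    """
    features = np.asarray(features, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    area = mask.sum()
    if area == 0:
        return np.zeros(features.shape[0])
    # Boolean indexing over the two spatial axes yields a (C, area) array.
    return features[:, mask].sum(axis=1) / area
```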
|
| 141 |
+
|
| 142 |
+
Optional extension: if SAM2 encoder features are insufficiently aligned with text/audio semantics, a lightweight projector can be added:
|
| 143 |
+
|
| 144 |
+
\[
|
| 145 |
+
f'_{i,t} = W_v f_{i,t}
|
| 146 |
+
\]
|
| 147 |
+
|
| 148 |
+
By default, however, no additional large-scale visual-language encoder is introduced.
|
| 149 |
+
|
| 150 |
+
---
|
| 151 |
+
|
| 152 |
+
### Stage 3: Tube selection with null tube
|
| 153 |
+
|
| 154 |
+
Add a learnable null tube:
|
| 155 |
+
|
| 156 |
+
\[
|
| 157 |
+
z_{null}
|
| 158 |
+
\]
|
| 159 |
+
|
| 160 |
+
Feed all candidate tubes together with the null tube into the tube selector:
|
| 161 |
+
|
| 162 |
+
\[
|
| 163 |
+
P(i \mid video, audio, text) =
|
| 164 |
+
\text{Softmax}([s_1, s_2, \dots, s_N, s_{null}])
|
| 165 |
+
\]
|
| 166 |
+
|
| 167 |
+
If \(P(null)\) is the largest, output an empty mask; otherwise select the highest-scoring object tube.
|
| 168 |
+
|
| 169 |
+
The existence probability is naturally defined as:
|
| 170 |
+
|
| 171 |
+
\[
|
| 172 |
+
p_{exist} = 1 - P(null)
|
| 173 |
+
\]
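The selection rule and \(p_{exist} = 1 - P(null)\) can be sketched with a softmax over the object scores plus the null score. This is an illustrative NumPy sketch; the function name is hypothetical, and breaking an exact tie in favor of the object tube is an assumption (the plan only says Null is chosen when \(P(null)\) is the largest).

```python
import numpy as np


def select_tube(object_scores, null_score):
    """Softmax over [s_1, ..., s_N, s_null]; return (selected, p_exist, probs).

    selected is the index of the chosen object tube, or None when the null
    tube has the strictly highest probability (empty-mask output).
    """
    scores = np.append(np.asarray(object_scores, dtype=np.float64), null_score)
    shifted = scores - scores.max()           # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    best = int(np.argmax(probs))              # first maximum wins on ties
    selected = None if best == len(probs) - 1 else best
    p_exist = float(1.0 - probs[-1])          # p_exist = 1 - P(null)
    return selected, p_exist, probs
```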
|
| 174 |
+
|
| 175 |
+
#### Default Tube Selector Structure
|
| 176 |
+
|
| 177 |
+
By default, the selector uses:
|
| 178 |
+
|
| 179 |
+
1. reference query \(q_{ref}=\text{MLP}([e_{text},e_{audio}])\);
|
| 180 |
+
2. tube tokens \(\{\tilde{z}_i\}_{i=1}^{N}\);
|
| 181 |
+
3. inter-tube self-attention;
|
| 182 |
+
4. reference-conditioned cross-attention;
|
| 183 |
+
5. per-tube classification head.
|
| 184 |
+
|
| 185 |
+
Required ablations:
|
| 186 |
+
|
| 187 |
+
- w/ inter-tube self-attention;
|
| 188 |
+
- w/o inter-tube self-attention;
|
| 189 |
+
- independent tube scoring, where each tube is scored independently by a linear layer on \([q_{ref}; \tilde{z}_i]\).
|
| 190 |
+
|
| 191 |
+
---
|
| 192 |
+
|
| 193 |
+
### Stage 4: SAM refinement
|
| 194 |
+
|
| 195 |
+
After a tube is selected, by default only the tube bbox is used as the box prompt, combined with the text/audio semantic prompt for SAM refinement:
|
| 196 |
+
|
| 197 |
+
\[
|
| 198 |
+
\hat{m}_t = \text{SAMRefine}(I_t, b_{i,t}, q_{ref})
|
| 199 |
+
\]
|
| 200 |
+
|
| 201 |
+
By default, the tube mask is not used as a mask prompt, to avoid the interpretability problem of "self-refinement". The tube mask is used only for:
|
| 202 |
+
|
| 203 |
+
1. generating the bbox;
|
| 204 |
+
2. extracting the tube feature;
|
| 205 |
+
3. proposal matching;
|
| 206 |
+
4. computing the oracle upper bound.
|
| 207 |
+
|
| 208 |
+
Additional controls are required:
|
| 209 |
+
|
| 210 |
+
- bbox-only prompt;
|
| 211 |
+
- bbox + semantic prompt;
|
| 212 |
+
- bbox + mask prompt.
|
| 213 |
+
|
| 214 |
+
If bbox + mask prompt brings no clear gain, the paper body uses bbox-only or bbox + semantic prompt as the default version.
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
## 3. Data Audit and Diagnostic Subset Construction
|
| 219 |
+
|
| 220 |
+
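A data audit must be completed before formal training. This step determines whether the subsequent experiments are convincing enough. v3 promotes the data audit to **Phase -1**, in which the multi-expression structure and SimToken reproduction are prerequisites for entering Phase 0.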
|
| 221 |
+
|
| 222 |
+
### 3.1 Required Statistics
|
| 223 |
+
|
| 224 |
+
| Item | Purpose |
|
| 225 |
+
|---|---|
|
| 226 |
+
| Number of referring expressions per video | Determine whether H3 can be trained and validated directly |
|
| 227 |
+
| Number of expressions per GT object / tube | Build the H3 direct validation subset |
|
| 228 |
+
| Whether the positive expression set \(\mathcal{P}_i\) in the SimToken alignment loss can be reused | Decide the implementation path for multi-expression training |
|
| 229 |
+
| Proportion of Null samples | Gauge the training difficulty of the null tube / weighted CE |
|
| 230 |
+
| Proportion of frames in which the GT target is visible | Decide whether frame-level existence is needed; do not introduce it if the proportion is low |
|
| 231 |
+
| Distribution of the target's first-appearance time | Used to build the late-target subset and test whether first-frame bias is mitigated |
|
| 232 |
+
| Proportion of videos with multiple same-category instances | Tests whether inter-tube reasoning and hard negatives are necessary |
|
| 233 |
+
| Proportion of small / occluded targets | Assesses proposal recall risk |
|
| 234 |
+
| Proportion of audio-dependent expressions | Tests whether audio-conditioned compression has room to help |
|
| 235 |
+
| Proportion of spatial-relation expressions | Tests whether a spatial/relation query is necessary |
|
| 236 |
+
| Relationship between proposal misses and target attributes | Analyzes systematic SAM2 proposal bias on small, occluded, and unseen-category targets |
|
| 237 |
+
|
| 238 |
+
### 3.1.1 Decision rules for the multi-expression audit
|
| 239 |
+
|
| 240 |
+
| Audit result | Impact on H3 and CondQFormer |
|
| 241 |
+
|---|---|
|
| 242 |
+
| Average number of expressions per video > 1.5, and the same GT object has multiple expressions | Proceed with H3 as planned; use multi-expression training and direct cosine validation |
|
| 243 |
+
| Most videos have only 1 expression, but a few have several | Treat H3 as a diagnostic contribution; report direct validation on the multi-expression subset |
|
| 244 |
+
| Essentially every video has only 1 expression | Do not make H3 a core claim; reframe CondQFormer as learned tube compression / multimodal query adaptation |
|
| 245 |
+
|
| 246 |
+
---
|
| 247 |
+
|
| 248 |
+
### 3.2 Diagnostic subsets
|
| 249 |
+
|
| 250 |
+
Build at least the following subsets.
|
| 251 |
+
|
| 252 |
+
#### 3.2.1 Late-target subset
|
| 253 |
+
|
| 254 |
+
Samples whose target first becomes visible in the second half of the video.
|
| 255 |
+
|
| 256 |
+
Definition:
|
| 257 |
+
|
| 258 |
+
\[
|
| 259 |
+
t_{first} = \min \{t \mid g_t \neq \emptyset\}
|
| 260 |
+
\]
|
| 261 |
+
|
| 262 |
+
If:
|
| 263 |
+
|
| 264 |
+
\[
|
| 265 |
+
t_{first} > 0.5T
|
| 266 |
+
\]
|
| 267 |
+
|
| 268 |
+
the sample is assigned to the late-target subset.
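
The late-target rule can be sketched as a small helper. This is only an illustration: the per-frame visibility list `gt_visible` and both function names are hypothetical, and frames are assumed to be indexed from 1 to \(T\) as in the formulas of this plan.

```python
from typing import Optional, Sequence


def first_visible_frame(gt_visible: Sequence[bool]) -> Optional[int]:
    """Return the 1-based index of the first frame with a non-empty GT mask."""
    for t, visible in enumerate(gt_visible, start=1):
        if visible:
            return t
    return None  # the target never appears (Null sample)


def is_late_target(gt_visible: Sequence[bool]) -> bool:
    """True when t_first > 0.5 * T, following the definition above."""
    t_first = first_visible_frame(gt_visible)
    if t_first is None:
        return False
    return t_first > 0.5 * len(gt_visible)
```

Null samples return `False` here because they have no first visible frame; they belong to the Null subset instead.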
|
| 269 |
+
|
| 270 |
+
---
|
| 271 |
+
|
| 272 |
+
#### 3.2.2 Audio-critical subset
|
| 273 |
+
|
| 274 |
+
v3 keeps the two-stage definition.
|
| 275 |
+
|
| 276 |
+
**Stage A: coarse screening**
|
| 277 |
+
|
| 278 |
+
Filter by text keywords:
|
| 279 |
+
|
| 280 |
+
- sounding;
|
| 281 |
+
- making sound;
|
| 282 |
+
- longest sound;
|
| 283 |
+
- intermittent sound;
|
| 284 |
+
- silent;
|
| 285 |
+
- audio;
|
| 286 |
+
- heard;
|
| 287 |
+
- emitting sound;
|
| 288 |
+
- playing instrument, etc.
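
A minimal sketch of the Stage A keyword screen, assuming plain case-insensitive substring matching over the expression text. The tuple `AUDIO_KEYWORDS` copies the list above; the function name is hypothetical.

```python
# Keywords copied from the Stage A list; matching is a plain substring test.
AUDIO_KEYWORDS = (
    "sounding",
    "making sound",
    "longest sound",
    "intermittent sound",
    "silent",
    "audio",
    "heard",
    "emitting sound",
    "playing instrument",
)


def is_audio_candidate(expression: str) -> bool:
    """Return True if the expression contains any Stage A audio keyword."""
    text = expression.lower()
    return any(keyword in text for keyword in AUDIO_KEYWORDS)
```

Substring matching is deliberately coarse: it will miss paraphrases such as "playing the guitar", so the keyword list would need extending in practice, and the strict subset still has to come from Stage B.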
|
| 289 |
+
|
| 290 |
+
**Stage B: fine screening**
|
| 291 |
+
|
| 292 |
+
After the w/o Audio variant is trained, samples meeting all of the following conditions are assigned to the strict audio-critical subset:
|
| 293 |
+
|
| 294 |
+
1. the Full model predicts correctly or scores clearly above the threshold;
|
| 295 |
+
2. the w/o Audio model predicts incorrectly or its J/F drops significantly;
|
| 296 |
+
3. the video contains at least two visual candidates, and vision alone cannot reliably identify the target.
|
| 297 |
+
|
| 298 |
+
This excludes pseudo audio-critical samples whose expression contains audio words but whose target is uniquely resolvable from vision.
|
| 299 |
+
|
| 300 |
+
---
|
| 301 |
+
|
| 302 |
+
#### 3.2.3 Same-category distractor subset
|
| 303 |
+
|
| 304 |
+
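The video contains multiple candidates of the same category or of highly similar appearance, and the expression must distinguish between instances.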
|
| 305 |
+
|
| 306 |
+
Data sources, in order of preference:
|
| 307 |
+
|
| 308 |
+
1. the dataset's original object annotations;
|
| 309 |
+
2. if no annotations are available, zero-shot object discovery with CLIP / Grounding DINO / OWL-ViT;
|
| 310 |
+
3. clustering by mask-pooled CLIP similarity over SAM2 proposals to approximately identify same-category candidates.
|
| 311 |
+
|
| 312 |
+
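The construction procedure and the manually spot-checked accuracy must be reported for this subset, so that reviewers do not question its reliability.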
|
| 313 |
+
|
| 314 |
+
---
|
| 315 |
+
|
| 316 |
+
#### 3.2.4 Null subset
|
| 317 |
+
|
| 318 |
+
The original Null samples, further divided into:
|
| 319 |
+
|
| 320 |
+
1. visual object exists but not referred;
|
| 321 |
+
2. audio exists but no valid visual target;
|
| 322 |
+
3. text refers to absent object;
|
| 323 |
+
4. audio-text conflict / ambiguous null.
|
| 324 |
+
|
| 325 |
+
---
|
| 326 |
+
|
| 327 |
+
#### 3.2.5 Small / occluded target subset
|
| 328 |
+
|
| 329 |
+
Used to analyze proposal misses.
|
| 330 |
+
|
| 331 |
+
Initial definitions:
|
| 332 |
+
|
| 333 |
+
- small: GT mask area is less than 5% of the image area;
|
| 334 |
+
- heavily occluded: fewer than \(0.5T\) consecutive visible frames, or mask area fluctuates sharply over time;
|
| 335 |
+
- partial target: the target appears in only some frames.
|
| 336 |
+
|
| 337 |
+
---
|
| 338 |
+
|
| 339 |
+
#### 3.2.6 Multi-expression H3 subset
|
| 340 |
+
|
| 341 |
+
Used to validate H3 directly.
|
| 342 |
+
|
| 343 |
+
Sample conditions:
|
| 344 |
+
|
| 345 |
+
1. the same video has at least two referring expressions;
|
| 346 |
+
2. these expressions refer to the same GT object / GT tube, or at least to the same target instance that can be matched reliably;
|
| 347 |
+
3. the expressions differ semantically, e.g., in category, action, audio, spatial relation, interacting object, or temporal segment;
|
| 348 |
+
4. the SAM2 proposals contain a matched GT tube, so that proposal misses do not confound H3 validation.
|
| 349 |
+
|
| 350 |
+
Reported items:
|
| 351 |
+
|
| 352 |
+
- average number of expressions per video;
|
| 353 |
+
- average number of expressions per GT object;
|
| 354 |
+
- number of samples in the H3 subset;
|
| 355 |
+
- distribution of expression difference types;
|
| 356 |
+
- manually spot-checked accuracy.
|
| 357 |
+
|
| 358 |
+
## 4. Phase 0: Proposal Recall and Oracle Upper-Bound Pilot Experiments
|
| 359 |
+
|
| 360 |
+
This is the go / no-go experiment for TubeToken. If proposal recall or the oracle upper bound is insufficient, TubeToken's performance ceiling is capped by the proposal stage.
|
| 361 |
+
|
| 362 |
+
### 4.0 Phase -1 prerequisite baseline: SimToken reproduction
|
| 363 |
+
|
| 364 |
+
The SimToken reproduction must be completed before running the proposal recall and oracle upper-bound experiments.
|
| 365 |
+
|
| 366 |
+
Requirements:
|
| 367 |
+
|
| 368 |
+
1. Use the same data split, input resolution, audio features, training epochs, batch size, optimizer, scheduler, and evaluation script as the subsequent TubeToken experiments.
|
| 369 |
+
2. Use the SimToken J/F/S reproduced by the authors as the primary baseline in all Go/No-Go conditions.
|
| 370 |
+
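3. Official SimToken numbers serve only as a side note; if the reproduced and official numbers differ by more than 1.5 J&F, locate the source of the discrepancy first.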
|
| 371 |
+
4. State explicitly in the paper:
|
| 372 |
+
|
| 373 |
+
> All comparisons are conducted under the same training configuration as SimToken (reproduced), with official results cited where applicable.
|
| 374 |
+
|
| 375 |
+
### 4.1 Setup
|
| 376 |
+
|
| 377 |
+
- Proposal model: SAM2 automatic mask generation.
|
| 378 |
+
- Keyframe strategies:
|
| 379 |
+
- stride = 4;
|
| 380 |
+
- stride = 8;
|
| 381 |
+
- stride = 16;
|
| 382 |
+
- first / middle / last + audio-peak frames;
|
| 383 |
+
- uniform + motion-peak frames;
|
| 384 |
+
- uniform + audio-peak + motion-peak frames.
|
| 385 |
+
- Propagation: use the SAM2 memory / tracking mechanism to generate complete tubes.
|
| 386 |
+
- Candidate numbers: \(N=16,32,64,128\).
|
| 387 |
+
|
| 388 |
+
---
|
| 389 |
+
|
| 390 |
+
### 4.2 Tube matching definition
|
| 391 |
+
|
| 392 |
+
v3 uses the **GT-visible-frame mean tube IoU**, so that late-target and partial-target samples are not diluted by empty frames.
|
| 393 |
+
|
| 394 |
+
Let:
|
| 395 |
+
|
| 396 |
+
\[
|
| 397 |
+
\mathcal{T}_g = \{t \mid g_t \neq \emptyset\}
|
| 398 |
+
\]
|
| 399 |
+
|
| 400 |
+
Then:
|
| 401 |
+
|
| 402 |
+
\[
|
| 403 |
+
IoU_{tube}(o_i, g)=
|
| 404 |
+
\frac{1}{|\mathcal{T}_g|}
|
| 405 |
+
\sum_{t \in \mathcal{T}_g}
|
| 406 |
+
IoU(m_{i,t}, g_t)
|
| 407 |
+
\]
|
| 408 |
+
|
| 409 |
+
If:
|
| 410 |
+
|
| 411 |
+
\[
|
| 412 |
+
\max_i IoU_{tube}(o_i, g) \ge 0.5
|
| 413 |
+
\]
|
| 414 |
+
|
| 415 |
+
the GT is considered covered by the proposals.
|
| 416 |
+
|
| 417 |
+
A stricter version is reported as well:
|
| 418 |
+
|
| 419 |
+
\[
|
| 420 |
+
IoU_{tube}^{all}
|
| 421 |
+
=
|
| 422 |
+
\frac{1}{T}
|
| 423 |
+
\sum_{t=1}^{T}
|
| 424 |
+
IoU(m_{i,t}, g_t)
|
| 425 |
+
\]
|
| 426 |
+
|
| 427 |
+
It is used to analyze whether a tube produces spurious masks in frames where the GT is absent.
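
The two tube IoU variants can be sketched as follows. This is an illustrative helper only: masks are represented as sets of `(row, col)` pixels for simplicity, all names are hypothetical, and treating two empty masks as IoU 1.0 in the all-frame variant is an assumption that the plan does not fix.

```python
from typing import List, Set, Tuple

Mask = Set[Tuple[int, int]]  # one frame's mask as a set of (row, col) pixels


def frame_iou(pred: Mask, gt: Mask) -> float:
    """IoU of two masks; two empty masks count as 1.0 (assumption)."""
    union = len(pred | gt)
    if union == 0:
        return 1.0
    return len(pred & gt) / union


def tube_iou_visible(tube: List[Mask], gt: List[Mask]) -> float:
    """Mean IoU over the frames where the GT mask is non-empty."""
    visible = [t for t, g in enumerate(gt) if g]
    if not visible:
        return 0.0  # Null sample: no GT-visible frame to match against
    return sum(frame_iou(tube[t], gt[t]) for t in visible) / len(visible)


def tube_iou_all(tube: List[Mask], gt: List[Mask]) -> float:
    """Mean IoU over all T frames, penalizing masks on GT-empty frames."""
    return sum(frame_iou(m, g) for m, g in zip(tube, gt)) / len(gt)


def is_covered(tubes: List[List[Mask]], gt: List[Mask], thr: float = 0.5) -> bool:
    """GT counts as covered if some candidate tube reaches the threshold."""
    return max(tube_iou_visible(tube, gt) for tube in tubes) >= thr
```

A gap between the two values for the same tube signals extra mask area on frames where the target is absent.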
|
| 428 |
+
|
| 429 |
+
### 4.2.1 Precise definition of Oracle Refined J/F
|
| 430 |
+
|
| 431 |
+
**Oracle Tube J/F**: among the top-N candidate tubes, select the tube with the highest \(IoU_{tube}\) and evaluate J/F directly on its mask.
|
| 432 |
+
|
| 433 |
+
**Oracle Refined J/F**: among the top-N candidate tubes, select the oracle tube, use only its bbox as the SAM / SAM2 box prompt, and evaluate J/F after refinement.
|
| 434 |
+
|
| 435 |
+
Constraints:
|
| 436 |
+
|
| 437 |
+
1. the GT mask must not be used as a mask prompt;
|
| 438 |
+
2. the oracle GT box must not be used;
|
| 439 |
+
3. the bbox comes from the oracle proposal tube;
|
| 440 |
+
4. the refinement settings must match the actual Stage 4 defaults.
|
| 441 |
+
|
| 442 |
+
Only then is Oracle Refined J/F an attainable upper bound for the actual TubeToken refinement, rather than an idealized bound that depends on the GT mask.
|
| 443 |
+
|
| 444 |
+
---
|
| 445 |
+
|
| 446 |
+
### 4.3 Metrics
|
| 447 |
+
|
| 448 |
+
| Metric | Explanation |
|
| 449 |
+
|---|---|
|
| 450 |
+
| Recall@16 / 32 / 64 / 128 | Whether the GT tube is among the top-N tubes |
|
| 451 |
+
| Oracle Tube J/F | Proposal upper bound obtained by always selecting the tube with the highest \(IoU_{tube}\) |
|
| 452 |
+
| Oracle Refined J/F | Upper bound after selecting the oracle tube and running SAM refinement with the proposal bbox prompt only |
|
| 453 |
+
| Proposal coverage by subset | Reported separately on the late-target, small, occluded, and unseen subsets |
|
| 454 |
+
| Proposal miss % | Proportion of samples whose GT is not covered |
|
| 455 |
+
| Average tubes per video | Compute cost and pruning difficulty |
|
| 456 |
+
| Proposal generation latency | Efficiency assessment |
|
| 457 |
+
| Tube temporal purity | Whether a tube produces many false positives in frames where the GT is absent |
|
| 458 |
+
|
| 459 |
+
---
|
| 460 |
+
|
| 461 |
+
### 4.4 Go / No-Go decision criteria
|
| 462 |
+
|
| 463 |
+
In the thresholds below, SimToken always means the **SimToken reproduced by the authors**, not merely the cited official numbers.
|
| 464 |
+
|
| 465 |
+
#### 4.4.1 Milestone 1 green-light conditions
|
| 466 |
+
|
| 467 |
+
All of the following hold:
|
| 468 |
+
|
| 469 |
+
1. Recall@32 ≥ 85%, with matching based on GT-visible-frame IoU ≥ 0.5;
|
| 470 |
+
2. Oracle Tube J/F ≥ reproduced SimToken J/F + 5%;
|
| 471 |
+
3. Oracle Refined J/F ≥ Oracle Tube J/F + 3%, indicating clear headroom for SAM refinement;
|
| 472 |
+
4. Small / occluded subset Recall@32 ≥ 70%, so that proposals have no systematic blind spot on key hard samples.
|
| 473 |
+
|
| 474 |
+
Strategy: proceed with TubeToken as planned; the default Balanced configuration uses \(N=32\).
|
| 475 |
+
|
| 476 |
+
#### 4.4.2 Milestone 1 yellow-light conditions
|
| 477 |
+
|
| 478 |
+
| Condition | Follow-up strategy |
|
| 479 |
+
|---|---|
|
| 480 |
+
| Recall@32 is 80%-85%, and Oracle Tube J/F meets the green-light condition | Proceed, but default to \(N=64\) and analyze proposal misses prominently in the paper |
|
| 481 |
+
| Oracle Tube J/F is only ≥ SimToken + 2%, but Oracle Refined J/F ≥ SimToken + 5% | Proceed, but shift the paper's focus from selection to refinement; emphasize proposal-conditioned refinement |
|
| 482 |
+
| Recall@32 ≥ 85%, but small/occluded Recall@32 < 70% | Proceed with the main line, but add backup experiments with detector-assisted or high-resolution proposals |
|
| 483 |
+
|
| 484 |
+
#### 4.4.3 Milestone 1 red-light conditions
|
| 485 |
+
|
| 486 |
+
If any condition holds, pause the TubeToken main line and switch to EC-SimToken or redo the proposal stage first:
|
| 487 |
+
|
| 488 |
+
1. Recall@64 < 80%;
|
| 489 |
+
2. Oracle Tube J/F ≤ reproduced SimToken J/F;
|
| 490 |
+
3. Recall@32 ≥ 85%, and the gap between Oracle Refined J/F and Oracle Tube J/F is < 1%, and Oracle Tube J/F ≤ reproduced SimToken J/F + 2%.
|
| 491 |
+
|
| 492 |
+
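The third red-light condition uses only quantities observable in Phase 0. It means that proposal quality is only slightly better than SimToken and bbox-only refinement adds almost nothing; in that case TubeToken lacks a sufficient foothold on this dataset and should not rely on selection gains that cannot be verified before Milestone 2.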
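
The green / yellow / red rules of 4.4.1 to 4.4.3 can be encoded as one decision function. This is a sketch under stated assumptions: all inputs are percentages, "+5%" and similar offsets are read as percentage points, red conditions are checked first, and metric combinations that the plan does not cover return `"unspecified"`. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phase0Metrics:
    recall32: float                 # Recall@32 in percent
    recall64: float                 # Recall@64 in percent
    small_occluded_recall32: float  # Recall@32 on the small / occluded subset
    oracle_tube_jf: float           # Oracle Tube J/F
    oracle_refined_jf: float        # Oracle Refined J/F
    simtoken_jf: float              # reproduced SimToken J/F


def milestone1_decision(m: Phase0Metrics) -> str:
    tube_gain = m.oracle_tube_jf - m.simtoken_jf
    refine_gain = m.oracle_refined_jf - m.oracle_tube_jf
    refined_over_sim = m.oracle_refined_jf - m.simtoken_jf

    # Red: any single condition pauses the TubeToken main line.
    if (m.recall64 < 80
            or tube_gain <= 0
            or (m.recall32 >= 85 and refine_gain < 1 and tube_gain <= 2)):
        return "red"

    # Green: all four conditions must hold.
    if (m.recall32 >= 85 and tube_gain >= 5 and refine_gain >= 3
            and m.small_occluded_recall32 >= 70):
        return "green"

    # Yellow: any row of the yellow-light table.
    if ((80 <= m.recall32 < 85 and tube_gain >= 5)
            or (tube_gain >= 2 and refined_over_sim >= 5)
            or (m.recall32 >= 85 and m.small_occluded_recall32 < 70)):
        return "yellow"

    return "unspecified"  # combination not covered by the plan
```

Fixing such a function before Phase 0 numbers are available keeps the go / no-go call from being adjusted after the fact.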
|
| 493 |
+
|
| 494 |
+
---
|
| 495 |
+
|
| 496 |
+
### 4.5 Fallback strategies if recall is insufficient
|
| 497 |
+
|
| 498 |
+
1. increase the number of keyframes;
|
| 499 |
+
2. use audio-peak / motion-peak keyframes;
|
| 500 |
+
3. run an open-vocabulary detector on the category words in the text to generate boxes, then feed them to SAM2;
|
| 501 |
+
4. use the SimToken / EC-SimToken mask as an additional proposal;
|
| 502 |
+
5. introduce a hybrid fallback: if proposal confidence is low, fall back to global semantic prompt segmentation.
|
| 503 |
+
|
| 504 |
+
## 5. Baselines and Model Variants
|
| 505 |
+
|
| 506 |
+
### 5.1 Models that must be reproduced / compared
|
| 507 |
+
|
| 508 |
+
| Model | Role |
|
| 509 |
+
|---|---|
|
| 510 |
+
| EEMC | Original Ref-AVS baseline |
|
| 511 |
+
| TSAM | SAM-based Ref-AVS baseline |
|
| 512 |
+
| SAM2-LOVE | SAM2-based Ref-AVS baseline |
|
| 513 |
+
| SimToken | Most direct point of comparison; must be reproduced |
|
| 514 |
+
| EC-SimToken | Strengthened global-token baseline, showing that TubeToken does not only beat a weak baseline |
|
| 515 |
+
| SimToken + SAM2 proposals | Controls for the gain from SAM2 proposals; uses zero-parameter reranking |
|
| 516 |
+
| SAM2 proposals + learned reranker (no null tube) | Separates the contributions of the learned tube reranker and the null tube |
|
| 517 |
+
| SimToken + matched compute | Fair comparison at equal compute |
|
| 518 |
+
| TubeToken-Minimal | Minimal tube selection framework |
|
| 519 |
+
| TubeToken-Full | Full method |
|
| 520 |
+
|
| 521 |
+
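If EEMC, TSAM, and SAM2-LOVE cannot be fully reproduced, their official results may be cited; however, SimToken, SimToken + SAM2 proposals, SAM2 proposals + learned reranker, SimToken + matched compute, and TubeToken must be compared under the same training / input / evaluation settings.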
|
| 522 |
+
|
| 523 |
+
---
|
| 524 |
+
|
| 525 |
+
### 5.2 Main TubeToken ablations
|
| 526 |
+
|
| 527 |
+
| Variant | Purpose |
|
| 528 |
+
|---|---|
|
| 529 |
+
| TubeToken-Full | Full model |
|
| 530 |
+
| TubeToken-Minimal | SAM2 proposals + fixed tube feature + selector + null tube; no CondQFormer, no refinement |
|
| 531 |
+
| SAM2 proposals + learned reranker (no null tube) | Separates the contributions of the learned selector and the null tube |
|
| 532 |
+
| w/o null tube | Tests explicit Null modeling |
|
| 533 |
+
| null tube → binary existence head | Compares the null tube with an extra binary classification head |
|
| 534 |
+
| w/o null tube + mask-area threshold | Distinguishes whether Null performance comes from the tube framework or the null tube design |
|
| 535 |
+
| fixed Q-Former | Tests whether conditioning is what helps, rather than the added parameters |
|
| 536 |
+
| text-conditioned only | Tests the contribution of text conditioning |
|
| 537 |
+
| audio-conditioned only | Tests the contribution of audio conditioning |
|
| 538 |
+
| text+audio-conditioned | Full conditional compression |
|
| 539 |
+
| w/o inter-tube self-attention | Tests whether relative comparison across tubes is necessary |
|
| 540 |
+
| independent tube scoring | Each tube is scored independently by a linear layer over \([q_{ref};z_i]\) |
|
| 541 |
+
| w/o SAM refinement | Tests tube selection ability on its own |
|
| 542 |
+
| bbox prompt refinement | Default refinement scheme |
|
| 543 |
+
| bbox + semantic prompt refinement | Tests whether the semantic prompt contributes |
|
| 544 |
+
| bbox + mask prompt refinement | Checks whether the mask prompt brings gains or overfitting |
|
| 545 |
+
| N=16/32/64/128 | Analyzes the candidate count and the recall/efficiency trade-off |
|
| 546 |
+
| stride=4/8/16 | Analyzes the keyframe count and the efficiency trade-off |
|
| 547 |
+
|
| 548 |
+
---
|
| 549 |
+
|
| 550 |
+
### 5.3 Fairness control variants
|
| 551 |
+
|
| 552 |
+
#### 5.3.1 SimToken + SAM2 proposals: zero-parameter proposal reranking baseline
|
| 553 |
+
|
| 554 |
+
Purpose: answer the question "Is TubeToken's gain simply due to using SAM2 proposals?"
|
| 555 |
+
|
| 556 |
+
This baseline must use parameter-free reranking; the vague wording "rerank or fusion" is not acceptable.
|
| 557 |
+
|
| 558 |
+
Implementation:
|
| 559 |
+
|
| 560 |
+
1. Keep SimToken's global `<SEG>` generation to obtain \(F_{seg}\).
|
| 561 |
+
2. Use exactly the same SAM2 proposals and tube construction as TubeToken.
|
| 562 |
+
3. Extract a per-frame mask-pooled feature \(f_{i,t}\) for each proposal tube.
|
| 563 |
+
4. Use the following zero-parameter score:
|
| 564 |
+
|
| 565 |
+
\[
|
| 566 |
+
\text{score}(o_i)
|
| 567 |
+
=
|
| 568 |
+
F_{seg}^{\top}
|
| 569 |
+
\cdot
|
| 570 |
+
\frac{1}{|\mathcal{T}|}
|
| 571 |
+
\sum_t f_{i,t}
|
| 572 |
+
\]
|
| 573 |
+
|
| 574 |
+
5. Select the highest-scoring proposal tube and use the same output settings as TubeToken-Minimal.
|
| 575 |
+
|
| 576 |
+
This scheme adds no learnable parameters and uses \(F_{seg}\) the same way SimToken does, which minimizes the risk of reviewers objecting that the control was weakened.
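
A minimal sketch of the zero-parameter score: the dot product between \(F_{seg}\) and the temporal mean of the tube's mask-pooled features. Plain Python lists stand in for tensors, and the function names are hypothetical; a real implementation would operate on the model's feature tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_feature(frames: Sequence[Vector]) -> List[float]:
    """Average the per-frame features f_{i,t} of one tube over time."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def rerank_score(f_seg: Vector, tube_frames: Sequence[Vector]) -> float:
    """score(o_i) = F_seg^T * mean_t f_{i,t}; no learnable parameters."""
    pooled = mean_feature(tube_frames)
    return sum(a * b for a, b in zip(f_seg, pooled))


def select_tube(f_seg: Vector, tubes: Sequence[Sequence[Vector]]) -> int:
    """Return the index of the highest-scoring proposal tube."""
    scores = [rerank_score(f_seg, tube) for tube in tubes]
    return max(range(len(tubes)), key=scores.__getitem__)
```

Because the score is a plain dot product, \(F_{seg}\) and \(f_{i,t}\) must live in the same feature space; how that is arranged is outside this sketch.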
|
| 577 |
+
|
| 578 |
+
---
|
| 579 |
+
|
| 580 |
+
#### 5.3.2 SAM2 proposals + learned reranker (no null tube)
|
| 581 |
+
|
| 582 |
+
Purpose: answer the question "Does the gain of TubeToken-Minimal come from the learned tube selector or from the null tube?"
|
| 583 |
+
|
| 584 |
+
Implementation:
|
| 585 |
+
|
| 586 |
+
1. Use the same SAM2 proposals, tube construction, tube feature, and \(q_{ref}\) as TubeToken-Minimal.
|
| 587 |
+
2. Train a learned reranker / classifier to score the non-null candidate tubes.
|
| 588 |
+
3. Do not add a learnable null tube.
|
| 589 |
+
4. Handle Null cases with a mask-area threshold or a calibrated score threshold.
|
| 590 |
+
5. Compare with TubeToken-Minimal: if TubeToken-Minimal is clearly better, the null tube contributes independently; if the learned reranker is already close to TubeToken-Minimal, the gain comes mainly from learned tube selection.
|
| 591 |
+
|
| 592 |
+
---
|
| 593 |
+
|
| 594 |
+
#### 5.3.3 SimToken + matched compute: pre-registered equal-compute baseline
|
| 595 |
+
|
| 596 |
+
Purpose: answer the question "Is TubeToken merely trading compute for performance?"
|
| 597 |
+
|
| 598 |
+
v4 fixes a single implementation and no longer keeps multiple candidate schemes:
|
| 599 |
+
|
| 600 |
+
> **SimToken + multiple keyframe prompting with the same number of keyframes as TubeToken-Fast.**
|
| 601 |
+
|
| 602 |
+
Implementation conventions:
|
| 603 |
+
|
| 604 |
+
1. Use the same number of keyframes as TubeToken-Fast; by default this is the keyframe budget of TubeToken-Fast at stride=16.
|
| 605 |
+
2. Run SimToken's global `<SEG>` / SAM prompting pipeline on each keyframe separately.
|
| 606 |
+
3. Merge the predictions from the multiple keyframes into a video-level mask with a single propagation / aggregation rule, which must be fixed before the experiments.
|
| 607 |
+
4. Do not use SAM2 proposal tube reranking, a learned tube selector, or a null tube.
|
| 608 |
+
5. Report latency, FLOPs, the number of SAM/SAM2 calls, and the MLLM token count, keeping them as close as possible to the TubeToken-Fast compute budget.
|
| 609 |
+
|
| 610 |
+
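Rationale: the extra compute of TubeToken-Fast comes mainly from additional keyframes and proposal/propagation processing, and multiple keyframe prompting is the most direct, most interpretable, and least contestable equal-compute enhancement on the SimToken side. This baseline must be pre-registered before the experiments begin and must not be swapped afterwards based on the final results.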
|
| 611 |
+
|
| 612 |
+
## 6. Training Design
|
| 613 |
+
|
| 614 |
+
### 6.1 Tube label assignment
|
| 615 |
+
|
| 616 |
+
In positive videos, the candidate tube with the highest GT-visible-frame mean tube IoU is chosen as the positive tube:
|
| 617 |
+
|
| 618 |
+
\[
|
| 619 |
+
i^* = \arg\max_i IoU_{tube}(o_i,g)
|
| 620 |
+
\]
|
| 621 |
+
|
| 622 |
+
If the maximum IoU is below 0.5, the sample is marked as a proposal miss. During training, such a sample:
|
| 623 |
+
|
| 624 |
+
- is not used in the tube classification loss;
|
| 625 |
+
- may be used for proposal-miss statistics;
|
| 626 |
+
- should not have a low-IoU tube forced to serve as the positive, since that would contaminate the selector.
|
| 627 |
+
|
| 628 |
+
In Null samples, the positive class is the null tube.
|
| 629 |
+
|
| 630 |
+
---
|
| 631 |
+
|
| 632 |
+
### 6.2 Loss function
|
| 633 |
+
|
| 634 |
+
The default v3 total loss **does not include the undefined \(\mathcal{L}_{cond}\)**. Null weighting is folded into the tube classification CE rather than written as a separate \(\mathcal{L}_{null}\).
|
| 635 |
+
|
| 636 |
+
Default total loss:
|
| 637 |
+
|
| 638 |
+
\[
|
| 639 |
+
\mathcal{L}
|
| 640 |
+
=
|
| 641 |
+
\mathcal{L}_{tube}^{weighted}
|
| 642 |
+
+
|
| 643 |
+
\lambda_m y\mathcal{L}_{mask}
|
| 644 |
+
+
|
| 645 |
+
\lambda_r\mathcal{L}_{rank}
|
| 646 |
+
\]
|
| 647 |
+
|
| 648 |
+
where:
|
| 649 |
+
|
| 650 |
+
\[
|
| 651 |
+
\mathcal{L}_{tube}^{weighted}
|
| 652 |
+
=
|
| 653 |
+
\sum_i
|
| 654 |
+
w_i \cdot
|
| 655 |
+
\text{CE}(P(i \mid video,audio,text), y_i)
|
| 656 |
+
\]
|
| 657 |
+
|
| 658 |
+
- positive samples: \(w_i=1\);
|
| 659 |
+
- Null samples: \(w_i=w_{null}\), controlled by the curriculum;
|
| 660 |
+
- \(\mathcal{L}_{mask}\): BCE + Dice, computed only for samples that are neither Null nor proposal misses;
|
| 661 |
+
- \(\mathcal{L}_{rank}\): hard negative ranking loss.
|
| 662 |
+
|
| 663 |
+
Hard negative ranking:
|
| 664 |
+
|
| 665 |
+
\[
|
| 666 |
+
\mathcal{L}_{rank}
|
| 667 |
+
=
|
| 668 |
+
\sum_{j\in\mathcal{N}}
|
| 669 |
+
\max(0,\Delta-s_{i^*}+s_j)
|
| 670 |
+
\]
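
The selection part of the default loss can be sketched in plain Python for one sample: a softmax cross-entropy over candidate scores (with the null tube as one candidate), weighted by \(w_i\), plus the hinge ranking term. The mask term \(\mathcal{L}_{mask}\) is omitted, the margin and \(\lambda_r\) defaults are placeholder values, and all function names are hypothetical.

```python
import math
from typing import Sequence


def softmax_ce(scores: Sequence[float], target: int) -> float:
    """Cross-entropy of the softmax over candidate scores for one sample."""
    peak = max(scores)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_z - scores[target]


def rank_loss(scores: Sequence[float], pos: int,
              hard_negatives: Sequence[int], margin: float) -> float:
    """Sum over hard negatives j of max(0, margin - s_pos + s_j)."""
    return sum(max(0.0, margin - scores[pos] + scores[j])
               for j in hard_negatives)


def selection_loss(scores: Sequence[float], target: int, is_null: bool,
                   w_null: float, hard_negatives: Sequence[int] = (),
                   margin: float = 0.2, lambda_r: float = 1.0) -> float:
    """Weighted tube CE plus ranking term; mask loss is not included."""
    weight = w_null if is_null else 1.0
    loss = weight * softmax_ce(scores, target)
    if not is_null:
        loss += lambda_r * rank_loss(scores, target, hard_negatives, margin)
    return loss
```

Applying the ranking term only to non-Null samples is a choice made for this sketch; the plan does not state whether Null samples also receive it.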
|
| 671 |
+
|
| 672 |
+
#### 6.2.1 Optional auxiliary term \(\mathcal{L}_{cond}\)
|
| 673 |
+
|
| 674 |
+
If attention supervision is used in the experiments, \(\mathcal{L}_{cond}\) must be defined and ablated separately; it must not appear as a dangling term in the default loss.
|
| 675 |
+
|
| 676 |
+
Optional definition:
|
| 677 |
+
|
| 678 |
+
\[
|
| 679 |
+
\mathcal{L}_{cond}
|
| 680 |
+
=
|
| 681 |
+
-
|
| 682 |
+
\sum_{t,l}
|
| 683 |
+
\bar{M}_{t,l}
|
| 684 |
+
\log A_{t,l}
|
| 685 |
+
\]
|
| 686 |
+
|
| 687 |
+
where:
|
| 688 |
+
|
| 689 |
+
- \(A_{t,l}\): CondQFormer attention on the \(l\)-th patch / region of frame \(t\);
|
| 690 |
+
- \(\bar{M}_{t,l}\): the normalized GT mask or matched proposal mask;
|
| 691 |
+
- this term is applied only to samples with reliable GT spatial supervision.
|
| 692 |
+
|
| 693 |
+
If this term is used, the total loss becomes:
|
| 694 |
+
|
| 695 |
+
\[
|
| 696 |
+
\mathcal{L}
|
| 697 |
+
=
|
| 698 |
+
\mathcal{L}_{tube}^{weighted}
|
| 699 |
+
+
|
| 700 |
+
\lambda_m y\mathcal{L}_{mask}
|
| 701 |
+
+
|
| 702 |
+
\lambda_r\mathcal{L}_{rank}
|
| 703 |
+
+
|
| 704 |
+
\lambda_c\mathcal{L}_{cond}
|
| 705 |
+
\]
|
| 706 |
+
|
| 707 |
+
and results are reported with / without \(\mathcal{L}_{cond}\).
|
| 708 |
+
|
| 709 |
+
---
|
| 710 |
+
|
| 711 |
+
### 6.3 Multi-expression training for CondQFormer
|
| 712 |
+
|
| 713 |
+
This is the training-level implementation that H3 requires.
|
| 714 |
+
|
| 715 |
+
Precondition: the data audit confirms that the same video or the same GT object has multiple referring expressions.
|
| 716 |
+
|
| 717 |
+
Training procedure:
|
| 718 |
+
|
| 719 |
+
1. For each multi-expression sample, generate SAM2 proposals once to obtain the shared candidate tubes \(\mathcal{O}\).
|
| 720 |
+
2. Within the same batch or gradient accumulation window, sample at least two different expressions: \(r_a, r_b\).
|
| 721 |
+
3. Build a conditioned query for each expression over the same set of tubes:
|
| 722 |
+
|
| 723 |
+
\[
|
| 724 |
+
Q_a = Q_0 + W_t e_{text}^{a} + W_a e_{audio}^{a} + W_{ta}(e_{text}^{a} \odot e_{audio}^{a})
|
| 725 |
+
\]
|
| 726 |
+
|
| 727 |
+
\[
|
| 728 |
+
Q_b = Q_0 + W_t e_{text}^{b} + W_a e_{audio}^{b} + W_{ta}(e_{text}^{b} \odot e_{audio}^{b})
|
| 729 |
+
\]
|
| 730 |
+
|
| 731 |
+
4. Obtain, respectively:
|
| 732 |
+
|
| 733 |
+
\[
|
| 734 |
+
\tilde{z}_{i}^{a} = \text{CondQFormer}(Q_a, \{f_{i,t}\}_{t=1}^{T})
|
| 735 |
+
\]
|
| 736 |
+
|
| 737 |
+
\[
|
| 738 |
+
\tilde{z}_{i}^{b} = \text{CondQFormer}(Q_b, \{f_{i,t}\}_{t=1}^{T})
|
| 739 |
+
\]
|
| 740 |
+
|
| 741 |
+
5. Share the tube proposals, but compute the tube selection loss independently for each expression.
|
| 742 |
+
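6. If the two expressions refer to the same GT tube, both selections must be correct; \(\tilde{z}_{i}^{a}\) and \(\tilde{z}_{i}^{b}\) are not forced to be identical, because H3 requires different expressions to expose different evidence.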
|
| 743 |
+
7. If the two expressions refer to different targets, use them as inter-expression hard negatives to strengthen instance discrimination within the video.
|
| 744 |
+
|
| 745 |
+
**Implementation note: risk of gradient conflict.**
|
| 746 |
+
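When two expressions need different evidence from the same tube, for example one relies on audio-active frames and the other on spatial position, the shared CondQFormer parameters may receive conflicting gradients and training may oscillate. If loss oscillation, attention collapse, or a clear drop in positive-sample Selection Acc appears, apply the following mitigations: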
|
| 747 |
+
|
| 748 |
+
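1. place the forward / backward passes of different expressions in the same gradient accumulation window, but compute their gradients separately and then accumulate, rather than mixing them in one merged forward pass;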
|
| 749 |
+
2. early in training, prefer expression pairs with small semantic differences, e.g., both visual or both audio expressions;
|
| 750 |
+
3. once training is stable, gradually add cross-modality expression pairs, e.g., audio-expression vs spatial-expression;
|
| 751 |
+
4. log multi-expression pair types and training stability separately, so that gradient conflict is not mistaken for ineffective conditioning.
|
| 752 |
+
|
| 753 |
+
Training logs:
|
| 754 |
+
|
| 755 |
+
- proportion of multi-expression samples in a batch;
|
| 756 |
+
- number of expressions per shared proposal set;
|
| 757 |
+
- distribution of expression pair types: visual-visual, audio-audio, visual-audio, spatial-audio;
|
| 758 |
+
- comparison of results with and without multi-expression training.
|
| 759 |
+
|
| 760 |
+
If the dataset does not support multi-expression training, the claims about H3 in the paper must be toned down.
|
| 761 |
+
|
| 762 |
+
---
|
| 763 |
+
|
| 764 |
+
### 6.4 Null tube curriculum
|
| 765 |
+
|
| 766 |
+
Null tube training is unstable early on, so a curriculum is used:
|
| 767 |
+
|
| 768 |
+
| Stage | epoch | Null weight \(w_{null}\) |
|
| 769 |
+
|---|---:|---:|
|
| 770 |
+
| Warmup | 0-2 | 2.0 |
|
| 771 |
+
| Middle | 3-6 | 1.0 |
|
| 772 |
+
| Final | 7+ | 0.5 |
|
| 773 |
+
|
| 774 |
+
Null oversampling is used as well, but the target ratio must be stated explicitly.
|
| 775 |
+
|
| 776 |
+
Default settings:
|
| 777 |
+
|
| 778 |
+
- target proportion of Null samples per batch: 25%;
|
| 779 |
+
- if the natural Null proportion exceeds 25%, do not downsample; use the natural distribution;
|
| 780 |
+
- if the natural Null proportion is below 25%, make up the difference by oversampling;
|
| 781 |
+
- the Null proportion within a single batch should in principle not exceed 33%, except in dedicated sampling-ratio ablations.
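
The curriculum table and the 25% oversampling target can be sketched as two small helpers. The epoch boundaries follow the table above; the function names are hypothetical, and the 33% per-batch cap is not enforced here.

```python
import math


def null_weight(epoch: int) -> float:
    """Null weight w_null by epoch, following the curriculum table."""
    if epoch <= 2:
        return 2.0  # Warmup: epochs 0-2
    if epoch <= 6:
        return 1.0  # Middle: epochs 3-6
    return 0.5      # Final: epoch 7 onwards


def extra_null_samples(n_positive: int, n_null: int,
                       target_ratio: float = 0.25) -> int:
    """Null samples to add by oversampling to reach the target ratio.

    Returns 0 when the natural Null ratio already meets the target,
    since the plan keeps the natural distribution in that case.
    """
    total = n_positive + n_null
    if n_null >= target_ratio * total:
        return 0
    # Solve (n_null + x) / (total + x) >= target_ratio for x.
    needed = (target_ratio * total - n_null) / (1.0 - target_ratio)
    return math.ceil(needed - 1e-9)  # guard against float round-off
```

For example, 90 positive and 10 Null samples need 20 extra Null draws, giving 30 Null out of 120.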
|
| 782 |
+
|
| 783 |
+
The effect of the Null sampling ratio on the following metrics must be reported:
|
| 784 |
+
|
| 785 |
+
- Null FPR;
|
| 786 |
+
- Positive FNR;
|
| 787 |
+
- Null S;
|
| 788 |
+
- Tube Selection Acc@1;
|
| 789 |
+
- rate of "GT tube Top-3 but null tube Top-1" errors.
|
| 790 |
+
|
| 791 |
+
Null sampling ratio ablation:
|
| 792 |
+
|
| 793 |
+
| Ratio | Purpose |
|
| 794 |
+
|---:|---|
|
| 795 |
+
| 0% | no oversampling baseline |
|
| 796 |
+
| 12.5% | weak oversampling |
|
| 797 |
+
| 25% | default setting |
|
| 798 |
+
| 33% | stronger oversampling |
|
| 799 |
+
| 50% | checks whether the model becomes overly conservative and over-predicts null |
|
| 800 |
+
|
| 801 |
+
---
|
| 802 |
+
|
| 803 |
+
### 6.5 Hard negative mining
|
| 804 |
+
|
| 805 |
+
Hard negative mining is introduced in stages to keep engineering dependencies manageable.
|
| 806 |
+
|
| 807 |
+
#### Milestone 2: TubeToken-Minimal stage
|
| 808 |
+
|
| 809 |
+
Use only hard negatives that do not depend on CondQFormer:
|
| 810 |
+
|
| 811 |
+
1. tubes with high IoU against the GT that are not the target;
|
| 812 |
+
2. tubes spatially close to the GT bbox / mask;
|
| 813 |
+
3. tubes whose mask-pooled visual feature is similar to the GT's;
|
| 814 |
+
4. if category labels are available, different instances of the same category.
|
| 815 |
+
|
| 816 |
+
#### Milestone 3: CondQFormer stage
|
| 817 |
+
|
| 818 |
+
Add text/audio mismatch negatives:
|
| 819 |
+
|
| 820 |
+
1. tubes that match the text but not the audio;
|
| 821 |
+
2. tubes synchronized with the audio but not matching the text;
|
| 822 |
+
3. tubes highly correlated with an audio-critical expression that are not the GT;
|
| 823 |
+
4. high-scoring wrong tubes among same-category distractors;
|
| 824 |
+
5. when different expressions in the same video refer to different targets, the target tubes of the other expressions serve as hard negatives.
|
| 825 |
+
|
| 826 |
+
## 7. Evaluation Metrics
|
| 827 |
+
|
| 828 |
+
### 7.1 Standard metrics
|
| 829 |
+
|
| 830 |
+
| Metric | Description |
|
| 831 |
+
|---|---|
|
| 832 |
+
| Seen J / F / J&F | Segmentation quality on seen categories |
|
| 833 |
+
| Unseen J / F / J&F | Generalization to unseen categories |
|
| 834 |
+
| Mix J / F / J&F | Overall performance |
|
| 835 |
+
| Null S | Empty-target performance on the Null subset |
|
| 836 |
+
|
| 837 |
+
---
|
| 838 |
+
|
| 839 |
+
### 7.2 TubeToken-specific metrics
|
| 840 |
+
|
| 841 |
+
| Metric | Description |
|
| 842 |
+
|---|---|
|
| 843 |
+
| Recall@N | Whether the proposal stage covers the GT |
|
| 844 |
+
| Oracle Tube J/F | Proposal upper bound |
|
| 845 |
+
| Oracle Refined J/F | Upper bound of proposal + bbox-only refinement |
|
| 846 |
+
| Tube Selection Acc@1 | When the GT tube is covered, whether the Top-1 prediction is the matched GT tube |
|
| 847 |
+
| Tube Selection Acc@3 | Whether the matched GT tube is in the Top-3 |
|
| 848 |
+
| GT Top-3 but Null Top-1 Rate | Proportion of cases where the GT tube is in the Top-3 but the null tube ranks first |
|
| 849 |
+
| Null Accuracy | Whether the null tube is correctly selected |
|
| 850 |
+
| Null FPR | Proportion of Null videos in which a non-null tube is wrongly selected |
|
| 851 |
+
| Positive FNR | Proportion of positive videos in which the null tube is wrongly selected |
|
| 852 |
+
| Existence AUC | Discriminative power of \(p_{exist}=1-P(null)\) |
|
| 853 |
+
| Reliability Diagram / ECE | Whether the existence probability is calibrated |
|
| 854 |
+
| Refinement Gain | J/F improvement from SAM refinement |
|
| 855 |
+
| Latency / FPS / Memory | Efficiency metrics |
|
| 856 |
+
| \(AC\) | Whether attention mass concentrates on the GT region / GT tube |
|
| 857 |
+
| \(\widehat{AC}_{tube}\) | Normalized tube-level AC, defined as \(N\cdot AC_{tube}\), for comparison across different N |
|
| 858 |
+
| H3 Cosine Similarity Gap | Difference in \(\tilde{z}_i\) similarity between the conditioned and fixed Q-Former for the same tube under different expressions |
|
| 859 |
+
|
| 860 |
+
**Definition of Tube Selection Acc:**
|
| 861 |
+
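Among samples whose GT tube is covered by the proposals, the proportion for which the selector's Top-1 prediction is the matched GT tube. Proposal-miss samples are excluded from this metric but must be reported separately.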
|
| 862 |
+
|
| 863 |
+
**Null handling in Selection Acc@3:**
|
| 864 |
+
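When evaluating object-level Top-3 on positive samples, first remove the null tube from the candidate ranking, then check whether the matched GT tube is in the Top-3. Otherwise a case where the null tube ranks 2nd and the GT tube ranks 3rd would be miscounted as an object selection success. Cases related to null calibration are reported separately as the **GT Top-3 but Null Top-1 Rate**, which is computed on the full ranking including the null tube.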
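
The two rankings can be sketched as follows: object-level Top-k drops the null tube before ranking, while the GT Top-3 but Null Top-1 check uses the full ranking. The score list, index arguments, and function names are hypothetical.

```python
from typing import List, Sequence


def ranking(scores: Sequence[float], exclude: Sequence[int] = ()) -> List[int]:
    """Candidate indices sorted by descending score, minus excluded ones."""
    kept = [i for i in range(len(scores)) if i not in exclude]
    return sorted(kept, key=lambda i: scores[i], reverse=True)


def object_topk_hit(scores: Sequence[float], gt_index: int,
                    null_index: int, k: int = 3) -> bool:
    """Selection Acc@k for one positive sample, with the null tube removed."""
    return gt_index in ranking(scores, exclude=[null_index])[:k]


def gt_top3_but_null_top1(scores: Sequence[float], gt_index: int,
                          null_index: int) -> bool:
    """True if null ranks first and the GT tube is in the full Top-3."""
    ranked = ranking(scores)
    return ranked[0] == null_index and gt_index in ranked[:3]
```

Averaging `object_topk_hit` over covered positive samples gives Acc@k; averaging `gt_top3_but_null_top1` gives the calibration-related rate.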
|
| 865 |
+
|
| 866 |
+
If there are fewer than 200 Null samples, the Reliability Diagram is the primary calibration analysis and ECE is only a supplementary number.
|
| 867 |
+
|
| 868 |
+
---
|
| 869 |
+
|
| 870 |
+
### 7.3 Error decomposition
|
| 871 |
+
|
| 872 |
+
Each failure sample is assigned to one of:
|
| 873 |
+
|
| 874 |
+
| Error type | Criterion |
|
| 875 |
+
|---|---|
|
| 876 |
+
| Proposal miss | No tube among the top-N candidates has GT-visible-frame mean IoU ≥ 0.5 with the GT |
|
| 877 |
+
| Selection error | The GT tube is covered, and the selected tube is a non-null tube belonging to a different object |
|
| 878 |
+
| Refinement error | The selector is correct, but the refined mask has clearly low J/F |
|
| 879 |
+
| Null false positive | A non-null tube is selected in a Null video |
|
| 880 |
+
| Null false negative | The null tube is selected in a positive video |
|
| 881 |
+
| GT tube Top-3 but Null Top-1 | In a positive sample, the matched GT tube is in the Top-3 but the null tube has the highest score |
|
| 882 |
+
|
| 883 |
+
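The last category should not simply be merged into Selection error or Null FN. It indicates that the model can identify the candidate but its existence / null calibration is off.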
|
| 884 |
+
|
| 885 |
+
**Mutually exclusive priority order:**
|
| 886 |
+
The error decomposition must place each failure sample in exactly one category so that the shares do not overlap. The default priority is:
|
| 887 |
+
|
| 888 |
+
1. Proposal miss;
|
| 889 |
+
2. Null FN with GT Top-3, i.e., in a positive sample the null tube is ranked 1st and the matched GT tube is in the object-level Top-3;
|
| 890 |
+
3. Null FN without GT Top-3;
|
| 891 |
+
4. Selection error;
|
| 892 |
+
5. Refinement error;
|
| 893 |
+
6. Null FP.
|
| 894 |
+
|
| 895 |
+
When reporting, categories 2 and 3 may be merged into a total Null FN, with GT Top-3 but Null Top-1 listed separately as the calibration subtype of Null FN.
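
The mutually exclusive assignment can be sketched as a single function that checks categories in the priority order above. The `Outcome` record and its fields are hypothetical; in particular, `refined_ok` stands for whatever J/F threshold is chosen for "clearly low" refinement quality.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    is_null_video: bool
    proposal_covered: bool = True     # some tube reaches IoU_tube >= 0.5
    top1_is_null: bool = False        # selector's Top-1 is the null tube
    top1_is_gt: bool = False          # selector's Top-1 is the matched GT tube
    gt_in_object_top3: bool = False   # GT tube in Top-3 with null removed
    refined_ok: bool = True           # refined mask J/F above the threshold


def classify_failure(s: Outcome) -> Optional[str]:
    """Return exactly one error category, or None if the sample succeeded."""
    if s.is_null_video:
        # Null videos can only fail as Null FP (priority 6).
        return None if s.top1_is_null else "null_fp"
    if not s.proposal_covered:
        return "proposal_miss"                      # priority 1
    if s.top1_is_null:
        if s.gt_in_object_top3:
            return "null_fn_gt_top3"                # priority 2
        return "null_fn_no_gt_top3"                 # priority 3
    if not s.top1_is_gt:
        return "selection_error"                    # priority 4
    if not s.refined_ok:
        return "refinement_error"                   # priority 5
    return None
```

Since Null and positive videos are disjoint, checking Null FP first does not change the priority among the positive-sample categories.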
|
| 896 |
+
|
| 897 |
+
This analysis must be reported separately on the Seen, Unseen, Null, late-target, same-category distractor, and audio-critical subsets.
|
| 898 |
+
|
| 899 |
+
## 8. Diagnostic Experiments
|
| 900 |
+
|
| 901 |
+
### 8.1 Is conditioning actually effective?
|
| 902 |
+
|
| 903 |
+
v3 splits the conditioning diagnosis into two levels:
|
| 904 |
+
|
| 905 |
+
1. **Correctness level**: whether the model attends to the correct GT region / GT tube. Measured by AC and \(\widehat{AC}_{tube}\).
|
| 906 |
+
2. **Expression-sensitivity level**: whether the same tube yields different evidence summaries under different referring expressions. Measured by H3 direct validation.
|
| 907 |
+
|
| 908 |
+
These two levels must not be conflated. A high AC only shows that the model attends to the correct object; it does not directly prove H3.
|
| 909 |
+
|
| 910 |
+
#### 8.1.1 H3 direct validation: representation differences for the same tube under different expressions
|
| 911 |
+
|
| 912 |
+
Applicable subset: 3.2.6 Multi-expression H3 subset.
|
| 913 |
+
|
| 914 |
+
Experimental setup:
|
| 915 |
+
|
| 916 |
+
1. generate the shared candidate tubes once per video;
|
| 917 |
+
2. find the matched GT tube \(o_{i^*}\);
|
| 918 |
+
3. run the fixed Q-Former and the conditioned Q-Former separately on two expressions \(r_a,r_b\) of the same video;
|
| 919 |
+
4. record the output representations of the same tube: \(\tilde{z}_{i^*}^{a}\) and \(\tilde{z}_{i^*}^{b}\).
|
| 920 |
+
|
| 921 |
+
Metric:
|
| 922 |
+
|
| 923 |
+
\[
|
| 924 |
+
\text{CosSim}_{same\ tube}
|
| 925 |
+
=
|
| 926 |
+
\cos(\tilde{z}_{i^*}^{a},\tilde{z}_{i^*}^{b})
|
| 927 |
+
\]
|
| 928 |
+
|
| 929 |
+
Report:
|
| 930 |
+
|
| 931 |
+
| Model | Same-tube cross-expression CosSim | Selection Acc@1 | H3 interpretation |
|
| 932 |
+
|---|---:|---:|---|
|
| 933 |
+
| Fixed Q-Former | 1.0 | | Independent of the expression; deterministic identity baseline |
|
| 934 |
+
| Text-conditioned | | | Whether text differences change the tube summary |
|
| 935 |
+
| Audio-conditioned | | | Whether audio differences change the tube summary |
|
| 936 |
+
| Text+Audio-conditioned | | | Whether full conditioning yields the largest difference |
|
| 937 |
+
|
| 938 |
+
Expected results:
|
| 939 |
+
|
| 940 |
+
- the cross-expression CosSim of the fixed Q-Former is \(\equiv 1.0\); this is a deterministic baseline, not an empirical approximation;
|
| 941 |
+
- the CosSim of the Text+Audio-conditioned Q-Former is significantly below 1.0;
|
| 942 |
+
- the drop in CosSim must not come at the cost of lower Selection Acc;
|
| 943 |
+
- if CosSim shows no difference but the performance gain remains, the paper should say "learned compression improves selection" rather than claim "expression-conditioned evidence summarization".
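
The same-tube cross-expression similarity can be sketched with a plain cosine helper. Lists stand in for the \(\tilde{z}_{i^*}\) vectors, the function names are hypothetical, and defining the gap as fixed minus conditioned similarity is an assumption; the plan only says the two are compared.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two tube representations."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def h3_cosine_gap(fixed_sim: float, conditioned_sim: float) -> float:
    """Gap between fixed and conditioned Q-Former similarity (assumption)."""
    return fixed_sim - conditioned_sim
```

With a fixed Q-Former both expressions produce the same vector, so its similarity is 1.0 and the gap reduces to one minus the conditioned similarity.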
|
| 944 |
+
|
| 945 |
+
---
|
| 946 |
+
|
| 947 |
+
#### 8.1.2 Attention Concentration metric
|
| 948 |
+
|
| 949 |
+
For patch-level or frame-level attention \(A\), define:
|
| 950 |
+
|
| 951 |
+
\[
|
| 952 |
+
AC
|
| 953 |
+
=
|
| 954 |
+
\frac{
|
| 955 |
+
\sum_{t,l} A_{t,l} \cdot \mathbf{1}[(t,l)\in GT]
|
| 956 |
+
}{
|
| 957 |
+
\sum_{t,l} A_{t,l}
|
| 958 |
+
}
|
| 959 |
+
\]
|
| 960 |
+
|
| 961 |
+
If attention is tube-level, the raw tube attention concentration is:
|
| 962 |
+
|
| 963 |
+
\[
|
| 964 |
+
AC_{tube}
|
| 965 |
+
=
|
| 966 |
+
\sum_i A_i \cdot \mathbf{1}[i=i^*]
|
| 967 |
+
\]
|
| 968 |
+
|
| 969 |
+
However, \(AC_{tube}\) depends on the number of candidates \(N\). To make values comparable across different N, v3 uses the normalized version:
|
| 970 |
+
|
| 971 |
+
\[
|
| 972 |
+
\widehat{AC}_{tube}=N\cdot AC_{tube}
|
| 973 |
+
\]
|
| 974 |
+
|
| 975 |
+
Here the random baseline is always 1.0, and the value is \(N\) when attention is fully concentrated on the GT tube.
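
A small sketch of the tube-level metric and its normalized form. The attention list is renormalized defensively so that it sums to 1, which the formula above takes for granted; function names are hypothetical.

```python
from typing import Sequence


def ac_tube(attention: Sequence[float], gt_index: int) -> float:
    """Share of tube-level attention mass placed on the matched GT tube."""
    total = sum(attention)
    if total <= 0.0:
        raise ValueError("attention weights must have positive mass")
    return attention[gt_index] / total


def normalized_ac_tube(attention: Sequence[float], gt_index: int) -> float:
    """N * AC_tube: 1.0 for uniform attention, N when fully on the GT tube."""
    return len(attention) * ac_tube(attention, gt_index)
```

Uniform attention over four tubes gives 1.0, and one-hot attention on the GT tube gives 4.0, matching the stated baseline and maximum.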
|
| 976 |
+
|
| 977 |
+
Compare:
|
| 978 |
+
|
| 979 |
+
- fixed Q-Former;
|
| 980 |
+
- text-conditioned;
|
| 981 |
+
- audio-conditioned;
|
| 982 |
+
- text+audio-conditioned.
|
| 983 |
+
|
| 984 |
+
Report separately for the following expression types:
|
| 985 |
+
|
| 986 |
+
1. audio-related expressions;
|
| 987 |
+
2. spatial relation expressions;
|
| 988 |
+
3. category-only expressions;
|
| 989 |
+
4. same-category distractor samples;
|
| 990 |
+
5. multi-expression H3 subset.
|
| 991 |
+
|
| 992 |
+
---
|
| 993 |
+
|
| 994 |
+
### 8.2 Audio robustness
|
| 995 |
+
|
| 996 |
+
| Experiment | Purpose |
|
| 997 |
+
|---|---|
|
| 998 |
+
| audio removed | Tests the overall contribution of the audio module |
|
| 999 |
+
| audio amplitude zeroed, temporal length preserved | Distinguishes missing audio from all-zero audio features; checks whether the model only exploits the presence/absence of audio |
|
| 1000 |
+
| audio shuffled | Tests reliance on temporal synchronization |
|
| 1001 |
+
| same-category audio swapped | Tests reliance on fine-grained audio differences |
|
| 1002 |
+
| cross-category audio swapped | Tests whether audio semantics are used, rather than mere audio presence |
|
| 1003 |
+
| audio-text conflict | Tests whether the model degrades gracefully under conflicting conditions |
|
| 1004 |
+
| strict audio-critical subset | Tests the gain on audio-critical samples |
|
| 1005 |
+
|
| 1006 |
+
**Grouping requirements for audio swapping:**
|
| 1007 |
+
|
| 1008 |
+
1. Same-category swap: e.g., replace a guitar sound with another guitar clip;
|
| 1009 |
+
2. Cross-category swap: e.g., replace a guitar sound with a dog bark or a human voice.
|
| 1010 |
+
|
| 1011 |
+
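Only if the cross-category swap causes a significant drop, and zeroed audio and removed audio show an interpretable difference, is there stronger evidence that the model really uses audio semantics.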
|
| 1012 |
+
|
| 1013 |
+
---
|
| 1014 |
+
|
| 1015 |
+
### 8.3 First-frame bias / temporal coverage
|
| 1016 |
+
|
| 1017 |
+
| Experiment | Purpose |
|
| 1018 |
+
|---|---|
|
| 1019 |
+
| late-target subset | Whether TubeToken outperforms SimToken when the target appears in the second half |
|
| 1020 |
+
| keyframe stride ablation | Analyzes the effect of keyframe coverage on performance |
|
| 1021 |
+
| partial target subset | Tests robustness when the target appears in only some frames |
|
| 1022 |
+
| target disappears subset | Tests tracking stability |
|
| 1023 |
+
| GT-visible-frame IoU vs all-frame IoU | Separates target localization quality from spurious-mask issues |
|
| 1024 |
+
|
| 1025 |
+
---
|
| 1026 |
+
|
| 1027 |
+
### 8.4 Same-category distractor
|
| 1028 |
+
|
| 1029 |
+
Report:
|
| 1030 |
+
|
| 1031 |
+
- TubeToken vs SimToken;
|
| 1032 |
+
- w/ self-attention vs w/o self-attention;
|
| 1033 |
+
- hard-negative ranking loss ablation;
|
| 1034 |
+
- Selection Acc@1 / Acc@3;
|
| 1035 |
+
- error decomposition on same-category distractor samples.
|
| 1036 |
+
|
| 1037 |
+
The key question is whether TubeToken reduces confusion between same-category instances.
|
| 1038 |
+
|
| 1039 |
+
---
|
| 1040 |
+
|
| 1041 |
+
### 8.5 Null threshold sensitivity
|
| 1042 |
+
|
| 1043 |
+
Although TubeToken uses a null tube and needs no hand-tuned mask-area threshold, it is still necessary to examine how thresholding
|
| 1044 |
+
|
| 1045 |
+
\[
|
| 1046 |
+
p_{exist}=1-P(null)
|
| 1047 |
+
\]
|
| 1048 |
+
|
| 1049 |
+
at different thresholds affects:
|
| 1050 |
+
|
| 1051 |
+
- Null FPR;
|
| 1052 |
+
- Positive FNR;
|
| 1053 |
+
- J&F;
|
| 1054 |
+
- Null S;
|
| 1055 |
+
- GT tube Top-3 but Null Top-1 Rate.
|
| 1056 |
+
|
| 1057 |
+
This shows whether the model is sensitive to the threshold.
|
| 1058 |
+
|
| 1059 |
+
Also compare:
|
| 1060 |
+
|
| 1061 |
+
1. null tube;
|
| 1062 |
+
2. binary existence head;
|
| 1063 |
+
3. mask-area threshold.
|
| 1064 |
+
|
| 1065 |
+
## 9. Efficiency and Compute-Matched Comparison
|
| 1066 |
+
|
| 1067 |
+
Reviewers will question whether TubeToken merely trades compute for performance, so efficiency and equal-compute controls must be reported proactively.
|
| 1068 |
+
|
| 1069 |
+
### 9.1 Efficiency items to report
|
| 1070 |
+
|
| 1071 |
+
| Item | Description |
|
| 1072 |
+
|---|---|
|
| 1073 |
+
| Proposal generation time | SAM2 AMG + keyframe processing, measured per video |
|
| 1074 |
+
| Tracking / propagation time | SAM2 memory propagation |
|
| 1075 |
+
| Tube selection time | conditional compression + selector, measured per expression |
|
| 1076 |
+
| SAM refinement time | bbox prompt refinement |
|
| 1077 |
+
| Total latency per video | End-to-end inference time; single-expression and multi-expression cases must be distinguished |
|
| 1078 |
+
| FPS | Video-level speed |
|
| 1079 |
+
| Peak GPU memory | GPU memory usage |
|
| 1080 |
+
| MLLM token count | Compared with SimToken |
|
| 1081 |
+
| Number of SAM/SAM2 calls | Makes compute transparent |
|
| 1082 |
+
| Candidate tube number | N=16/32/64/128 |
|
| 1083 |
+
| Keyframe stride | stride=4/8/16 |
|
| 1084 |
+
| Amortized proposal cost per expression | In multi-expression cases, SAM2 proposal generation runs once per video and is amortized over K expressions |
|
| 1085 |
+
| Per-expression incremental cost | Incremental time of CondQFormer, selector, and refinement for each expression |
|
| 1086 |
+
|
| 1087 |
+
---
|
| 1088 |
+
|
| 1089 |
+
### 9.2 Three TubeToken configurations
|
| 1090 |
+
|
| 1091 |
+
| Configuration | Default setting | Purpose |
|
| 1092 |
+
|---|---|---|
|
| 1093 |
+
| Fast | N=16, stride=16 | Close to the SimToken compute budget |
|
| 1094 |
+
| Balanced | N=32, stride=8 | Trade-off between performance and efficiency |
|
| 1095 |
+
| Accuracy | N=64 or 128, stride=4 | Best possible performance |
|
| 1096 |
+
|
| 1097 |
+
---
|
| 1098 |
+
|
| 1099 |
+
### 9.3 Compute-Matched Comparison
|
| 1100 |
+
|
| 1101 |
+
The following must be included:
|
| 1102 |
+
|
| 1103 |
+
1. **SimToken + matched compute**, fixed as multiple keyframe prompting with the same number of keyframes as TubeToken-Fast;
|
| 1104 |
+
2. **SimToken + SAM2 proposals**;
|
| 1105 |
+
3. **SAM2 proposals + learned reranker (no null tube)**;
|
| 1106 |
+
4. **TubeToken-Fast**.
|
| 1107 |
+
|
| 1108 |
+
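Report the performance of these variants at comparable latency / FLOPs / number of SAM calls. The matched-compute baseline must be fixed before the experiments; it must not be picked afterwards, based on results, from candidates such as multi-scale prompting or multiple decode attempts.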
|
| 1109 |
+
|
| 1110 |
+
If TubeToken-Fast clearly outperforms SimToken + matched compute, this strongly rebuts the objection that the gain is merely compute traded for accuracy.
|
| 1111 |
+
|
| 1112 |
+
### 9.4 Proposal Amortization in the Multi-Expression Setting
|
| 1113 |
+
|
| 1114 |
+
If a video has \(K\) referring expressions, TubeToken's inference cost should be decomposed as:
|
| 1115 |
+
|
| 1116 |
+
\[
|
| 1117 |
+
C_{video}
|
| 1118 |
+
=
|
| 1119 |
+
C_{proposal}^{video}
|
| 1120 |
+
+
|
| 1121 |
+
K\cdot(C_{cond}^{expr}+C_{select}^{expr}+C_{refine}^{expr})
|
| 1122 |
+
\]
|
| 1123 |
+
|
| 1124 |
+
Here \(C_{proposal}^{video}\) is the one-time per-video cost of SAM2 AMG + propagation and must not be counted \(K\) times. We therefore additionally report:
|
| 1125 |
+
|
| 1126 |
+
| Metric | Definition |
|
| 1127 |
+
|---|---|
|
| 1128 |
+
| Proposal cost per video | One-time cost of generating candidate tubes for a video |
|
| 1129 |
+
| Amortized proposal cost per expression | \(C_{proposal}^{video}/K\) |
|
| 1130 |
+
| Incremental expression cost | Per-expression cost of CondQFormer + selector + refinement |
|
| 1131 |
+
| Total cost for K expressions | \(C_{proposal}^{video}+K\cdot C_{expr}\), where \(C_{expr}=C_{cond}^{expr}+C_{select}^{expr}+C_{refine}^{expr}\) |
|
| 1132 |
+
|
| 1133 |
+
This prevents reviewers from assuming that TubeToken reruns SAM2 proposals for every expression, and it also shows TubeToken's potential efficiency advantage on multi-expression videos.
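The cost decomposition above can be sketched as follows. The function names and the timing values in the usage note are hypothetical placeholders, not measurements; only the formula \(C_{proposal}^{video}+K\cdot C_{expr}\) comes from the plan.

```python
def total_cost(c_proposal_video: float, c_expr: float, k: int) -> float:
    """Total cost for K expressions: proposals run once, the rest runs per expression."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return c_proposal_video + k * c_expr


def amortized_cost_per_expression(c_proposal_video: float, c_expr: float, k: int) -> float:
    """Amortized proposal cost (C_proposal / K) plus the incremental expression cost."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return c_proposal_video / k + c_expr


def naive_cost(c_proposal_video: float, c_expr: float, k: int) -> float:
    """The incorrect accounting: proposals re-generated for every expression."""
    return k * (c_proposal_video + c_expr)
```

For a hypothetical video with a 12.0 s proposal cost, a 0.5 s incremental cost, and K = 6, `total_cost` gives 15.0 s, while the naive accounting would charge 75.0 s.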
|
| 1134 |
+
|
| 1135 |
+
---
|
| 1136 |
+
|
| 1137 |
+
## 10. Main Table Design
|
| 1138 |
+
|
| 1139 |
+
### 10.1 Main comparison table
|
| 1140 |
+
|
| 1141 |
+
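The main table keeps only public baselines, the reproduced main baseline, and the main TubeToken configurations, so that it is not bloated by fairness-control variants. Fairness controls go into Section 10.2.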
|
| 1142 |
+
|
| 1143 |
+
| Method | Seen J&F | Unseen J&F | Mix J&F | Null S | FPS | Memory |
|
| 1144 |
+
|---|---:|---:|---:|---:|---:|---:|
|
| 1145 |
+
| EEMC | | | | | | |
|
| 1146 |
+
| TSAM | | | | | | |
|
| 1147 |
+
| SAM2-LOVE | | | | | | |
|
| 1148 |
+
| SimToken official | | | | | | |
|
| 1149 |
+
| SimToken reproduced | | | | | | |
|
| 1150 |
+
| EC-SimToken | | | | | | |
|
| 1151 |
+
| TubeToken-Balanced | | | | | | |
|
| 1152 |
+
| TubeToken-Accuracy | | | | | | |
|
| 1153 |
+
|
| 1154 |
+
---
|
| 1155 |
+
|
| 1156 |
+
### 10.2 Fairness analysis table
|
| 1157 |
+
|
| 1158 |
+
This table addresses fairness: whether TubeToken's gains come from SAM2 proposals, learned reranking, the null tube, or extra compute.
|
| 1159 |
+
|
| 1160 |
+
| Method | Matched Proposal? | Matched Compute? | Null Modeling | Seen J&F | Unseen J&F | Mix J&F | Null S | FPS |
|
| 1161 |
+
|---|---|---|---|---:|---:|---:|---:|---:|
|
| 1162 |
+
| SimToken reproduced | No | Base | Implicit / mask output | | | | | |
|
| 1163 |
+
| SimToken + SAM2 proposals zero-param rerank | Yes | No | SimToken implicit | | | | | |
|
| 1164 |
+
| SAM2 proposals + learned reranker(no null tube) | Yes | Partial | threshold / calibrated score | | | | | |
|
| 1165 |
+
| SimToken + matched compute(multiple keyframe prompting) | No | Yes, TubeToken-Fast budget | SimToken implicit | | | | | |
|
| 1166 |
+
| TubeToken-Minimal | Yes | TubeToken-Fast/Balanced reported | learnable null tube | | | | | |
|
| 1167 |
+
| TubeToken-Fast | Yes | Yes | learnable null tube | | | | | |
|
| 1168 |
+
|
| 1169 |
+
---
|
| 1170 |
+
|
| 1171 |
+
### 10.3 Proposal analysis table
|
| 1172 |
+
|
| 1173 |
+
| Split | Recall@16 | Recall@32 | Recall@64 | Oracle Tube J&F | Oracle Refined J&F bbox-only | Proposal Miss % |
|
| 1174 |
+
|---|---:|---:|---:|---:|---:|---:|
|
| 1175 |
+
| Seen | | | | | | |
|
| 1176 |
+
| Unseen | | | | | | |
|
| 1177 |
+
| Late-target | | | | | | |
|
| 1178 |
+
| Small/occluded | | | | | | |
|
| 1179 |
+
| Audio-critical | | | | | | |
|
| 1180 |
+
| Multi-expression H3 subset | | | | | | |
|
| 1181 |
+
|
| 1182 |
+
---
|
| 1183 |
+
|
| 1184 |
+
### 10.4 Ablation table
|
| 1185 |
+
|
| 1186 |
+
| Variant | Seen J&F | Unseen J&F | Null S | Selection Acc@1 | Null FPR | GT Top-3 Null Top-1 | FPS |
|
| 1187 |
+
|---|---:|---:|---:|---:|---:|---:|---:|
|
| 1188 |
+
| Full | | | | | | | |
|
| 1189 |
+
| TubeToken-Minimal | | | | | | | |
|
| 1190 |
+
| SAM2 proposals + learned reranker(no null tube) | | | | | | | |
|
| 1191 |
+
| w/o null tube | | | | | | | |
|
| 1192 |
+
| binary existence head | | | | | | | |
|
| 1193 |
+
| mask-area threshold | | | | | | | |
|
| 1194 |
+
| fixed Q-Former | | | | | | | |
|
| 1195 |
+
| text-only cond | | | | | | | |
|
| 1196 |
+
| audio-only cond | | | | | | | |
|
| 1197 |
+
| text+audio cond | | | | | | | |
|
| 1198 |
+
| w/o multi-expression training | | | | | | | |
|
| 1199 |
+
| w/ optional \(\mathcal{L}_{cond}\) | | | | | | | |
|
| 1200 |
+
| w/o self-attn | | | | | | | |
|
| 1201 |
+
| independent scoring | | | | | | | |
|
| 1202 |
+
| w/o refinement | | | | | | | |
|
| 1203 |
+
| bbox+mask prompt | | | | | | | |
|
| 1204 |
+
|
| 1205 |
+
---
|
| 1206 |
+
|
| 1207 |
+
### 10.5 Error decomposition table
|
| 1208 |
+
|
| 1209 |
+
| Split | Proposal Miss | Selection Error | Refinement Error | Null FP | Null FN | GT Top-3 but Null Top-1 |
|
| 1210 |
+
|---|---:|---:|---:|---:|---:|---:|
|
| 1211 |
+
| Seen | | | | - | | |
|
| 1212 |
+
| Unseen | | | | - | | |
|
| 1213 |
+
| Null | - | - | - | | - | - |
|
| 1214 |
+
| Same-category | | | | - | | |
|
| 1215 |
+
| Late-target | | | | - | | |
|
| 1216 |
+
| Audio-critical | | | | - | | |
|
| 1217 |
+
| Multi-expression H3 subset | | | | - | | |
|
| 1218 |
+
|
| 1219 |
+
Note: Late-target, Same-category, and Audio-critical are normally positive-only subsets, so Null FP does not apply and is marked “-”; if a subset definition includes Null samples, split it into positive / null rows.
|
| 1220 |
+
|
| 1221 |
+
---
|
| 1222 |
+
|
| 1223 |
+
### 10.6 Conditioning analysis table
|
| 1224 |
+
|
| 1225 |
+
| Model | Overall \(\widehat{AC}_{tube}\) | Audio-expression \(\widehat{AC}_{tube}\) | Spatial-expression \(\widehat{AC}_{tube}\) | Same-category \(\widehat{AC}_{tube}\) | Cross-expression CosSim | Selection Acc@1 |
|
| 1226 |
+
|---|---:|---:|---:|---:|---:|---:|
|
| 1227 |
+
| Fixed Q-Former | | | | | | |
|
| 1228 |
+
| Text-conditioned | | | | | | |
|
| 1229 |
+
| Audio-conditioned | | | | | | |
|
| 1230 |
+
| Text+Audio-conditioned | | | | | | |
|
| 1231 |
+
|
| 1232 |
+
## 11. Visualization Plan
|
| 1233 |
+
|
| 1234 |
+
### 11.1 Required Visualizations
|
| 1235 |
+
|
| 1236 |
+
1. **Tube selection visualization**
|
| 1237 |
+
Show the top-5 candidate tubes, selector scores, and the final selection.
|
| 1238 |
+
|
| 1239 |
+
2. **Null case visualization**
|
| 1240 |
+
Show a case where the null tube scores highest and the output mask is empty.
|
| 1241 |
+
|
| 1242 |
+
3. **Same-category distractor**
|
| 1243 |
+
Show two similar objects where TubeToken selects the correct target tube.
|
| 1244 |
+
|
| 1245 |
+
4. **Late-target case**
|
| 1246 |
+
Show that when the target is absent from the first frame, TubeToken still finds it through tube selection.
|
| 1247 |
+
|
| 1248 |
+
5. **Conditional attention map**
|
| 1249 |
+
For the same video under different expressions, the compressor attends to different tubes / temporal segments.
|
| 1250 |
+
|
| 1251 |
+
6. **Attention Concentration visualization**
|
| 1252 |
+
Show the difference in attention mass between the fixed Q-Former and the conditioned Q-Former.
|
| 1253 |
+
|
| 1254 |
+
7. **Failure cases**
|
| 1255 |
+
Show at least three failure types: proposal miss, selection error, and refinement error.
|
| 1256 |
+
|
| 1257 |
+
---
|
| 1258 |
+
|
| 1259 |
+
### 11.2 Visualization Standard
|
| 1260 |
+
|
| 1261 |
+
Each case should include:
|
| 1262 |
+
|
| 1263 |
+
- input video keyframes;
|
| 1264 |
+
- expression;
|
| 1265 |
+
- audio waveform or audio activity;
|
| 1266 |
+
- candidate tubes;
|
| 1267 |
+
- selection scores;
|
| 1268 |
+
- selected tube;
|
| 1269 |
+
- final mask;
|
| 1270 |
+
- GT mask;
|
| 1271 |
+
- the corresponding error category or diagnostic subset label.
|
| 1272 |
+
|
| 1273 |
+
---
|
| 1274 |
+
|
| 1275 |
+
## 12. Execution Order and Milestones
|
| 1276 |
+
|
| 1277 |
+
### Phase -1: Data Audit and SimToken Reproduction
|
| 1278 |
+
|
| 1279 |
+
Goal: confirm whether H3 has a data basis, and establish the main reference for all Go/No-Go decisions.
|
| 1280 |
+
|
| 1281 |
+
Deliverables:
|
| 1282 |
+
|
| 1283 |
+
- SimToken reproduced result;
|
| 1284 |
+
- analysis of reproduced vs official differences;
|
| 1285 |
+
- multi-expression audit;
|
| 1286 |
+
- H3 subset construction results;
|
| 1287 |
+
- Null sample ratio and batch sampling plan.
|
| 1288 |
+
|
| 1289 |
+
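The two Phase -1 tasks can start in parallel: the SimToken reproduction establishes the reference for all thresholds, and the multi-expression audit determines how strongly H3 can be claimed.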
|
| 1290 |
+
|
| 1291 |
+
Go / No-Go conditions:
|
| 1292 |
+
|
| 1293 |
+
| Phase -1 outcome | Recommendation |
|
| 1294 |
+
|---|---|
|
| 1295 |
+
| SimToken reproduction within 1.5 J&F of the official result, and mean expressions per video > 1.5 | Proceed with Phase 0 in full per the v4 plan; H3 stays a P0 direct validation |
|
| 1296 |
+
| SimToken reproduction within 1.5 J&F of the official result, but most videos have only 1 expression | Proceed with Phase 0, but demote H3 direct validation from P0 to P2 and use the fallback narrative |
|
| 1297 |
+
| SimToken reproduction gap > 1.5 J&F | Pause later experiments and locate the reproduction gap first, because all Go/No-Go thresholds depend on this reference |
|
| 1298 |
+
|
| 1299 |
+
At the end of Phase -1, state explicitly whether H3 gets strong validation, weak validation, or the fallback narrative.
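The Go / No-Go table above can be expressed as a small decision helper. This is a sketch under assumptions: the function name and return labels are hypothetical, and the vague middle case ("most videos have only 1 expression") is approximated here as a mean of at most 1.5 expressions per video.

```python
def phase_minus1_decision(repro_gap_jf: float, mean_expr_per_video: float) -> str:
    """Map the two Phase -1 measurements to a branch of the Go / No-Go table.

    repro_gap_jf: absolute J&F gap (in points) between reproduced and official SimToken.
    mean_expr_per_video: mean number of expressions per video from the data audit.
    """
    if repro_gap_jf > 1.5:
        return "pause_and_debug_reproduction"
    if mean_expr_per_video > 1.5:
        return "go_phase0_h3_as_p0"
    # Assumption: a mean of <= 1.5 stands in for "basically one expression per video".
    return "go_phase0_h3_as_p2"
```

With values of the kind recorded in the Phase -1 log (a gap of roughly 0.2 J&F on Seen and a mean of 5.724 expressions per video), this sketch returns the full-go branch.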
|
| 1300 |
+
|
| 1301 |
+
---
|
| 1302 |
+
|
| 1303 |
+
### Milestone 1: Data Audit and Proposal Recall
|
| 1304 |
+
|
| 1305 |
+
Goal: determine whether TubeToken is feasible.
|
| 1306 |
+
|
| 1307 |
+
Deliverables:
|
| 1308 |
+
|
| 1309 |
+
- dataset statistics table;
|
| 1310 |
+
- Recall@N;
|
| 1311 |
+
- Oracle Tube J/F;
|
| 1312 |
+
- Oracle Refined J/F bbox-only;
|
| 1313 |
+
- proposal miss analysis;
|
| 1314 |
+
- go / no-go decision.
|
| 1315 |
+
|
| 1316 |
+
Green-light conditions:
|
| 1317 |
+
|
| 1318 |
+
- Recall@32 ≥ 85%;
|
| 1319 |
+
- Oracle Tube J/F ≥ reproduced SimToken J/F + 5%;
|
| 1320 |
+
- Oracle Refined J/F ≥ Oracle Tube J/F + 3%;
|
| 1321 |
+
- Small / occluded subset Recall@32 ≥ 70%.
|
| 1322 |
+
|
| 1323 |
+
Yellow-light conditions:
|
| 1324 |
+
|
| 1325 |
+
- Recall@32 is 80%-85% but Oracle Tube J/F meets the green-light condition: proceed, with N=64 as the default;
|
| 1326 |
+
- Oracle Tube J/F is only ≥ SimToken + 2%, but Oracle Refined J/F ≥ SimToken + 5%: proceed, but shift the paper's focus to refinement.
|
| 1327 |
+
|
| 1328 |
+
Red-light conditions:
|
| 1329 |
+
|
| 1330 |
+
- Recall@64 < 80%;
|
| 1331 |
+
- Oracle Tube J/F ≤ reproduced SimToken J/F;
|
| 1332 |
+
- Recall@32 ≥ 85%, the gap between Oracle Refined J/F and Oracle Tube J/F is < 1%, and Oracle Tube J/F ≤ reproduced SimToken J/F + 2%;
|
| 1333 |
+
- proposals show an unacceptable systematic blind spot on small / occluded / unseen targets.
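The quantitative part of the green / yellow / red conditions above can be sketched as a gate function. Several choices here are assumptions rather than part of the plan: metrics are fractions in [0, 1], the "+5%" style margins are read as absolute points (consistent with the Phase 0 log's target of 0.7554 + 0.05), red conditions are checked first, and the qualitative "systematic blind spot" condition is left to manual review. The function name and labels are hypothetical.

```python
def milestone1_gate(
    r32: float,
    r64: float,
    small_r32: float,
    oracle_tube: float,
    oracle_refined: float,
    simtoken: float,
) -> str:
    """Classify Phase 0 proposal results; all inputs are fractions in [0, 1]."""
    # Red: low recall, or oracle tubes no better than the reproduced baseline.
    if r64 < 0.80 or oracle_tube <= simtoken:
        return "red"
    # Red: recall is fine but neither tubes nor refinement offer real headroom.
    if (r32 >= 0.85 and (oracle_refined - oracle_tube) < 0.01
            and oracle_tube <= simtoken + 0.02):
        return "red"
    # Green: all four quantitative conditions hold.
    if (r32 >= 0.85 and oracle_tube >= simtoken + 0.05
            and oracle_refined >= oracle_tube + 0.03 and small_r32 >= 0.70):
        return "green"
    # Yellow: borderline recall with strong oracle tubes.
    if 0.80 <= r32 < 0.85 and oracle_tube >= simtoken + 0.05:
        return "yellow_default_n64"
    # Yellow: modest tube headroom, but refinement closes the gap.
    if oracle_tube >= simtoken + 0.02 and oracle_refined >= simtoken + 0.05:
        return "yellow_refinement_focus"
    return "undetermined"
```

Cases that match none of the listed conditions fall through to `undetermined`, which signals that the written criteria do not cover the outcome and a manual decision is needed.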
|
| 1334 |
+
|
| 1335 |
+
---
|
| 1336 |
+
|
| 1337 |
+
### Milestone 2: TubeToken-Minimal + Fairness Controls
|
| 1338 |
+
|
| 1339 |
+
Implement the minimal version:
|
| 1340 |
+
|
| 1341 |
+
- SAM2 proposals;
|
| 1342 |
+
- tube construction;
|
| 1343 |
+
- fixed tube feature;
|
| 1344 |
+
- selector + null tube;
|
| 1345 |
+
- no conditional Q-Former;
|
| 1346 |
+
- no SAM refinement.
|
| 1347 |
+
|
| 1348 |
+
Also implement the fairness controls:
|
| 1349 |
+
|
| 1350 |
+
1. SimToken + SAM2 proposals with zero-parameter reranking;
|
| 1351 |
+
2. SAM2 proposals + learned reranker(no null tube);
|
| 1352 |
+
3. SimToken + matched compute;
|
| 1353 |
+
4. w/o null tube + mask-area threshold。
|
| 1354 |
+
|
| 1355 |
+
Goal: verify that object tube selection beats the global token baseline, and rule out the explanations “SAM2 proposals are simply stronger” and “it simply uses more compute”.
|
| 1356 |
+
|
| 1357 |
+
Green-light conditions:
|
| 1358 |
+
|
| 1359 |
+
- TubeToken-Minimal beats reproduced SimToken by ≥ 2% J&F on both Seen and Unseen;
|
| 1360 |
+
- TubeToken-Minimal beats SimToken + SAM2 proposals;
|
| 1361 |
+
- TubeToken-Minimal Null S ≤ SimToken Null S × 1.5;
|
| 1362 |
+
- Tube Selection Acc@1 ≥ 70%.
|
| 1363 |
+
|
| 1364 |
+
Yellow-light conditions:
|
| 1365 |
+
|
| 1366 |
+
- TubeToken-Minimal beats SimToken but not SimToken + SAM2 proposals: proposals dominate the gain; strengthen the selector or adjust the paper narrative;
|
| 1367 |
+
- TubeToken-Minimal beats SimToken only on the Null subset and ties on Seen / Unseen: continue to Milestone 3, but do not present Minimal as a main contribution.
|
| 1368 |
+
|
| 1369 |
+
Red-light conditions:
|
| 1370 |
+
|
| 1371 |
+
- TubeToken-Minimal beats neither SimToken on Seen / Unseen nor SimToken + SAM2 proposals: redesign the selector or fall back to EC-SimToken.
|
| 1372 |
+
|
| 1373 |
+
---
|
| 1374 |
+
|
| 1375 |
+
### Milestone 3: Add Conditional Compression
|
| 1376 |
+
|
| 1377 |
+
Implement:
|
| 1378 |
+
|
| 1379 |
+
- fixed Q-Former;
|
| 1380 |
+
- text-conditioned Q-Former;
|
| 1381 |
+
- audio-conditioned Q-Former;
|
| 1382 |
+
- text+audio-conditioned Q-Former;
|
| 1383 |
+
- multi-expression training;
|
| 1384 |
+
- H3 cosine similarity validation。
|
| 1385 |
+
|
| 1386 |
+
Goal: show that conditioning itself is effective, rather than the gain coming from the extra parameters of a learnable Q-Former.
|
| 1387 |
+
|
| 1388 |
+
Required deliverables:
|
| 1389 |
+
|
| 1390 |
+
- conditioning ablation;
|
| 1391 |
+
- \(\widehat{AC}_{tube}\);
|
| 1392 |
+
- H3 cross-expression CosSim;
|
| 1393 |
+
- attention visualization;
|
| 1394 |
+
- audio-critical subset results;
|
| 1395 |
+
- audio zeroed / removed / shuffled / swapped robustness.
|
| 1396 |
+
|
| 1397 |
+
Green-light conditions:
|
| 1398 |
+
|
| 1399 |
+
- Text+Audio conditioned Q-Former beats the Fixed Q-Former by ≥ 1.5% on both Seen and Unseen;
|
| 1400 |
+
- on audio-related expressions, conditioned \(\widehat{AC}_{tube}\) ≥ 1.3 × fixed;
|
| 1401 |
+
- for the same video under different expressions, the CosSim of CondQFormer's \(\tilde{z}_i\) is clearly lower than that of the Fixed Q-Former;
|
| 1402 |
+
- performance gain ≥ 2% on the strict audio-critical subset.
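The cross-expression CosSim check in the conditions above can be sketched as a mean pairwise cosine similarity over one tube's summary vectors, one vector per expression. This is an assumed formulation: the plan's exact definition (for example, how tubes and videos are aggregated) is given in an earlier section, and the function names here are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_expression_cossim(summaries: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity of one tube's summaries across expressions.

    summaries[k] is the tube summary produced under the k-th expression.
    A compressor that ignores the expression yields identical vectors, so the
    value is 1.0 up to floating-point error; a conditioned one should score lower.
    """
    pairs = list(combinations(summaries, 2))
    if not pairs:
        raise ValueError("need summaries for at least two expressions")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Averaging this value over all H3-candidate tubes would give one number per model for the Conditioning analysis table.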
|
| 1403 |
+
|
| 1404 |
+
Yellow-light conditions:
|
| 1405 |
+
|
| 1406 |
+
- CondQFormer gives a clear overall gain but the \(\widehat{AC}_{tube}\) difference is not significant: reframe the paper as learned tube compression;
|
| 1407 |
+
- Text-only is already good enough and audio conditioning adds < 0.5%: present audio conditioning as a robustness improvement, not a main contribution.
|
| 1408 |
+
|
| 1409 |
+
Red-light conditions:
|
| 1410 |
+
|
| 1411 |
+
- The gap between Fixed Q-Former and Text+Audio conditioned Q-Former is < 0.5% with no gain on any subset: conditioning is ineffective; consider CLIP visual features or fall back on the paper narrative.
|
| 1412 |
+
|
| 1413 |
+
---
|
| 1414 |
+
|
| 1415 |
+
### Milestone 4: Add SAM Refinement
|
| 1416 |
+
|
| 1417 |
+
Implement:
|
| 1418 |
+
|
| 1419 |
+
- bbox prompt refinement;
|
| 1420 |
+
- bbox + semantic prompt refinement;
|
| 1421 |
+
- bbox + mask prompt as a control.
|
| 1422 |
+
|
| 1423 |
+
Goal: demonstrate the contribution of refinement and settle the default scheme.
|
| 1424 |
+
|
| 1425 |
+
Green-light conditions:
|
| 1426 |
+
|
| 1427 |
+
- Bbox prompt refinement beats w/o refinement by ≥ 2% in J;
|
| 1428 |
+
- the gap between Oracle Refined J/F and the actual TubeToken-Full J/F is ≤ 10%;
|
| 1429 |
+
- Bbox + mask prompt is not significantly better than bbox-only.
|
| 1430 |
+
|
| 1431 |
+
Yellow-light conditions:
|
| 1432 |
+
|
| 1433 |
+
- Refinement gain < 1%: demote SAM refinement to an optional module and shift the paper's focus back to tube selection.
|
| 1434 |
+
|
| 1435 |
+
Red-light conditions:
|
| 1436 |
+
|
| 1437 |
+
- Bbox + mask prompt is significantly better than bbox-only, and the gap comes from the mask prompt's GT-quality dependency: proposal mask quality is insufficient; return to Milestone 1 and revise the proposals.
|
| 1438 |
+
|
| 1439 |
+
---
|
| 1440 |
+
|
| 1441 |
+
### Milestone 5: Full Experiments and Paper Analysis
|
| 1442 |
+
|
| 1443 |
+
Complete:
|
| 1444 |
+
|
| 1445 |
+
- main table;
|
| 1446 |
+
- ablations;
|
| 1447 |
+
- hard subset;
|
| 1448 |
+
- error decomposition;
|
| 1449 |
+
- efficiency;
|
| 1450 |
+
- equal-compute comparison;
|
| 1451 |
+
- visualizations;
|
| 1452 |
+
- failure case;
|
| 1453 |
+
- reliability diagram / threshold sensitivity.
|
| 1454 |
+
|
| 1455 |
+
## 13. Risks and Mitigations
|
| 1456 |
+
|
| 1457 |
+
| Risk | Severity | Mitigation |
|
| 1458 |
+
|---|---|---|
|
| 1459 |
+
| Ref-AVSBench lacks multi-expression structure | Critical | Do not present H3 as a main contribution; fall back to a learned tube compression / proposal-conditioned instance grounding narrative |
|
| 1460 |
+
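| SimToken reproduction deviates too far from the official numbers | High | First locate differences in training, inputs, and evaluation; all later Go/No-Go decisions use the reproduced numbers |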
|
| 1461 |
+
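| Gradient conflicts in multi-expression training | Medium-high | Use gradient accumulation to accumulate the gradients of different expressions separately; early on, sample expression pairs with small semantic differences, and introduce cross-modality pairs once training is stable |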
|
| 1462 |
+
| The SimToken + matched compute implementation is challenged | High | Fix it before the experiments as multiple keyframe prompting with the TubeToken-Fast keyframe budget, leaving no room for post-hoc selection |
|
| 1463 |
+
| Multi-expression efficiency is misread as rerunning proposals for every expression | Medium | Report proposal per-video cost, amortized proposal cost per expression, and incremental expression cost |
|
| 1464 |
+
|
| 1465 |
+
| Recall@32 below 80% | Critical | Increase the number of proposals, add a detector, use a hybrid fallback |
|
| 1466 |
+
| Oracle Tube J/F not above reproduced SimToken | Critical | Pause the TubeToken main line; revise refinement, high-resolution features, or the proposal method, or fall back to EC-SimToken |
|
| 1467 |
+
| Oracle Refined J/F is defined unfairly | High | Fix it as oracle proposal bbox-only, with no GT mask prompt |
|
| 1468 |
+
| The SimToken + SAM2 proposals control is too weak | High | Use zero-parameter \(F_{seg}\) reranking and publish the formula |
|
| 1469 |
+
| TubeToken-Minimal beats SimToken but not SimToken + SAM2 proposals | High | Proposals are the main source of the gain; strengthen the tube selector or adjust the paper narrative |
|
| 1470 |
+
| Small gap between the learned reranker and TubeToken-Minimal | Medium-high | The null tube contributes little; downgrade Null-related claims |
|
| 1471 |
+
| \(\mathcal{L}_{cond}\) is ill-defined | Medium-high | Remove by default; if used, define it separately and run a with/without ablation |
|
| 1472 |
+
| Null tube is unstable | Medium-high | 25% Null oversampling + weighted CE curriculum; report sensitivity to the sampling ratio |
|
| 1473 |
+
| Overly strong Null oversampling causes positives to be misjudged as Null | High | Monitor Positive FNR and GT Top-3 but Null Top-1 Rate |
|
| 1474 |
+
| Conditioning yields only a small gain | High | Strengthen the diagnostic subsets, \(\widehat{AC}_{tube}\), H3 CosSim, and the fixed Q-Former control |
|
| 1475 |
+
| No clear difference in H3 CosSim | High | Do not emphasize expression-conditioned summarization; emphasize learned compression or the selection architecture instead |
|
| 1476 |
+
| TubeToken compute cost is too high | High | Report Fast/Balanced/Accuracy alongside the matched-compute baseline |
|
| 1477 |
+
| Refinement gain is marginal | Medium | Shift focus to selection accuracy and hard cases; keep refinement as an optional module |
|
| 1478 |
+
| Self-attention contributes nothing | Low | Remove self-attention and use a simpler selector |
|
| 1479 |
+
| Attention maps are not interpretable | Medium-high | Re-diagnose with \(\widehat{AC}_{tube}\), query grouping, and H3 CosSim |
|
| 1480 |
+
| Tight engineering coupling to SAM2 | Medium | State clearly that the core contribution is tube-level text/audio selection, not proposal generation |
|
| 1481 |
+
|
| 1482 |
+
## 14. Experiment Priorities
|
| 1483 |
+
|
| 1484 |
+
### P0: Must Complete
|
| 1485 |
+
|
| 1486 |
+
1. SimToken reproduction and analysis of the gap to the official results;
|
| 1487 |
+
2. Multi-expression audit;
|
| 1488 |
+
3. Proposal Recall@N;
|
| 1489 |
+
4. Oracle Tube J/F and bbox-only Oracle Refined J/F;
|
| 1490 |
+
5. TubeToken-Minimal vs SimToken;
|
| 1491 |
+
6. TubeToken-Minimal vs SimToken + SAM2 proposals;
|
| 1492 |
+
7. SAM2 proposals + learned reranker(no null tube);
|
| 1493 |
+
8. TubeToken-Fast vs SimToken + matched compute (fixed as multiple keyframe prompting);
|
| 1494 |
+
9. Null tube ablation;
|
| 1495 |
+
10. mask-area threshold Null baseline;
|
| 1496 |
+
11. Null oversampling ratio ablation;
|
| 1497 |
+
12. fixed Q-Former vs text+audio conditioned Q-Former;
|
| 1498 |
+
13. \(\widehat{AC}_{tube}\);
|
| 1499 |
+
14. H3 cross-expression CosSim (if the multi-expression audit supports it; otherwise demote to P2);
|
| 1500 |
+
15. Error decomposition;
|
| 1501 |
+
16. GT Top-3 but Null Top-1 Rate;
|
| 1502 |
+
17. Efficiency table.
|
| 1503 |
+
|
| 1504 |
+
---
|
| 1505 |
+
|
| 1506 |
+
### P1: Strongly Recommended
|
| 1507 |
+
|
| 1508 |
+
1. late-target subset;
|
| 1509 |
+
2. strict audio-critical subset;
|
| 1510 |
+
3. same-category distractor subset;
|
| 1511 |
+
4. threshold sensitivity;
|
| 1512 |
+
5. conditioning attention visualization;
|
| 1513 |
+
6. H3 cross-expression visualization;
|
| 1514 |
+
7. self-attention ablation;
|
| 1515 |
+
8. Reliability Diagram;
|
| 1516 |
+
9. same-category vs cross-category audio swap;
|
| 1517 |
+
10. audio amplitude zeroed, temporal length preserved.
|
| 1518 |
+
|
| 1519 |
+
---
|
| 1520 |
+
|
| 1521 |
+
### P2: If Time Permits
|
| 1522 |
+
|
| 1523 |
+
1. audio shuffled;
|
| 1524 |
+
2. cross-dataset validation, e.g., AVSBench / MeViS;
|
| 1525 |
+
3. frame-level existence;
|
| 1526 |
+
4. open-vocabulary detector assisted proposals;
|
| 1527 |
+
5. manual hard negative benchmark;
|
| 1528 |
+
6. hybrid fallback with EC-SimToken;
|
| 1529 |
+
7. optional \(\mathcal{L}_{cond}\) attention supervision.
|
| 1530 |
+
|
| 1531 |
+
## 15. Expected Paper Narrative
|
| 1532 |
+
|
| 1533 |
+
### 15.1 Primary Narrative: When H3 Holds
|
| 1534 |
+
|
| 1535 |
+
If the multi-expression audit, multi-expression training, H3 CosSim, and \(\widehat{AC}_{tube}\) all support H3, we suggest the following main storyline:
|
| 1536 |
+
|
| 1537 |
+
> Existing Ref-AVS methods often compress multimodal evidence into a global semantic token, implicitly coupling existence judgment, instance grounding, and frame-level segmentation. We find that this implicit coupling becomes fragile in samples requiring instance-level comparison, temporal coverage, explicit null reasoning, or expression-dependent temporal evidence. We therefore formulate Ref-AVS as text-audio conditioned object-tube retrieval followed by mask refinement. Based on this view, we propose TubeToken, which constructs candidate object tubes, summarizes each tube with expression-conditioned temporal evidence, selects the referred tube through multimodal reasoning, handles Null cases via a learnable null tube, and refines the selected tube with SAM.
|
| 1538 |
+
|
| 1539 |
+
We suggest adding data-driven motivation to the Introduction, for example:
|
| 1540 |
+
|
| 1541 |
+
- how much SimToken drops on the same-category distractor subset;
|
| 1542 |
+
- how much SimToken drops on the late-target subset;
|
| 1543 |
+
- how much performance drops on the audio-critical subset when audio is removed;
|
| 1544 |
+
- whether Null false positives concentrate in particular sample types;
|
| 1545 |
+
- the CosSim difference between the fixed and conditioned Q-Former on the H3 subset.
|
| 1546 |
+
|
| 1547 |
+
This upgrades the narrative from “we believe the global token is bad” to “diagnostic data show that the global token has systematic weaknesses”.
|
| 1548 |
+
|
| 1549 |
+
### 15.2 Fallback Narrative: When H3 Is Weak
|
| 1550 |
+
|
| 1551 |
+
If the dataset has too little multi-expression structure, or the H3 CosSim / \(\widehat{AC}_{tube}\) evidence for the conditioned Q-Former is insufficient, avoid strong claims of “expression-conditioned evidence summarization”. Use instead:
|
| 1552 |
+
|
| 1553 |
+
> We formulate Ref-AVS as proposal-conditioned instance grounding with explicit null reasoning. TubeToken improves robustness by decomposing global segmentation into candidate object tube construction, learned tube selection, null-aware existence modeling, and optional mask refinement.
|
| 1554 |
+
|
| 1555 |
+
In this case the paper's main contributions become:
|
| 1556 |
+
|
| 1557 |
+
1. candidate object tube formulation;
|
| 1558 |
+
2. explicit null tube / existence modeling;
|
| 1559 |
+
3. fairness-controlled comparison with SimToken + SAM2 proposals and matched compute;
|
| 1560 |
+
4. diagnostic error decomposition;
|
| 1561 |
+
5. optional learned compression rather than a strong conditioning claim.
|
| 1562 |
+
|
| 1563 |
+
## 16. Minimum Acceptable Conclusion Criteria
|
| 1564 |
+
|
| 1565 |
+
If the final results meet the following conditions, they can support a complete paper:
|
| 1566 |
+
|
| 1567 |
+
1. The SimToken reproduction is trustworthy, and all key comparisons use the reproduced SimToken;
|
| 1568 |
+
2. Recall@32 or Recall@64 is high enough and Oracle Tube J/F is clearly above reproduced SimToken, showing that proposals are not an unacceptable bottleneck;
|
| 1569 |
+
3. Oracle Refined J/F uses the bbox-only prompt and is clearly above Oracle Tube J/F, showing that refinement has attainable headroom;
|
| 1570 |
+
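4. On Seen / Unseen / Mix, TubeToken is no more than 2 points below SimToken; if it only ties on the main sets, it must show significant gains on the Null, late-target, same-category, and audio-critical subsets, backed by a three-way argument on efficiency, robustness, and interpretability;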
|
| 1571 |
+
5. TubeToken-Fast beats SimToken + matched compute (multiple keyframe prompting) at a comparable compute budget;
|
| 1572 |
+
6. TubeToken-Minimal beats SimToken + SAM2 proposals, showing that the tube selection framework itself is effective;
|
| 1573 |
+
7. The comparison between SAM2 proposals + learned reranker (no null tube) and TubeToken-Minimal separates the contributions of the selector and the null tube;
|
| 1574 |
+
8. The fixed Q-Former is clearly weaker than the text+audio conditioned Q-Former;
|
| 1575 |
+
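9. If H3 is claimed, all of the following must hold: the multi-expression audit supports it, multi-expression training is effective, the Fixed Q-Former CosSim is \(\equiv 1.0\) while the conditioned CosSim is significantly below 1.0, and \(\widehat{AC}_{tube}\) improves;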
|
| 1576 |
+
10. The null tube is clearly better than the mask-area threshold and the binary existence head;
|
| 1577 |
+
11. Null oversampling does not cause an unacceptable rise in Positive FNR or GT Top-3 but Null Top-1 Rate;
|
| 1578 |
+
12. The error decomposition clearly shows whether the main failures come from proposal miss, selection error, refinement error, Null FP/FN, or Null calibration;
|
| 1579 |
+
13. Compute cost may be higher, but the Fast/Balanced/Accuracy settings show a reasonable compute-performance trade-off.
|
| 1580 |
+
|
| 1581 |
+
If item 2 fails, fall back to the EC-SimToken route promptly to avoid over-investing in a low-recall TubeToken. If item 9 fails, keep the TubeToken framework but reduce the weight of CondQFormer / H3 in the paper.
|
| 1582 |
+
|
| 1583 |
+
## 17. Final Execution Recommendations
|
| 1584 |
+
|
| 1585 |
+
We recommend proceeding in this order:
|
| 1586 |
+
|
| 1587 |
+
1. **Start with Phase -1: SimToken reproduction + multi-expression audit.**
|
| 1588 |
+
This is the prerequisite for all Go/No-Go conditions and for whether the H3 narrative holds.
|
| 1589 |
+
|
| 1590 |
+
2. **Then Phase 0: proposal recall + bbox-only Oracle Tube / Refined J/F.**
|
| 1591 |
+
This is a hard prerequisite for TubeToken, and Oracle Refined J/F must match the actual refinement setting.
|
| 1592 |
+
|
| 1593 |
+
3. **Then the Milestone 2 fairness controls.**
|
| 1594 |
+
TubeToken-Minimal, SimToken + SAM2 proposals with zero-parameter reranking, SAM2 proposals + learned reranker (no null tube), and SimToken + matched compute (multiple keyframe prompting) must all be completed together.
|
| 1595 |
+
|
| 1596 |
+
4. **Add CondQFormer only after the tube framework is confirmed effective.**
|
| 1597 |
+
If multi-expression data is sufficient, add multi-expression training and H3 CosSim at the same time; if not, do not present H3 as a main contribution.
|
| 1598 |
+
|
| 1599 |
+
5. **Add refinement last.**
|
| 1600 |
+
Refinement is a performance enhancement and should not be the sole pillar of the paper narrative. If bbox-only refinement gives little gain, demote it to an optional module.
|
| 1601 |
+
|
| 1602 |
+
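This path minimizes risk: if proposal recall or the oracle upper bound is unsatisfactory, switch back to EC-SimToken promptly; if TubeToken-Minimal already shows a clear advantage, further investment in the full TubeToken is justified; if H3 validation is insufficient, keep the tube-level retrieval contribution and revise the CondQFormer narrative.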
|
| 1603 |
+
|
| 1604 |
+
---
|
| 1605 |
+
|
| 1606 |
+
## Appendix A. Reviewer Suggestion Checklist
|
| 1607 |
+
|
| 1608 |
+
| Reviewer suggestion | Where addressed (v3) | Status |
|
| 1609 |
+
|---|---|---|
|
| 1610 |
+
| Add direct validation of H3 rather than relying on AC alone | 1.2, 3.2.6, 8.1.1, 10.5, 12 | Done |
|
| 1611 |
+
| Check the dataset's multi-expression structure | 3.1, 3.1.1, Phase -1 | Done |
|
| 1612 |
+
| CondQFormer explicitly uses multi-expression training | 6.3, 12 Milestone 3 | Done |
|
| 1613 |
+
| Go/No-Go uses reproduced SimToken, not numbers of unclear origin | 4.0, 4.4, 12 | Done |
|
| 1614 |
+
| Oracle Refined J/F uses the bbox-only prompt, not a GT mask | 4.2.1, 4.3, 10.2 | Done |
|
| 1615 |
+
| SimToken + SAM2 proposals uses zero-parameter reranking | 5.3.1 | Done |
|
| 1616 |
+
| Add SAM2 proposals + learned reranker (no null tube) | 5.1, 5.2, 5.3.2, 10.1, 10.3, 12 | Done |
|
| 1617 |
+
| Remove or define the dangling \(\mathcal{L}_{cond}\) | 6.2, 6.2.1 | Done |
|
| 1618 |
+
| Specify the Null oversampling ratio | 6.4, 14 | Done |
|
| 1619 |
+
| Add the GT Top-3 but Null Top-1 error type | 7.2, 7.3, 10.4 | Done |
|
| 1620 |
+
| Use the normalized \(\widehat{AC}_{tube}\) | 7.2, 8.1.2, 10.5 | Done |
|
| 1621 |
+
| Add the audio-amplitude-zeroed control experiment | 8.2, 14 | Done |
|
| 1622 |
+
| Fix the missing Late-target column in the error decomposition table | 10.4 | Done |
|
| 1623 |
+
| Add TubeToken-Minimal to the main table | 10.1 | Done |
|
| 1624 |
+
| Write green / yellow / red conditions for each Milestone | 12 | Done |
|
| 1625 |
+
| Add a fallback narrative | 15.2, 16, 17 | Done |
|
| 1626 |
+
| Fix a single implementation of SimToken + matched compute | 5.3.3, 9.3, 10.2, 12 | Done in v4 |
|
| 1627 |
+
| Rewrite the third Phase 0 red-light condition in terms of observable quantities | 4.4.3, 12 Milestone 1 | Done in v4 |
|
| 1628 |
+
| Fixed Q-Former CosSim baseline is exactly 1.0 | 1.2, 8.1.1, 10.6, 16 | Done in v4 |
|
| 1629 |
+
| Add the gradient-conflict risk of multi-expression training | 6.3, 13 | Done in v4 |
|
| 1630 |
+
| Slim down the main table; move fairness controls to a separate table | 10.1, 10.2 | Done in v4 |
|
| 1631 |
+
| Add multi-expression proposal amortization efficiency | 9.1, 9.4 | Done in v4 |
|
| 1632 |
+
| Selection Acc@3 excludes the null tube | 7.2 | Done in v4 |
|
| 1633 |
+
| Error decomposition uses mutually exclusive priorities | 7.3 | Done in v4 |
|
| 1634 |
+
| Phase -1 Go/No-Go specifies the SimToken reproduction and H3 audit branches | 12 Phase -1 | Done in v4 |
|
TubeToken_Phase0_Experiment_Log.md
ADDED
|
@@ -0,0 +1,284 @@
|
| 1 |
+
# TubeToken Phase -1 / Phase 0 Experiment Log
|
| 2 |
+
|
| 3 |
+
This document records the actual experiment progress, observations, and next actions for the TubeToken v4 plan.
|
| 4 |
+
|
| 5 |
+
## Phase -1 Summary
|
| 6 |
+
|
| 7 |
+
### Data Audit
|
| 8 |
+
|
| 9 |
+
Audit output:
|
| 10 |
+
|
| 11 |
+
```text
|
| 12 |
+
Expressions: 20459
|
| 13 |
+
Videos: 3574
|
| 14 |
+
Objects (vid, fid): 7461
|
| 15 |
+
Splits: val 1349, train 14113, test_s 2288, TODO 25, test_u 1656, test_n 1028
|
| 16 |
+
|
| 17 |
+
Expressions/video mean: 5.724
|
| 18 |
+
Expressions/video median: 6.0
|
| 19 |
+
Videos with >=2 expressions: 3521
|
| 20 |
+
Expressions/object mean: 2.742
|
| 21 |
+
Objects with >=2 expressions: 5836
|
| 22 |
+
H3 candidate objects: 5781
|
| 23 |
+
H3 candidate expressions: 18614
|
| 24 |
+
|
| 25 |
+
Null split expressions: 1028 (5.02%)
|
| 26 |
+
Audio-keyword expressions: 15890 (77.67%)
|
| 27 |
+
Spatial-keyword expressions: 5924 (28.96%)
|
| 28 |
+
Same-category distractor heuristic expressions: 2563 (12.53%)
|
| 29 |
+
Small-target expressions: 10037
|
| 30 |
+
Partial-target expressions: 33
|
| 31 |
+
Area-unstable expressions: 41
|
| 32 |
+
Late-target expressions: 0
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
Decision:
|
| 36 |
+
|
| 37 |
+
- Multi-expression structure is strong.
|
| 38 |
+
- H3 direct validation remains a P0 target.
|
| 39 |
+
- Null modeling is feasible but needs oversampling / curriculum because Null ratio is only about 5%.
|
| 40 |
+
- Small-target proposal recall is a major risk.
|
| 41 |
+
- Late-target subset is not useful under the current GT visibility definition.
|
| 42 |
+
|
| 43 |
+
### SimToken Reproduction
|
| 44 |
+
|
| 45 |
+
Reproduced results:
|
| 46 |
+
|
| 47 |
+
```text
|
| 48 |
+
test_seen:
|
| 49 |
+
mIoU = 0.7189123889
|
| 50 |
+
F = 0.8113823722
|
| 51 |
+
J&F = 0.7651473806
|
| 52 |
+
|
| 53 |
+
test_unseen:
|
| 54 |
+
mIoU = 0.6996124670
|
| 55 |
+
F = 0.7915967433
|
| 56 |
+
J&F = 0.7456046051
|
| 57 |
+
|
| 58 |
+
test_n:
|
| 59 |
+
S = 0.0117917573
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
Paper/report result:
|
| 63 |
+
|
| 64 |
+
```text
|
| 65 |
+
Seen: J 72.0, F 81.3, J&F 76.7
|
| 66 |
+
Unseen: J 69.8, F 79.1, J&F 74.5
|
| 67 |
+
Mix: J 70.9, F 80.2, J&F 75.6
|
| 68 |
+
Null S: 0.012
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
Decision:
|
| 72 |
+
|
| 73 |
+
- SimToken reproduction passes Phase -1.
|
| 74 |
+
- Difference from the report is far below the 1.5 J&F pause threshold.
|
| 75 |
+
- Later Go/No-Go thresholds should use reproduced SimToken as the reference.
|
| 76 |
+
|
| 77 |
+
Working Phase 0 reference:
|
| 78 |
+
|
| 79 |
+
```text
|
| 80 |
+
SimToken seen J&F = 0.7651
|
| 81 |
+
SimToken unseen J&F = 0.7456
|
| 82 |
+
Seen/unseen average = 0.7554
|
| 83 |
+
Target Oracle Tube J&F for green light ~= 0.8054
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
## Phase 0 Proposal Experiments
|
| 87 |
+
|
| 88 |
+
### Implementation Notes
|
| 89 |
+
|
| 90 |
+
Scripts added:
|
| 91 |
+
|
| 92 |
+
```text
|
| 93 |
+
tools/tubetoken/phase0_common.py
|
| 94 |
+
tools/tubetoken/generate_sam2_proposals.py
|
| 95 |
+
tools/tubetoken/evaluate_phase0_proposals.py
|
| 96 |
+
tools/tubetoken/evaluate_oracle_refine_sam2.py
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
SAM2 proposal generation uses:
|
| 100 |
+
|
| 101 |
+
- SAM2 automatic mask generation on keyframes.
|
| 102 |
+
- SAM2 video propagation to form tubes.
|
| 103 |
+
- Cache format: one `.npz` per video with `masks`, `scores`, `keyframes`, and `boxes_xyxy`.
|
| 104 |
+
|
| 105 |
+
Important implementation correction:
|
| 106 |
+
|
| 107 |
+
- Initial unidirectional propagation was invalid for Phase 0 because proposals from later keyframes were not truly propagated backward.
|
| 108 |
+
- Bidirectional propagation was added.
|
| 109 |
+
- Group-by-keyframe propagation was tested but performed slightly worse than shared-state bidirectional propagation on smoke evaluation.

### Smoke Results

#### Unidirectional Smoke, stride=8, N=128, 5 videos

Result:

```text
all: R@16=0.800, R@32=0.900, R@64=1.000, R@128=1.000, Oracle J&F=0.9577
small: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9798
test_s: R@16=0.700, R@32=0.850, R@64=1.000, R@128=1.000, Oracle J&F=0.9743
test_u: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9244
```

Interpretation:

- The code path ran end to end, but a 5-video sample is too small and the numbers are optimistic.

#### Shared-state Bidirectional Smoke, stride=8, N=64, 30 videos

Result:

```text
all: n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
audio_keyword: n=130, R@16=0.738, R@32=0.923, R@64=0.977, Oracle J&F=0.9214, miss=2.31%
h3_candidate: n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
small: n=51, R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9654, miss=0.00%
spatial_keyword: n=14, R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.9106, miss=0.00%
test_s: n=43, R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8409, miss=18.60%
test_u: n=120, R@16=0.750, R@32=0.950, R@64=1.000, Oracle J&F=0.9321, miss=0.00%
```

Interpretation:

- Bidirectional propagation fixed the behavior observed in the small smoke run.
- However, `test_s` remained much weaker than `test_u` (R@64 0.814 vs. 1.000, miss 18.60% vs. 0.00%).
- Full validation was required before making a Phase 0 decision.

#### Group-by-keyframe Bidirectional Smoke, stride=8, N=64, 30 videos

Result:

```text
all: n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
audio_keyword: n=130, R@16=0.738, R@32=0.877, R@64=0.931, Oracle J&F=0.9138, miss=6.92%
h3_candidate: n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
small: n=51, R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9695, miss=0.00%
spatial_keyword: n=14, R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.8945, miss=0.00%
test_s: n=43, R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8416, miss=18.60%
test_u: n=120, R@16=0.750, R@32=0.900, R@64=0.950, Oracle J&F=0.9241, miss=5.00%
```

Decision:

- Group-by-keyframe is worse than shared-state bidirectional for recall: overall R@32 drops from 0.883 to 0.847 and R@64 from 0.951 to 0.914, while R@16 is unchanged at 0.718.
- Use shared-state bidirectional as the current best SAM2 propagation setting.

### Full Results: stride=8, N=64

Full shared-state bidirectional result:

```text
all: n=3944, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7491, miss=24.62%
area_unstable: n=18, R@16=0.556, R@32=0.556, R@64=0.889, Oracle J&F=0.7114, miss=11.11%
audio_keyword: n=2844, R@16=0.475, R@32=0.610, R@64=0.766, Oracle J&F=0.7569, miss=23.42%
h3_candidate: n=3932, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7488, miss=24.64%
partial: n=8, R@16=0.250, R@32=0.250, R@64=1.000, Oracle J&F=0.8123, miss=0.00%
same_category: n=330, R@16=0.482, R@32=0.588, R@64=0.709, Oracle J&F=0.7261, miss=29.09%
small: n=1631, R@16=0.237, R@32=0.392, R@64=0.633, Oracle J&F=0.6367, miss=36.73%
spatial_keyword: n=965, R@16=0.331, R@32=0.476, R@64=0.658, Oracle J&F=0.6714, miss=34.20%
test_s: n=2288, R@16=0.326, R@32=0.483, R@64=0.657, Oracle J&F=0.6674, miss=34.27%
test_u: n=1656, R@16=0.665, R@32=0.755, R@64=0.887, Oracle J&F=0.8618, miss=11.29%
```

Decision:

- `stride=8, N=64` is a Phase 0 red-light configuration.
- It fails the v4 Go/No-Go criteria:
  - Overall Recall@32 is 59.7%, below 85%.
  - Overall Recall@64 is 75.4%, below 80%.
  - Small-target Recall@32 is 39.2%, far below 70%.
  - Oracle Tube J&F is 0.7491, below the `SimToken + 5` target of about 0.8054.
  - `test_s` Oracle J&F is 0.6674, far below the reproduced SimToken seen J&F of 0.7651.
- Do not proceed to TubeToken-Minimal with this proposal cache.
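
The checks in this decision can be expressed as a small script over the per-subset summary. This is a sketch, not the project's evaluation code: the thresholds are the ones listed above, the dictionary keys mirror `summary.json`, and the `phase0_failures` helper plus the use of `>=` comparisons are assumptions:

```python
def phase0_failures(summary, simtoken_seen_jf, simtoken_avg_jf, margin=0.05):
    """Return the names of the Go/No-Go checks that the summary fails."""
    checks = [
        ("overall recall@32 >= 0.85", summary["all"]["recall@32"] >= 0.85),
        ("overall recall@64 >= 0.80", summary["all"]["recall@64"] >= 0.80),
        ("small recall@32 >= 0.70", summary["small"]["recall@32"] >= 0.70),
        ("oracle J&F >= SimToken + margin",
         summary["all"]["oracle_jf"] >= simtoken_avg_jf + margin),
        ("test_s oracle J&F >= SimToken seen J&F",
         summary["test_s"]["oracle_jf"] >= simtoken_seen_jf),
    ]
    return [name for name, passed in checks if not passed]


# Rounded values from the full stride=8, N=64 run reported in this log.
summary = {
    "all": {"recall@32": 0.597, "recall@64": 0.754, "oracle_jf": 0.7491},
    "small": {"recall@32": 0.392},
    "test_s": {"oracle_jf": 0.6674},
}

failures = phase0_failures(summary, simtoken_seen_jf=0.7651, simtoken_avg_jf=0.7554)
for name in failures:
    print("FAILED:", name)
```

With the numbers from this run, every check fails, which matches the red-light decision.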

Main bottleneck:

- Proposal recall, especially for `test_s`, small targets, and spatial expressions.
- Bidirectional propagation does not solve the full-set miss problem, so the cause is likely candidate generation, ranking, or keyframe coverage rather than temporal direction alone.

## Next Experiment

### Goal

Determine whether the red-light result is caused by top-64 truncation or by missing proposals at generation time.

### Step 1: Export R@64 Miss Video List

Command:

```bash
cd /workspace/SimToken
conda activate simtoken

python - <<'PY'
import csv
from pathlib import Path

src = Path("runs/tubetoken_phase0/eval_stride8_n64_bidir/sample_metrics.csv")
out = Path("runs/tubetoken_phase0/miss_videos_r64.txt")

# Collect every video that has at least one sample missed at R@64.
vids = set()
with src.open(newline="") as f:
    for r in csv.DictReader(f):
        if r["recall@64"] != "True":
            vids.add(r["vid"])

# One video id per line, sorted for reproducibility.
out.write_text("\n".join(sorted(vids)) + "\n")
print("miss videos:", len(vids))
print("wrote:", out)
PY
```

### Step 2: Test N=128 on Miss Videos

Command:

```bash
mkdir -p runs/tubetoken_phase0/proposals_stride8_n128_miss

python tools/tubetoken/generate_sam2_proposals.py \
  --data_dir /workspace/SimToken/data \
  --out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride8_n128_miss \
  --video_list /workspace/SimToken/runs/tubetoken_phase0/miss_videos_r64.txt \
  --splits test_s,test_u \
  --sam2_repo /workspace/sam2 \
  --model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
  --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
  --stride 8 \
  --max_tubes 128 \
  --device cuda \
  --amp_dtype bf16 \
  --quiet_sam2 \
  --no_group_by_keyframe \
  2>&1 | tee runs/tubetoken_phase0/proposals_stride8_n128_miss.log
```

Evaluate:

```bash
mkdir -p runs/tubetoken_phase0/eval_stride8_n128_miss

python tools/tubetoken/evaluate_phase0_proposals.py \
  --data_dir /workspace/SimToken/data \
  --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride8_n128_miss \
  --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride8_n128_miss \
  --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
  --splits test_s,test_u \
  --video_list /workspace/SimToken/runs/tubetoken_phase0/miss_videos_r64.txt \
  --recall_ns 16,32,64,128 \
  2>&1 | tee runs/tubetoken_phase0/eval_stride8_n128_miss.log
```

Report:

```bash
cat runs/tubetoken_phase0/eval_stride8_n128_miss/report.md
```

Expected decision:

- If `R@128` on the miss videos improves strongly over `R@64`, top-64 truncation is the main cause; run full `stride=8, N=128`.
- If `R@128` remains low, candidate count is not the main issue; the next test should increase keyframe coverage with `stride=4`.
- If recall with `stride=4` also remains low, move to detector-assisted proposals or high-resolution proposal generation before TubeToken-Minimal.
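
The decision rules above can be written down as a small helper so the next step is chosen mechanically once the numbers are in. The log does not define what counts as "improves strongly" or "remains low", so `strong_gain` and `low_recall` below are placeholder thresholds, and both the `next_step` function and its final fallback branch are hypothetical:

```python
def next_step(r64_miss, r128_miss, stride4_recall=None,
              strong_gain=0.10, low_recall=0.80):
    """Pick the next Phase 0 experiment from recall on the miss videos.

    strong_gain and low_recall are placeholder thresholds, not project values.
    """
    if r128_miss - r64_miss >= strong_gain:
        return "run full stride=8, N=128"
    if stride4_recall is None:
        return "test stride=4 for denser keyframe coverage"
    if stride4_recall < low_recall:
        return "move to detector-assisted or high-resolution proposals"
    # Not covered by the log: stride=4 recovered recall, so keep that setting.
    return "continue with stride=4 proposals"


# Illustrative inputs only; these are not measured results.
print(next_step(r64_miss=0.40, r128_miss=0.70))
print(next_step(r64_miss=0.40, r128_miss=0.42))
print(next_step(r64_miss=0.40, r128_miss=0.42, stride4_recall=0.55))
```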
|
| 284 |
+
|
__pycache__/load_model.cpython-312.pyc
ADDED
|
Binary file (21.2 kB). View file
|
|
|
load_model.py
CHANGED
|
@@ -208,7 +208,7 @@ def collate_fn(batch, tokenizer=None):
|
|
| 208 |
|
| 209 |
import torch.multiprocessing as mp
|
| 210 |
if __name__ == "__main__":
|
| 211 |
-
mp.set_start_method("spawn")
|
| 212 |
set_seed(42)
|
| 213 |
tokenizer = transformers.AutoTokenizer.from_pretrained(
|
| 214 |
args.mllm,
|
|
@@ -224,14 +224,15 @@ if __name__ == "__main__":
|
|
| 224 |
print("seg_token_idx: ", seg_token_idx)
|
| 225 |
|
| 226 |
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
|
|
|
|
| 230 |
|
| 231 |
|
| 232 |
-
val_dataloader_s = DataLoader(val_dataset_s, batch_size=1, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer))
|
| 233 |
-
|
| 234 |
-
|
| 235 |
|
| 236 |
|
| 237 |
|
|
@@ -449,24 +450,25 @@ if __name__ == "__main__":
|
|
| 449 |
|
| 450 |
for batch in tqdm(dataloader, desc=f"Evaluating on Null"):
|
| 451 |
input_dict = dict_to_cuda(batch)
|
| 452 |
-
with torch.
|
| 453 |
-
|
| 454 |
-
|
| 455 |
-
|
| 456 |
-
|
| 457 |
-
|
| 458 |
-
|
| 459 |
-
|
| 460 |
-
|
| 461 |
-
|
| 462 |
-
|
| 463 |
-
|
| 464 |
-
|
| 465 |
-
|
| 466 |
-
|
| 467 |
-
|
| 468 |
-
|
| 469 |
-
|
|
|
|
| 470 |
pred_masks = output_dict["pred_masks"] # list[B]:[num_seg, T, H, W]
|
| 471 |
gt_masks = output_dict["gt_masks"] # list[B]:[num_seg, T, H, W]
|
| 472 |
for i in range(len(pred_masks)):
|
|
@@ -482,9 +484,9 @@ if __name__ == "__main__":
|
|
| 482 |
|
| 483 |
|
| 484 |
|
| 485 |
-
|
| 486 |
-
|
| 487 |
-
|
| 488 |
-
|
| 489 |
-
|
| 490 |
-
|
|
|
|
| 208 |
|
| 209 |
import torch.multiprocessing as mp
|
| 210 |
if __name__ == "__main__":
|
| 211 |
+
mp.set_start_method("spawn", force=True)
|
| 212 |
set_seed(42)
|
| 213 |
tokenizer = transformers.AutoTokenizer.from_pretrained(
|
| 214 |
args.mllm,
|
|
|
|
| 224 |
print("seg_token_idx: ", seg_token_idx)
|
| 225 |
|
| 226 |
|
| 227 |
+
eval_splits = {split.strip() for split in args.eval_splits.split(",") if split.strip()}
|
| 228 |
+
val_dataset_s = REFAVS('test_s', args, tokenizer, input_type='refer') if 'test_s' in eval_splits else None
|
| 229 |
+
val_dataset_u = REFAVS('test_u', args, tokenizer, input_type='refer') if 'test_u' in eval_splits else None
|
| 230 |
+
val_dataset_n = REFAVS('test_n', args, tokenizer, input_type='refer') if 'test_n' in eval_splits else None
|
| 231 |
|
| 232 |
|
| 233 |
+
val_dataloader_s = DataLoader(val_dataset_s, batch_size=1, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer)) if val_dataset_s is not None else None
|
| 234 |
+
val_dataloader_u = DataLoader(val_dataset_u, batch_size=1, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer)) if val_dataset_u is not None else None
|
| 235 |
+
val_dataloader_n = DataLoader(val_dataset_n, batch_size=1, shuffle=False, num_workers=0, collate_fn=partial(collate_fn, tokenizer=tokenizer)) if val_dataset_n is not None else None
|
| 236 |
|
| 237 |
|
| 238 |
|
|
|
|
| 450 |
|
| 451 |
for batch in tqdm(dataloader, desc=f"Evaluating on Null"):
|
| 452 |
input_dict = dict_to_cuda(batch)
|
| 453 |
+
with torch.cuda.amp.autocast(dtype=torch.bfloat16, enabled=True):
|
| 454 |
+
with torch.no_grad():
|
| 455 |
+
output_dict = model.forward(images=input_dict["images"],
|
| 456 |
+
images_clip=input_dict["images_clip"],
|
| 457 |
+
audio_features=input_dict["audio_feats"],
|
| 458 |
+
image_features=input_dict["image_feats"],
|
| 459 |
+
input_ids=input_dict["input_ids"],
|
| 460 |
+
labels=input_dict["labels"],
|
| 461 |
+
attention_masks=input_dict["attention_masks"],
|
| 462 |
+
masks_list=input_dict["masks"],
|
| 463 |
+
resize_list=input_dict["resizes"],
|
| 464 |
+
orgsize_list=input_dict["orgsizes"],
|
| 465 |
+
conversation_list=input_dict["convs"],
|
| 466 |
+
refs_num=input_dict["refs_num"],
|
| 467 |
+
fids=input_dict["fids"],
|
| 468 |
+
vids=input_dict["vids"],
|
| 469 |
+
contrast=args.ct_weight,
|
| 470 |
+
ref_ids=input_dict["ref_ids"],
|
| 471 |
+
inference=True)
|
| 472 |
pred_masks = output_dict["pred_masks"] # list[B]:[num_seg, T, H, W]
|
| 473 |
gt_masks = output_dict["gt_masks"] # list[B]:[num_seg, T, H, W]
|
| 474 |
for i in range(len(pred_masks)):
|
|
|
|
| 484 |
|
| 485 |
|
| 486 |
|
| 487 |
+
if val_dataloader_s is not None:
|
| 488 |
+
valuate(model, val_dataloader_s, 'test_seen')
|
| 489 |
+
if val_dataloader_u is not None:
|
| 490 |
+
valuate(model, val_dataloader_u, 'test_unseen')
|
| 491 |
+
if val_dataloader_n is not None:
|
| 492 |
+
valuate_Null(model, val_dataloader_n)
|
runs/tubetoken_phase0/eval_stride8_n64_bidir/report.md
ADDED
|
@@ -0,0 +1,12 @@
# TubeToken Phase 0 Proposal Evaluation

- all: n=3944, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7491, miss=24.62%
- area_unstable: n=18, R@16=0.556, R@32=0.556, R@64=0.889, Oracle J&F=0.7114, miss=11.11%
- audio_keyword: n=2844, R@16=0.475, R@32=0.610, R@64=0.766, Oracle J&F=0.7569, miss=23.42%
- h3_candidate: n=3932, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7488, miss=24.64%
- partial: n=8, R@16=0.250, R@32=0.250, R@64=1.000, Oracle J&F=0.8123, miss=0.00%
- same_category: n=330, R@16=0.482, R@32=0.588, R@64=0.709, Oracle J&F=0.7261, miss=29.09%
- small: n=1631, R@16=0.237, R@32=0.392, R@64=0.633, Oracle J&F=0.6367, miss=36.73%
- spatial_keyword: n=965, R@16=0.331, R@32=0.476, R@64=0.658, Oracle J&F=0.6714, miss=34.20%
- test_s: n=2288, R@16=0.326, R@32=0.483, R@64=0.657, Oracle J&F=0.6674, miss=34.27%
- test_u: n=1656, R@16=0.665, R@32=0.755, R@64=0.887, Oracle J&F=0.8618, miss=11.29%
|
runs/tubetoken_phase0/eval_stride8_n64_bidir/sample_metrics.csv
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
runs/tubetoken_phase0/eval_stride8_n64_bidir/summary.json
ADDED
|
@@ -0,0 +1,132 @@
{
  "all": {
    "count": 3944,
    "oracle_f": 0.7780505165622835,
    "oracle_iou_all": 0.7200560851848016,
    "oracle_iou_visible": 0.7204684844627691,
    "oracle_j": 0.7200560864466854,
    "oracle_jf": 0.749053301504484,
    "proposal_miss": 971,
    "proposal_miss_percent": 24.61967545638945,
    "recall@16": 0.4685598377281947,
    "recall@32": 0.5973630831643002,
    "recall@64": 0.7538032454361054
  },
  "area_unstable": {
    "count": 18,
    "oracle_f": 0.7769002698555225,
    "oracle_iou_all": 0.6459361637632052,
    "oracle_iou_visible": 0.641666577094131,
    "oracle_j": 0.6459361736374968,
    "oracle_jf": 0.7114182217465094,
    "proposal_miss": 2,
    "proposal_miss_percent": 11.11111111111111,
    "recall@16": 0.5555555555555556,
    "recall@32": 0.5555555555555556,
    "recall@64": 0.8888888888888888
  },
  "audio_keyword": {
    "count": 2844,
    "oracle_f": 0.7842819413589385,
    "oracle_iou_all": 0.7295879604172519,
    "oracle_iou_visible": 0.7293891077744052,
    "oracle_j": 0.7295879614708254,
    "oracle_jf": 0.756934951414882,
    "proposal_miss": 666,
    "proposal_miss_percent": 23.417721518987342,
    "recall@16": 0.47468354430379744,
    "recall@32": 0.610056258790436,
    "recall@64": 0.7658227848101266
  },
  "h3_candidate": {
    "count": 3932,
    "oracle_f": 0.7777484301907281,
    "oracle_iou_all": 0.7197934413123055,
    "oracle_iou_visible": 0.7202070991842038,
    "oracle_j": 0.7197934425788871,
    "oracle_jf": 0.7487709363848074,
    "proposal_miss": 969,
    "proposal_miss_percent": 24.643947100712104,
    "recall@16": 0.4687182095625636,
    "recall@32": 0.5974059003051883,
    "recall@64": 0.753560528992879
  },
  "partial": {
    "count": 8,
    "oracle_f": 0.8269168466522676,
    "oracle_iou_all": 0.7977360785007477,
    "oracle_iou_visible": 0.6766794174909592,
    "oracle_j": 0.7977360699530978,
    "oracle_jf": 0.8123264583026827,
    "proposal_miss": 0,
    "proposal_miss_percent": 0.0,
    "recall@16": 0.25,
    "recall@32": 0.25,
    "recall@64": 1.0
  },
  "same_category": {
    "count": 330,
    "oracle_f": 0.7644448335532433,
    "oracle_iou_all": 0.6878195029645943,
    "oracle_iou_visible": 0.6874837929881668,
    "oracle_j": 0.687819501173607,
    "oracle_jf": 0.7261321673634256,
    "proposal_miss": 96,
    "proposal_miss_percent": 29.09090909090909,
    "recall@16": 0.4818181818181818,
    "recall@32": 0.5878787878787879,
    "recall@64": 0.7090909090909091
  },
  "small": {
    "count": 1631,
    "oracle_f": 0.6917960314676159,
    "oracle_iou_all": 0.581673126376625,
    "oracle_iou_visible": 0.5810682598301485,
    "oracle_j": 0.5816731270979948,
    "oracle_jf": 0.6367345792828051,
    "proposal_miss": 599,
    "proposal_miss_percent": 36.72593500919681,
    "recall@16": 0.23666462293071736,
    "recall@32": 0.3917841814837523,
    "recall@64": 0.6327406499080319
  },
  "spatial_keyword": {
    "count": 965,
    "oracle_f": 0.715782608316947,
    "oracle_iou_all": 0.6269581168444804,
    "oracle_iou_visible": 0.6278982803011947,
    "oracle_j": 0.6269581183310804,
    "oracle_jf": 0.6713703633240147,
    "proposal_miss": 330,
    "proposal_miss_percent": 34.196891191709845,
    "recall@16": 0.3305699481865285,
    "recall@32": 0.47564766839378236,
    "recall@64": 0.6580310880829016
  },
  "test_s": {
    "count": 2288,
    "oracle_f": 0.7064674157375836,
    "oracle_iou_all": 0.628373636981925,
    "oracle_iou_visible": 0.6283693503971877,
    "oracle_j": 0.6283736383630024,
    "oracle_jf": 0.6674205270502909,
    "proposal_miss": 784,
    "proposal_miss_percent": 34.26573426573427,
    "recall@16": 0.32604895104895104,
    "recall@32": 0.4833916083916084,
    "recall@64": 0.6573426573426573
  },
  "test_u": {
    "count": 1656,
    "oracle_f": 0.8769527718080132,
    "oracle_iou_all": 0.8467284532332201,
    "oracle_iou_visible": 0.8477165634132825,
    "oracle_j": 0.8467284543304232,
    "oracle_jf": 0.8618406130692231,
    "proposal_miss": 187,
    "proposal_miss_percent": 11.292270531400966,
    "recall@16": 0.6654589371980676,
    "recall@32": 0.7548309178743962,
    "recall@64": 0.8870772946859904
  }
}
|
runs/tubetoken_phase0/proposals_stride8_n64_bidir/manifest.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
upload.log
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|