Add files using upload-large-folder tool

Files changed (11)

README.md +1 -19
SimToken_Setup_Upload_Download_Guide.md +182 -0
TubeToken_Experiment_Plan_v4_Final.md +1634 -0
TubeToken_Phase0_Experiment_Log.md +284 -0
__pycache__/load_model.cpython-312.pyc +0 -0
load_model.py +33 -31
runs/tubetoken_phase0/eval_stride8_n64_bidir/report.md +12 -0
runs/tubetoken_phase0/eval_stride8_n64_bidir/sample_metrics.csv +0 -0
runs/tubetoken_phase0/eval_stride8_n64_bidir/summary.json +132 -0
runs/tubetoken_phase0/proposals_stride8_n64_bidir/manifest.json +0 -0
upload.log +0 -0

README.md CHANGED

@@ -23,41 +23,30 @@ Download the official Ref-AVSBench dataset from [here](https://github.com/GeWu-L
 ### Pretrained Backbones
 Download the sam_vit_h_4b8939.pth and put it in ```./models/segment_anything```
 ### Checkpoints
 Download our pretrained **[SimToken](https://drive.google.com/file/d/1pargYfFy93rymCANuWV0nt6Lx3Ri406l/view?usp=sharing)** checkpoint.
 ### Core Requirements
 This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.
 - `numpy`, `pandas`, `matplotlib`, `opencv`
 - `einops`, `timm`
 - `sentencepiece`
 - `transformers`, `peft`
 Newer versions of `transformers` and `peft` may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., custom model/config registration).
 To avoid such compatibility issues, we recommend **not using overly recent versions** and pinning the two packages to the versions used during our development:
 - `transformers==4.30.2`
 - `peft==0.2.0`
 We also provide a complete requirements.txt for reference and easier reproduction:
 ```
 pip install -r requirements.txt
 ```
 ---
 ## 📌 Getting Started
 ### Preparation
 We recommend running the following code to pre-extract audio features and visual features compatible with SAM:
 ```
 python save_audio_feats.py --data_dir 'path/to/data'
 python save_sam_feats.py  --data_dir 'path/to/data'
 ```
 ### Train
 To train our model on Ref-AVS Bench:
 ```
@@ -68,7 +57,6 @@ python -W ignore train.py --name 'xxx' \
     --data_dir 'path/to/data'\
     --log_root 'path/to/log_root'\
     --checkpoint_root 'path/to/checkpoints_root'
 ```
 ### Test
 To test our pretrained SimToken:
@@ -79,10 +67,4 @@ python -W ignore load_model.py  --saved_model 'path/to/checkpoint.pth' \
     --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
     --data_dir 'path/to/data' \
     --visualization_root 'path/to/visualization_root'
-```

+```

SimToken_Setup_Upload_Download_Guide.md ADDED

	@@ -0,0 +1,182 @@

+# SimToken Setup, Data, Upload, and Download Guide
+This guide is for moving the SimToken workspace between rented servers.
+Assumed paths:
+```bash
+PROJECT_ROOT=/workspace/SimToken
+SAM2_ROOT=/workspace/sam2
+HF_REPO=yfan07/SimToken
+```
+## 1. Environment Setup
+```bash
+conda create -n simtoken python=3.10 -y
+conda activate simtoken
+conda install -c conda-forge ffmpeg libsndfile git git-lfs wget -y
+git lfs install
+pip install --upgrade pip setuptools wheel
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
+```
+If CUDA 12.6 wheels are unavailable, use CUDA 12.1 wheels:
+```bash
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+```
+Install SimToken dependencies:
+```bash
+pip install \
+  numpy pandas matplotlib opencv-python pillow tqdm einops timm sentencepiece \
+  transformers==4.30.2 peft==0.2.0 accelerate safetensors huggingface-hub \
+  packaging regex requests psutil gdown
+```
+Optional, only needed if regenerating audio features:
+```bash
+pip install towhee towhee.models
+```
+## 2. Repository Download
+```bash
+cd /workspace
+huggingface-cli login
+huggingface-cli download yfan07/SimToken \
+  --repo-type model \
+  --local-dir /workspace/SimToken \
+  --local-dir-use-symlinks False
+```
+## 3. Model Preparation
+### SAM for SimToken
+```bash
+mkdir -p /workspace/SimToken/models/segment_anything
+cd /workspace/SimToken/models/segment_anything
+wget -O sam_vit_h_4b8939.pth \
+  https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
+```
+### SimToken Checkpoint
+```bash
+mkdir -p /workspace/SimToken/checkpoints
+gdown 'https://drive.google.com/uc?id=1pargYfFy93rymCANuWV0nt6Lx3Ri406l' \
+  -O /workspace/SimToken/checkpoints/simtoken_pretrained.pth
+```
+### Hugging Face Models
+```bash
+mkdir -p /workspace/hf_models
+huggingface-cli download openai/clip-vit-large-patch14 \
+  --local-dir /workspace/hf_models/clip-vit-large-patch14 \
+  --local-dir-use-symlinks False
+huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5 \
+  --local-dir /workspace/hf_models/Chat-UniVi-7B-v1.5 \
+  --local-dir-use-symlinks False
+```
+### SAM2 for TubeToken Proposals
+Put SAM2 under `/workspace/sam2`:
+```bash
+cd /workspace
+git clone https://github.com/facebookresearch/sam2.git
+cd /workspace/sam2
+pip install -e .
+```
+Download SAM2.1 checkpoints:
+```bash
+cd /workspace/sam2/checkpoints
+bash download_ckpts.sh
+```
+The TubeToken Phase 0 commands use:
+```text
+/workspace/sam2/checkpoints/sam2.1_hiera_large.pt
+/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
+```
+## 4. Dataset Preparation
+Runtime layout:
+```text
+/workspace/SimToken/data
+  metadata.csv
+  media/
+  gt_mask/
+  audio_embed/
+  image_embed/
+```
+Package the four data directories:
+```bash
+cd /workspace/SimToken/data
+tar -cf media.tar media
+tar -czf gt_mask.tar.gz gt_mask
+tar -czf audio_embed.tar.gz audio_embed
+tar -cf image_embed.tar image_embed
+```
+Restore the four data directories:
+```bash
+cd /workspace/SimToken/data
+tar -xf media.tar
+tar -xzf gt_mask.tar.gz
+tar -xzf audio_embed.tar.gz
+tar -xf image_embed.tar
+```
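After restoring the archives, it can help to confirm that the runtime layout listed above is complete before launching any job. The following is a minimal sketch, not part of the repository: the helper name `missing_data_entries` is hypothetical, and only the directory and file names come from the layout above.

```python
from pathlib import Path

# Entry names come from the runtime layout in this guide;
# the helper itself is a hypothetical convenience, not repository code.
REQUIRED_DIRS = ("media", "gt_mask", "audio_embed", "image_embed")
REQUIRED_FILES = ("metadata.csv",)


def missing_data_entries(data_root):
    """Return the required directories/files that are absent under data_root."""
    root = Path(data_root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Path assumed from PROJECT_ROOT at the top of this guide.
    print(missing_data_entries("/workspace/SimToken/data"))
```

An empty list means all five entries are present; anything else names what still needs to be restored.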
+## 5. Upload Repository
+Use one full-directory upload command:
+```bash
+cd /workspace/SimToken
+huggingface-cli login
+huggingface-cli upload yfan07/SimToken . . \
+  --repo-type model \
+  2>&1 | tee upload.log
+```
+This uploads the entire `/workspace/SimToken` directory as it currently exists on disk, and records the output in `upload.log`.
+## 6. Current Experiment Files to Preserve
+Keep these files and directories for continuing TubeToken experiments:
+```text
+runs/tubetoken_phase_minus1/audit_full
+runs/tubetoken_phase_minus1/simtoken_eval
+runs/tubetoken_phase0/proposals_stride8_n64_bidir
+runs/tubetoken_phase0/eval_stride8_n64_bidir
+runs/tubetoken_phase0/miss_videos_r64.txt
+TubeToken_Phase0_Experiment_Log.md
+TubeToken_Experiment_Plan_v4_Final.md
+```

TubeToken_Experiment_Plan_v4_Final.md ADDED

	@@ -0,0 +1,1634 @@

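+# TubeToken Experiment Plan v4 (Final / Experiment-Ready)
+> Main thread: **TubeToken** is the core framework. **Existence / Null modeling** and **Text-Audio Conditional Compression** are natural components of TubeToken, not external patches on SimToken.
+> Goal of v4: finalize the plan before experiments start, building on v3 Reviewer-Revised. v4 fixes the implementation of the matched-compute baseline, corrects the Phase 0 red-light condition, makes the H3 CosSim baseline precise, adds the gradient-conflict risk of multi-expression training, restructures the main table and the fairness analysis table, and makes proposal amortization efficiency explicit for multi-expression scenarios.
+---
+## 0. Summary of Final Changes in v4
+This version is the final plan before experiments start. v3 already provided a complete framework for launching experiments; v4 only makes finalization-level refinements, focused on removing ambiguities that could later draw reviewer objections or force experiments to be redone.
+Compared with v3, v4 makes the following final changes:
+1. **Fix a single implementation of SimToken + matched compute**: the four candidate schemes are dropped, and the baseline is fixed as **SimToken + multiple keyframe prompting with the same number of keyframes as TubeToken-Fast**. Conceptually, this control is closest to the source of TubeToken-Fast's extra computation, and it avoids any suspicion of choosing a favorable baseline after the experiments end.
+2. **Correct the third red-light condition of Milestone 1**: judgments that cannot be observed in Phase 0, such as "TubeToken-Minimal is not expected to obtain a selection gain", are removed. The condition now relies only on quantities measurable in Phase 0: Recall@32, Oracle Tube J/F, and Oracle Refined J/F.
+3. **Make the H3 CosSim baseline of the Fixed Q-Former precise**: the Fixed Q-Former produces exactly the same output for one tube under different expressions, so its cross-expression CosSim is **identically 1.0**, not merely "close to 1". Whether the Conditioned Q-Former falls significantly below 1.0 is the direct evidence for H3.
+4. **Add the gradient-conflict risk of multi-expression training and its mitigations**: if different expressions demand contradictory temporal / audio / spatial attention on the same tube, accumulate their gradients separately via gradient accumulation, or first sample expression pairs with small semantic differences.
+5. **Restructure the main table into a format suited to top-venue papers**: the main table is trimmed to 8 rows and keeps only the major public baselines and the main TubeToken configuration. SimToken + SAM2 proposals, learned reranker, matched compute, TubeToken-Minimal, and TubeToken-Fast move to a separate Fairness Analysis Table.
+6. **Distinguish per-video and per-expression cost in the efficiency analysis**: SAM2 proposals are a one-time per-video cost. When one video has K expressions, the proposal cost is amortized across expressions; only CondQFormer and the selector are per-expression costs.
+7. **Clarify how Selection Acc@3 handles the null tube**: for positive samples, the object-level Top-3 is computed with the null tube excluded. "GT tube Top-3 but Null Top-1" is computed on the full ranking as a separate null calibration metric.
+8. **Define a mutually exclusive priority for error decomposition**: each failed sample falls into exactly one error category, judged in the order Proposal miss → Null FN with GT Top-3 → Null FN without GT Top-3 → Selection error → Refinement error → Null FP.
+9. **Update the Phase -1 Go/No-Go criteria**: SimToken reproduction and the multi-expression audit can start in parallel. If the SimToken reproduction gap exceeds 1.5 J&F, later experiments are paused. If multi-expression data is insufficient, H3 direct validation is downgraded from P0 to P2 and the fallback narrative is adopted.
+10. **Update the Appendix checklist**: all final reviewer refinement suggestions are recorded with their implementation status, forming a pre-experiment checklist.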
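+## 1. Core Research Hypotheses
+### 1.1 Task Restatement
+Referring Audio-Visual Segmentation (Ref-AVS) should not be modeled only as:
+\[
+\text{MLLM} \rightarrow \langle SEG \rangle \rightarrow \text{SAM}
+\]
+It should instead be modeled as:
+\[
+\text{Candidate Object Tubes}
+\rightarrow
+\text{Text-Audio Conditioned Tube Selection}
+\rightarrow
+\text{SAM Refinement}
+\]
+In other words, Ref-AVS is closer in nature to **object-level retrieval + mask refinement**:
+1. Which candidate object instances exist in the video?
+2. Which object instance is jointly referred to by the text and the audio?
+3. If no object satisfies the condition, can the model explicitly select Null?
+4. Can the selected object tube be further refined into a high-quality mask?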
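+### 1.2 Main Hypotheses
+**H1: An object tube is a better intermediate representation for Ref-AVS than a global `<SEG>` token.**
+A tube explicitly preserves identity consistency across frames, which reduces the risk of identity switches under multiple same-category instances, occlusion, and objects entering or leaving the frame.
+**H2: Null / Existence should be handled through explicit candidate modeling.**
+TubeToken introduces a learnable null tube, turning the Null decision into a candidate selection problem instead of relying on the SAM decoder to passively output an empty mask.
+**H3: The same candidate tube should expose different temporal evidence under different referring expressions, so the tube representation must be dynamically modulated by the text/audio condition.**
+In TubeToken, conditional compression is not a replacement for whole-video token pooling; it is **tube-level evidence summarization**. For different expressions, the same object tube may need to attend to different frames, actions, audio segments, or spatial relations.
+**Preconditions and validation requirements for H3:**
+1. At the data level, first confirm whether Ref-AVSBench contains multiple expressions that refer to the same video or the same target.
+2. If a multi-expression structure exists, training must exploit it explicitly: run forward passes with at least two different expressions for the same video / same tube, sharing proposals but using different conditional queries.
+3. Validating H3 cannot rely on AC alone. AC only shows whether the model attends to the correct region / tube; it cannot show that the same tube yields differentiated evidence summaries under different expressions.
+4. The direct validation metric for H3 is the cosine similarity of \(\tilde{z}_i\) for the same video and the same matched GT tube under different expressions. The fixed Q-Former does not depend on the expression, so its outputs for one tube under different expressions are identical and CosSim \(\equiv 1.0\). The conditioned Q-Former's similarity should be significantly below 1.0, with no drop in selection performance.
+5. If the data audit finds that each video has only one expression on average, H3 is not a main contribution, and the paper's main thread falls back to "proposal-conditioned instance grounding + explicit null reasoning".
+**H4: TubeToken's gains must be explained separately through proposal recall, oracle upper bound, selection accuracy, refinement quality, and efficiency breakdown.**
+Reporting only the final J/F/S cannot answer where the improvement comes from, nor identify whether the bottleneck lies in proposal, selection, or refinement.
+**H5: TubeToken's gains must still hold under fair compute and fair proposal conditions.**
+Controls such as SimToken + SAM2 proposals, SimToken + matched compute, and SAM2 proposals + learned reranker (no null tube) are required to rule out the explanations "SAM2 proposals are simply stronger" and "there is simply more compute".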
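+## 2. Method Version Definitions
+### 2.1 TubeToken-Full
+The full method consists of four stages.
+---
+### Stage 1: Candidate tube generation
+SAM2 automatic mask generation is run on keyframes to produce candidate masks, which are then propagated to earlier and later frames with the SAM2 tracking / memory mechanism, yielding candidate object tubes: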
+\[
+\mathcal{O} = \{o_1, o_2, \dots, o_N\}
+\]
+Each tube is:
+\[
+o_i = \{m_{i,t}, b_{i,t}, f_{i,t}\}_{t=1}^{T}
+\]
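+where:
+- \(m_{i,t}\): mask at frame \(t\);
+- \(b_{i,t}\): bbox at frame \(t\);
+- \(f_{i,t}\): mask-pooled visual feature.
+**Implementation convention**:
+By default, SAM2 AMG runs on keyframes only and SAM2 propagation is used on non-keyframes, instead of rerunning AMG on every frame. This keeps the compute of the proposal stage from becoming excessive.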
+---
+### Stage 2: Text-audio conditioned tube representation
+The text expression is encoded as \(e_{text}\) and the audio as \(e_{audio}\). The conditioned query is constructed as:
+\[
+Q = Q_0 + W_t e_{text} + W_a e_{audio} + W_{ta}(e_{text} \odot e_{audio})
+\]
+The temporal features \(\{f_{i,t}\}_{t=1}^{T}\) of each tube are then compressed under this condition:
+\[
+\tilde{z}_i = \text{CondQFormer}(Q, \{f_{i,t}\}_{t=1}^{T})
+\]
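+The goal of this module is not simply to reduce the number of tokens, but to let the same tube form different evidence summaries under different expressions.
+**v3 constraint:** if the dataset contains multi-expression samples, Stage 2 training must explicitly include, within a batch, forward passes of different expressions for the same video / same tube. Otherwise H3 can only serve as a hypothesis about inference, not as a strongly supported experimental claim.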
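The conditioned query above can be sketched in NumPy as follows. This is a minimal sketch under assumptions that the plan does not state: \(e_{text}\) and \(e_{audio}\) share one dimension so that the element-wise product is defined, the conditioning vector is broadcast over all base queries, and the function name `conditioned_query` is hypothetical.

```python
import numpy as np


def conditioned_query(q0, e_text, e_audio, w_t, w_a, w_ta):
    """Q = Q0 + W_t e_text + W_a e_audio + W_ta (e_text * e_audio).

    q0:      (num_queries, d) base queries Q_0
    e_text:  (d_c,) text embedding
    e_audio: (d_c,) audio embedding (same size as e_text, an assumption)
    w_t, w_a, w_ta: (d, d_c) projection matrices
    """
    cond = w_t @ e_text + w_a @ e_audio + w_ta @ (e_text * e_audio)
    # Broadcasting the same conditioning vector to every query is an assumption.
    return q0 + cond[None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q0 = rng.normal(size=(4, 8))
    w_t, w_a, w_ta = (rng.normal(size=(8, 6)) for _ in range(3))
    e_audio = rng.normal(size=6)
    q_a = conditioned_query(q0, rng.normal(size=6), e_audio, w_t, w_a, w_ta)
    q_b = conditioned_query(q0, rng.normal(size=6), e_audio, w_t, w_a, w_ta)
    print(q_a.shape, np.allclose(q_a, q_b))
```

Two different text embeddings with the same audio give different queries, which is the property H3 relies on; with all projections set to zero the query reduces to \(Q_0\), i.e. the fixed Q-Former case.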
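+#### 2.2 Feature Source
+Default setting:
+\[
+f_{i,t} = \text{MaskPool}(\text{SAM2ImageEncoder}(I_t), m_{i,t})
+\]
+That is, Stage 2 reuses SAM2 image encoder features and does not add a separate ViT or CLIP visual encoder. This has three benefits:
+1. proposal generation and tube representation use consistent visual features;
+2. it avoids the compute cost and fairness disputes of an extra visual encoder;
+3. the efficiency table is cleaner, which eases comparison with SimToken and SAM2-based baselines.
+Optional extension: if SAM2 encoder features are insufficiently aligned with text/audio semantics, add a lightweight projector:
+\[
+f'_{i,t} = W_v f_{i,t}
+\]
+By default, however, no additional large-scale visual-language encoder is introduced.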
+---
+### Stage 3: Tube selection with null tube
+Add a learnable null tube:
+\[
+z_{null}
+\]
+All candidate tubes are fed into the tube selector together with the null tube:
+\[
+P(i \mid video, audio, text) =
+\text{Softmax}([s_1, s_2, \dots, s_N, s_{null}])
+\]
+If \(P(null)\) is the largest, the output is an empty mask; otherwise the object tube with the highest score is selected.
+The existence probability is then naturally defined as:
+\[
+p_{exist} = 1 - P(null)
+\]
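The selection rule and the existence probability above can be illustrated with a few lines of standard-library Python. This is a sketch, not the selector itself: the scores are taken as given, the function name `select_with_null` is hypothetical, and ties are resolved in favor of the object tube, which is an assumption.

```python
import math


def select_with_null(object_scores, null_score):
    """Softmax over [s_1, ..., s_N, s_null]; return (choice, p_exist).

    choice is the index of the selected object tube, or None when the
    null tube has the highest probability (empty-mask output).
    """
    scores = list(object_scores) + [null_score]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    choice = None if best == len(probs) - 1 else best
    p_exist = 1.0 - probs[-1]  # p_exist = 1 - P(null)
    return choice, p_exist


if __name__ == "__main__":
    print(select_with_null([1.0, 3.0], 0.0))   # object tube 1 is selected
    print(select_with_null([0.0, 0.0], 5.0))   # null tube wins
```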
+#### Default Tube Selector Structure
+The default design uses:
+1. reference query \(q_{ref}=\text{MLP}([e_{text},e_{audio}])\);
+2. tube tokens \(\{\tilde{z}_i\}_{i=1}^{N}\);
+3. inter-tube self-attention;
+4. reference-conditioned cross-attention;
+5. per-tube classification head.
+Required ablations:
+- w/ inter-tube self-attention;
+- w/o inter-tube self-attention;
+- independent tube scoring, i.e. each tube is scored independently by a linear layer on \([q_{ref}; \tilde{z}_i]\).
+---
+### Stage 4: SAM refinement
+After a tube is selected, only the tube bbox is used as the box prompt by default, combined with the text/audio semantic prompt for SAM refinement:
+\[
+\hat{m}_t = \text{SAMRefine}(I_t, b_{i,t}, q_{ref})
+\]
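+By default the tube mask is not used as a mask prompt, which avoids the interpretability problem of "self-refinement". The tube mask is used only for:
+1. generating the bbox;
+2. extracting the tube feature;
+3. proposal matching;
+4. computing the oracle upper bound.
+Additional controls are required:
+- bbox-only prompt;
+- bbox + semantic prompt;
+- bbox + mask prompt.
+If bbox + mask prompt brings no clear gain, the main text adopts bbox-only or bbox + semantic prompt as the default version.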
+---
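+## 3. Data Audit and Diagnostic Subset Construction
+The data audit must be completed before formal training. This step determines whether later experiments are sufficiently convincing. v3 promoted the data audit to **Phase -1**, in which the multi-expression structure and the SimToken reproduction are prerequisites for entering Phase 0.
+### 3.1 Required Statistics
+| Item | Purpose |
+|---|---|
+| Number of referring expressions per video | Determine whether H3 can be trained and validated directly |
+| Number of expressions per GT object / tube | Build the H3 direct validation subset |
+| Whether the positive expression set \(\mathcal{P}_i\) in SimToken's alignment loss can be reused | Decide the implementation path for multi-expression training |
+| Proportion of Null samples | Gauge the training difficulty of the null tube / weighted CE |
+| Proportion of frames in which the GT target is visible | Decide whether frame-level existence is needed; if the proportion is low, it is not introduced |
+| Distribution of the target's first appearance time | Build the late-target subset and verify whether first-frame bias is mitigated |
+| Proportion of same-category multi-instance videos | Verify whether inter-tube reasoning and hard negatives are necessary |
+| Proportion of small / occluded targets | Assess proposal recall risk |
+| Proportion of audio-dependent expressions | Verify whether audio-conditioned compression has room to help |
+| Proportion of spatial-relation expressions | Verify whether a spatial/relation query is necessary |
+| Relation between proposal miss and target attributes | Analyze systematic bias of SAM2 proposals on small, occluded, and unseen-category targets |
+### 3.1.1 Decision Rules for the Multi-expression Audit
+| Audit result | Effect on H3 and CondQFormer |
+|---|---|
+| Mean number of expressions per video > 1.5, and the same GT object has multiple expressions | Proceed with H3 as planned; use multi-expression training and direct cosine validation |
+| Most videos have 1 expression, but a few have multiple | H3 becomes a diagnostic contribution; report direct validation on the multi-expression subset |
+| Essentially every video has only 1 expression | H3 is not a core claim; CondQFormer is reframed as learned tube compression / multimodal query adaptation |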
+---
+### 3.2 Diagnostic Subsets
+Construct at least the following subsets.
+#### 3.2.1 Late-target subset
+Samples whose target first becomes visible in the second half of the video.
+Definition:
+\[
+t_{first} = \min \{t \mid g_t \neq \emptyset\}
+\]
+If:
+\[
+t_{first} > 0.5T
+\]
+then the sample belongs to the late-target subset.
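The late-target rule above is simple enough to check per sample. The sketch below assumes a per-frame visibility list (whether \(g_t\) is non-empty) and 1-based frame indices as in the formulas; the helper names are hypothetical.

```python
def first_visible_frame(gt_visible):
    """Return the 1-based index of the first frame with a non-empty GT mask.

    gt_visible is a sequence of booleans, one per frame. Returns None when
    the target never appears (a Null sample).
    """
    for t, visible in enumerate(gt_visible, start=1):
        if visible:
            return t
    return None


def is_late_target(gt_visible):
    """True when t_first > 0.5 * T, following the definition above."""
    t_first = first_visible_frame(gt_visible)
    return t_first is not None and t_first > 0.5 * len(gt_visible)


if __name__ == "__main__":
    print(is_late_target([False] * 6 + [True] * 4))  # t_first = 7, T = 10
    print(is_late_target([True] * 10))               # t_first = 1
```

Null samples never qualify in this sketch because \(t_{first}\) is undefined for them, which is an assumption about how the subset is meant to be built.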
+---
+#### 3.2.2 Audio-critical subset
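+v3 keeps the two-stage definition.
+**Stage A: Coarse filtering**
+Filter by text keywords:
+- sounding;
+- making sound;
+- longest sound;
+- intermittent sound;
+- silent;
+- audio;
+- heard;
+- emitting sound;
+- playing instrument, and similar.
+**Stage B: Fine filtering**
+After the w/o Audio variant is trained, samples that satisfy all of the following are placed in the strict audio-critical subset:
+1. the Full model predicts correctly or clearly exceeds the threshold;
+2. the w/o Audio model predicts incorrectly or its J/F drops significantly;
+3. the video contains at least two visual candidates, and vision alone cannot reliably identify the target.
+This avoids pseudo audio-critical samples, where the expression contains audio words but the target is visually unique and solvable without audio.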
+---
+#### 3.2.3 Same-category distractor subset
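+The video contains multiple candidate objects of the same category or of highly similar appearance, and the expression must distinguish between instances.
+Preferred data sources:
+1. the dataset's original object annotations;
+2. if no annotations are available, zero-shot object discovery with CLIP / Grounding DINO / OWL-ViT;
+3. clustering by mask-pooled CLIP similarity over SAM2 proposals to approximately identify same-category candidates.
+The construction procedure and the manual spot-check accuracy of this subset must be reported so that reviewers cannot question its reliability.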
+---
+#### 3.2.4 Null subset
+The original Null samples, further divided into:
+1. visual object exists but not referred；
+2. audio exists but no valid visual target；
+3. text refers to absent object；
+4. audio-text conflict / ambiguous null。
+---
+#### 3.2.5 Small / occluded target subset
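+Used to analyze proposal miss.
+Initial definitions:
+- small: GT mask area is below 5% of the image area;
+- heavily occluded: fewer than \(0.5T\) consecutive visible frames, or mask area fluctuates sharply over time;
+- partial target: the target appears in only some of the frames.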
+---
+#### 3.2.6 Multi-expression H3 subset
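+Used to validate H3 directly.
+Sample conditions:
+1. the same video has at least two referring expressions;
+2. these expressions refer to the same GT object / GT tube, or at least to the same target instance that can be matched reliably;
+3. the expressions differ semantically, for example in category, action, audio, spatial relation, interacting object, or temporal segment;
+4. a matched GT tube exists among the SAM2 proposals, so that proposal miss does not interfere with H3 validation.
+Items to report:
+- mean number of expressions per video;
+- mean number of expressions per GT object;
+- number of samples in the H3 subset;
+- distribution of expression difference types;
+- manual spot-check accuracy.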
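+## 4. Phase 0: Proposal Recall and Oracle Upper-Bound Pilot Experiments
+This is the go / no-go experiment for TubeToken. If proposal recall or the oracle upper bound is insufficient, TubeToken's performance ceiling is capped by the proposal stage.
+### 4.0 Phase -1 Prerequisite Baseline: SimToken Reproduction
+SimToken must be reproduced before running the Proposal Recall and Oracle upper-bound experiments.
+Requirements:
+1. Use the same data split, input resolution, audio features, training epochs, batch size, optimizer, scheduler, and evaluation script as the later TubeToken experiments.
+2. Use the SimToken J/F/S reproduced by the authors as the primary reference in all Go/No-Go conditions.
+3. The official SimToken numbers serve only as a side note; if the reproduced numbers differ from the official ones by more than 1.5 J&F, the source of the gap must be located first.
+4. State explicitly in the paper: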
+> All comparisons are conducted under the same training configuration as SimToken (reproduced), with official results cited where applicable.
+### 4.1 Setup
+- Proposal model: SAM2 automatic mask generation.
+- Keyframe strategies:
+  - stride = 4;
+  - stride = 8;
+  - stride = 16;
+  - first / middle / last + audio-peak frames;
+  - uniform + motion-peak frames;
+  - uniform + audio-peak + motion-peak frames.
+- Propagation: use the SAM2 memory / tracking mechanism to generate complete tubes.
+- Candidate numbers: \(N=16,32,64,128\).
+---
+### 4.2 Tube Matching Definition
+v3 uses the **GT-visible-frame mean tube IoU**, so that late-target or partial-target samples are not diluted by empty frames.
+Let:
+\[
+\mathcal{T}_g = \{t \mid g_t \neq \emptyset\}
+\]
+Then:
+\[
+IoU_{tube}(o_i, g)=
+\frac{1}{|\mathcal{T}_g|}
+\sum_{t \in \mathcal{T}_g}
+IoU(m_{i,t}, g_t)
+\]
+If:
+\[
+\max_i IoU_{tube}(o_i, g) \ge 0.5
+\]
+then the GT is considered covered by the proposals.
+A stricter version is reported as well:
+\[
+IoU_{tube}^{all}
+=
+\frac{1}{T}
+\sum_{t=1}^{T}
+IoU(m_{i,t}, g_t)
+\]
+This version is used to analyze whether a tube produces spurious masks in frames where the GT is absent.
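Both tube IoU variants above can be computed from per-frame boolean masks. The following NumPy sketch is illustrative only: the function names are hypothetical, and treating a frame where both masks are empty as IoU 1.0 in the all-frame variant is an assumption, since the plan does not define that case.

```python
import numpy as np


def mask_iou(pred, gt):
    """IoU of two boolean masks; two empty masks count as 1.0 (assumption)."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def tube_iou(tube_masks, gt_masks, visible_only=True):
    """Mean per-frame IoU between a candidate tube and the GT.

    visible_only=True averages over frames where the GT mask is non-empty
    (the GT-visible-frame mean tube IoU); False averages over all T frames.
    """
    ious = [
        mask_iou(m, g)
        for m, g in zip(tube_masks, gt_masks)
        if not visible_only or g.any()
    ]
    return float(np.mean(ious)) if ious else 0.0


if __name__ == "__main__":
    empty = np.zeros((2, 2), dtype=bool)
    target = np.array([[True, True], [False, False]])
    spurious = np.array([[True, False], [False, False]])
    gt = [empty, target]          # target absent in frame 1
    tube = [spurious, target]     # tube fires in the empty frame
    print(tube_iou(tube, gt), tube_iou(tube, gt, visible_only=False))
```

In the example the tube matches perfectly on the visible frame but fires on the empty frame, so the visible-frame IoU stays high while the all-frame IoU is penalized.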
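+### 4.2.1 Precise Definition of Oracle Refined J/F
+**Oracle Tube J/F**: among the top-N candidate tubes, select the tube with the highest \(IoU_{tube}\) and evaluate the J/F of that tube's masks directly.
+**Oracle Refined J/F**: among the top-N candidate tubes, select the oracle tube, use only that tube's bbox as the SAM / SAM2 box prompt, and evaluate J/F after refinement.
+Constraints:
+1. the GT mask must not be used as a mask prompt;
+2. the oracle GT box must not be used;
+3. the bbox comes from the oracle proposal tube;
+4. the refinement settings must match the actual Stage 4 default settings.
+Only then is Oracle Refined J/F an attainable upper bound for the actual TubeToken refinement, rather than an idealized bound that depends on the GT mask.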
+---
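+### 4.3 Metrics
+| Metric | Explanation |
+|---|---|
+| Recall@16 / 32 / 64 / 128 | Whether a GT tube exists among the top-N tubes |
+| Oracle Tube J/F | Proposal upper bound when the tube with the highest \(IoU_{tube}\) is always selected |
+| Oracle Refined J/F | Upper bound after selecting the oracle tube and running SAM refinement with the proposal bbox prompt only |
+| Proposal coverage by subset | Reported separately on late-target, small, occluded, and unseen |
+| Proposal miss % | Proportion of samples whose GT is not covered |
+| Average tubes per video | Compute cost and pruning difficulty |
+| Proposal generation latency | Efficiency assessment |
+| Tube temporal purity | Whether a tube produces many false positives in frames where the GT is absent |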
+---
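+### 4.4 Go / No-Go Decision Criteria
+In the thresholds below, SimToken always means **SimToken as reproduced by the authors**, not the official numbers alone.
+#### 4.4.1 Milestone 1 Green-Light Conditions
+All of the following hold:
+1. Recall@32 ≥ 85%, with matching defined as GT-visible-frame IoU ≥ 0.5;
+2. Oracle Tube J/F ≥ reproduced SimToken J/F + 5%;
+3. Oracle Refined J/F ≥ Oracle Tube J/F + 3%, indicating clear headroom for SAM refinement;
+4. Small / occluded subset Recall@32 ≥ 70%, so that proposals have no systematic blind spot on the key hard samples.
+Strategy: TubeToken proceeds as planned, and the default Balanced configuration uses \(N=32\).
+#### 4.4.2 Milestone 1 Yellow-Light Conditions
+| Condition | Follow-up strategy |
+|---|---|
+| Recall@32 is 80%-85%, and Oracle Tube J/F meets the green-light condition | Proceed, but default to \(N=64\) and analyze proposal miss prominently in the paper |
+| Oracle Tube J/F is only ≥ SimToken + 2%, but Oracle Refined J/F ≥ SimToken + 5% | Proceed, but shift the paper's focus from selection to refinement; emphasize proposal-conditioned refinement |
+| Recall@32 ≥ 85%, but small/occluded Recall@32 < 70% | Proceed with the main line, but add backup experiments with detector-assisted proposals or high-resolution proposals |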
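+#### 4.4.3 Milestone 1 Red-Light Conditions
+If any one condition holds, pause the TubeToken main line and prioritize switching to EC-SimToken or redoing the proposal stage:
+1. Recall@64 < 80%;
+2. Oracle Tube J/F ≤ reproduced SimToken J/F;
+3. Recall@32 ≥ 85%, and the gap between Oracle Refined J/F and Oracle Tube J/F is < 1%, and Oracle Tube J/F ≤ reproduced SimToken J/F + 2%.
+The third red-light condition uses only quantities observable in Phase 0. It means that proposal quality is only slightly better than SimToken and bbox-only refinement yields almost no gain; in that case TubeToken lacks a sufficient foothold on this dataset and should not rely on an expected selection gain that cannot be verified before Milestone 2.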
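The green-light and red-light conditions above can be written down as a single check, which makes the Phase 0 decision reproducible from the measured numbers. This is a sketch under assumptions: all inputs are in percentage points, "+ 5%" and similar margins are read as absolute points rather than relative gains, and the function name `milestone1_status` is hypothetical. The yellow-light rows are not encoded; anything that is neither green nor red is returned for manual review against the yellow-light table.

```python
def milestone1_status(recall32, recall64, small_occ_recall32,
                      oracle_tube_jf, oracle_refined_jf, simtoken_jf):
    """Classify Phase 0 results as 'red', 'green', or 'review'.

    All arguments are percentage points; simtoken_jf is the reproduced
    SimToken J/F. Margins are treated as absolute points (assumption).
    """
    red = (
        recall64 < 80.0
        or oracle_tube_jf <= simtoken_jf
        or (
            recall32 >= 85.0
            and oracle_refined_jf - oracle_tube_jf < 1.0
            and oracle_tube_jf <= simtoken_jf + 2.0
        )
    )
    if red:
        return "red"
    green = (
        recall32 >= 85.0
        and oracle_tube_jf >= simtoken_jf + 5.0
        and oracle_refined_jf >= oracle_tube_jf + 3.0
        and small_occ_recall32 >= 70.0
    )
    # Everything else must be compared against the yellow-light table by hand.
    return "green" if green else "review"


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(milestone1_status(90, 95, 75, 70, 74, 60))
```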
+---
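+### 4.5 Fallback Strategies if Recall Is Insufficient
+1. Increase the number of keyframes;
+2. use audio-peak / motion-peak keyframes;
+3. for category words that appear in the text, generate boxes with an open-vocabulary detector and pass them to SAM2;
+4. use SimToken / EC-SimToken masks as additional proposals;
+5. introduce a hybrid fallback: if proposal confidence is low, fall back to global semantic prompt segmentation.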
+## 5. Baselines and Model Variants
+### 5.1 Models That Must Be Reproduced / Compared
+| Model | Role |
+|---|---|
+| EEMC | Original Ref-AVS baseline |
+| TSAM | SAM-based Ref-AVS baseline |
+| SAM2-LOVE | SAM2-based Ref-AVS baseline |
+| SimToken | Most direct point of comparison; must be reproduced |
+| EC-SimToken | Strengthened global-token baseline, showing that TubeToken does not only beat a weak baseline |
+| SimToken + SAM2 proposals | Controls for the gain from SAM2 proposals; uses zero-parameter reranking |
+| SAM2 proposals + learned reranker (no null tube) | Separates the contributions of the learned tube reranker and the null tube |
+| SimToken + matched compute | Fair control at equal compute |
+| TubeToken-Minimal | Minimal tube selection framework |
+| TubeToken-Full | Full method |
+If EEMC, TSAM, and SAM2-LOVE cannot be fully reproduced, official results may be cited. However, SimToken, SimToken + SAM2 proposals, SAM2 proposals + learned reranker, SimToken + matched compute, and TubeToken must be compared under the same training / input / evaluation settings.
+---
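+### 5.2 Main TubeToken Ablations
+| Variant | Purpose |
+|---|---|
+| TubeToken-Full | Full model |
+| TubeToken-Minimal | SAM2 proposals + fixed tube feature + selector + null tube; no CondQFormer, no refinement |
+| SAM2 proposals + learned reranker (no null tube) | Separates the contributions of the learned selector and the null tube |
+| w/o null tube | Validates explicit Null modeling |
+| null tube → binary existence head | Compares the null tube with an extra binary classification head |
+| w/o null tube + mask-area threshold | Distinguishes whether Null performance comes from the tube framework or from the null tube design |
+| fixed Q-Former | Verifies that conditioning helps, rather than the added parameters |
+| text-conditioned only | Verifies the contribution of the text condition |
+| audio-conditioned only | Verifies the contribution of the audio condition |
+| text+audio-conditioned | Full conditional compression |
+| w/o inter-tube self-attention | Verifies whether relative comparison between tubes is necessary |
+| independent tube scoring | Each tube is scored independently by a linear layer on \([q_{ref};z_i]\) |
+| w/o SAM refinement | Verifies the capability of tube selection alone |
+| bbox prompt refinement | Default refinement scheme |
+| bbox + semantic prompt refinement | Verifies whether the semantic prompt contributes |
+| bbox + mask prompt refinement | Checks whether the mask prompt brings gains or overfitting |
+| N=16/32/64/128 | Analyzes the candidate count versus recall/efficiency trade-off |
+| stride=4/8/16 | Analyzes the keyframe count versus efficiency trade-off |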
+---
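+### 5.3 Fairness Control Variants
+#### 5.3.1 SimToken + SAM2 proposals: zero-parameter proposal reranking baseline
+Purpose: answer "Does TubeToken improve only because it uses SAM2 proposals?"
+This baseline must use parameter-free reranking; the vague wording "rerank or fusion" is not acceptable.
+Implementation:
+1. Keep SimToken's global `<SEG>` generation and obtain \(F_{seg}\).
+2. Use exactly the same SAM2 proposals and tube construction as TubeToken.
+3. For each proposal tube, extract the temporal mask-pooled features \(f_{i,t}\).
+4. Use the following zero-parameter score: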
+\[
+\text{score}(o_i)
+=
+F_{seg}^{\top}
+\cdot
+\frac{1}{|\mathcal{T}|}
+\sum_t f_{i,t}
+\]
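+5. Select the proposal tube with the highest score and use the same output settings as TubeToken-Minimal.
+This scheme adds no learnable parameters and uses \(F_{seg}\) in the same way as SimToken, which minimizes the chance that reviewers see the control group as weakened.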
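The zero-parameter score above is a dot product between \(F_{seg}\) and the temporal mean of each tube's features. A minimal NumPy sketch follows; it assumes that \(F_{seg}\) and \(f_{i,t}\) already live in the same feature dimension (the plan does not say how this is ensured), and the function name `rerank_tubes` is hypothetical.

```python
import numpy as np


def rerank_tubes(f_seg, tube_feats):
    """Score each tube by F_seg^T * mean_t f_{i,t} and pick the best one.

    f_seg:      (d,) global <SEG> embedding
    tube_feats: list of (T_i, d) arrays of mask-pooled features, one per tube
    Returns (scores, index of the highest-scoring tube).
    """
    scores = np.array([float(f_seg @ feats.mean(axis=0)) for feats in tube_feats])
    return scores, int(np.argmax(scores))


if __name__ == "__main__":
    f_seg = np.array([1.0, 0.0])
    tubes = [
        np.array([[0.0, 1.0], [0.0, 1.0]]),  # orthogonal to F_seg
        np.array([[1.0, 0.0], [3.0, 0.0]]),  # aligned with F_seg
    ]
    print(rerank_tubes(f_seg, tubes))
```

No parameter is learned here, so any gain of this baseline over plain SimToken can be attributed to the proposals rather than to an extra trained module.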
+---
+#### 5.3.2 SAM2 proposals + learned reranker (no null tube)
+Purpose: answer "Does the gain of TubeToken-Minimal come from the learned tube selector or from the null tube?"
+Implementation:
+1. Use the same SAM2 proposals, tube construction, tube features, and \(q_{ref}\) as TubeToken-Minimal.
+2. Train a learned reranker / classifier that scores the non-null candidate tubes.
+3. Do not add a learnable null tube.
+4. Handle Null cases with a mask-area threshold or a calibrated score threshold.
+5. Compare with TubeToken-Minimal: if TubeToken-Minimal is clearly better, the null tube contributes independently; if the learned reranker is already close to TubeToken-Minimal, the main gain comes from learned tube selection.
+---
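+#### 5.3.3 SimToken + matched compute: pre-registered equal-compute baseline
+Purpose: answer "Is TubeToken merely trading compute for performance?"
+v4 fixes a single implementation and no longer keeps multiple candidate schemes:
+> **SimToken + multiple keyframe prompting with the same number of keyframes as TubeToken-Fast.**
+Implementation conventions:
+1. Use the same number of keyframes as TubeToken-Fast, by default the stride=16 keyframe budget of TubeToken-Fast.
+2. Run SimToken's global `<SEG>` / SAM prompting pipeline on each keyframe separately.
+3. Merge the predictions of the multiple keyframes into a video-level mask through a single propagation / aggregation rule, which must be fixed before the experiments.
+4. Do not use SAM2 proposal tube reranking, a learned tube selector, or a null tube.
+5. Report latency, FLOPs, number of SAM/SAM2 calls, and MLLM token count, keeping them as close as possible to the compute budget of TubeToken-Fast.
+Reason for this choice: TubeToken-Fast's extra compute comes mainly from more keyframes and from proposal/propagation processing, and multiple keyframe prompting is the most direct, most interpretable, and hardest-to-dispute equal-compute enhancement on the SimToken side. This baseline must be pre-registered before the experiments begin and cannot be swapped afterwards based on the final results.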
+## 6. Training Design
+### 6.1 Tube label assignment
+For positive videos, the candidate tube with the largest GT-visible-frame mean tube IoU is chosen as the positive tube:
+\[
+i^* = \arg\max_i IoU_{tube}(o_i,g)
+\]
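+If the maximum IoU is below 0.5, the sample is marked as a proposal miss. During training, such a sample:
+- is not used in the tube classification loss;
+- may be used for proposal miss statistics;
+- should not have a low-IoU tube forced into the positive role, which would contaminate the selector.
+For Null samples, the positive class is the null tube.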
+---
+### 6.2 Loss function
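+In v3, the default total loss **does not include the undefined \(\mathcal{L}_{cond}\)**. Null weighting is folded into the tube classification CE instead of being written as a separate \(\mathcal{L}_{null}\).
+Default total loss: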
+\[
+\mathcal{L}
+=
+\mathcal{L}_{tube}^{weighted}
++
+\lambda_m y\mathcal{L}_{mask}
++
+\lambda_r\mathcal{L}_{rank}
+\]
+where:
+\[
+\mathcal{L}_{tube}^{weighted}
+=
+\sum_i
+w_i \cdot
+\text{CE}(P(i \mid video,audio,text), y_i)
+\]
+- positive samples: \(w_i=1\);
+- Null samples: \(w_i=w_{null}\), controlled by the curriculum;
+- \(\mathcal{L}_{mask}\): BCE + Dice, computed only on samples that are neither Null nor proposal miss;
+- \(\mathcal{L}_{rank}\): hard negative ranking loss.
+Hard negative ranking:
+\[
+\mathcal{L}_{rank}
+=
+\sum_{j\in\mathcal{N}}
+\max(0,\Delta-s_{i^*}+s_j)
+\]
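The classification and ranking terms above can be checked numerically on a single sample. The sketch below uses only the standard library and omits \(\mathcal{L}_{mask}\). It assumes the null tube is the last entry of the score vector, that the ranking term is applied only to positive samples, and that the margin and \(\lambda_r\) defaults are placeholders; none of these are fixed by the plan, and the function names are hypothetical.

```python
import math


def cross_entropy(scores, target):
    """Softmax cross-entropy of one sample, computed via log-sum-exp."""
    peak = max(scores)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_z - scores[target]


def hinge_rank_loss(scores, pos, negatives, margin):
    """Sum over hard negatives j of max(0, margin - s_pos + s_j)."""
    return sum(max(0.0, margin - scores[pos] + scores[j]) for j in negatives)


def sample_loss(scores, target, is_null, w_null,
                negatives=(), margin=0.2, lambda_r=1.0):
    """Weighted tube CE plus ranking term for one sample (mask loss omitted).

    scores lists the object tubes followed by the null tube (assumption).
    margin and lambda_r are placeholder values, not settings from the plan.
    """
    weight = w_null if is_null else 1.0
    loss = weight * cross_entropy(scores, target)
    if not is_null:  # ranking only on positive samples (assumption)
        loss += lambda_r * hinge_rank_loss(scores, target, negatives, margin)
    return loss


if __name__ == "__main__":
    # Positive sample: tube 0 is the matched GT tube, tube 1 a hard negative.
    print(sample_loss([2.0, 1.9, 0.0], 0, False, 2.0, negatives=[1]))
```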
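+#### 6.2.1 Optional auxiliary term \(\mathcal{L}_{cond}\)
+If the experiments adopt attention supervision, \(\mathcal{L}_{cond}\) must be defined and ablated on its own; it must not appear in the default loss without a definition.
+Optional definition: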
+\[
+\mathcal{L}_{cond}
+=
+-
+\sum_{t,l}
+\bar{M}_{t,l}
+\log A_{t,l}
+\]
+where:
+- \(A_{t,l}\): CondQFormer attention on the \(l\)-th patch / region of frame \(t\);
+- \(\bar{M}_{t,l}\): normalized GT mask or matched proposal mask;
+- this term is used only on samples with reliable GT spatial supervision.
+If this term is used, the total loss becomes:
+\[
+\mathcal{L}
+=
+\mathcal{L}_{tube}^{weighted}
++
+\lambda_m y\mathcal{L}_{mask}
++
+\lambda_r\mathcal{L}_{rank}
++
+\lambda_c\mathcal{L}_{cond}
+\]
+and results are reported with / without \(\mathcal{L}_{cond}\).
+---
+### 6.3 Multi-expression training for CondQFormer
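+This is the implementation that H3 requires at the training level.
+Precondition: the data audit confirms that multiple referring expressions exist for the same video or the same GT object.
+Training procedure:
+1. For each multi-expression sample, generate SAM2 proposals once to obtain the shared candidate tubes \(\mathcal{O}\).
+2. Within the same batch or gradient accumulation window, sample at least two different expressions: \(r_a, r_b\).
+3. Construct conditioned queries for the same set of tubes: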
+\[
+Q_a = Q_0 + W_t e_{text}^{a} + W_a e_{audio}^{a} + W_{ta}(e_{text}^{a} \odot e_{audio}^{a})
+\]
+\[
+Q_b = Q_0 + W_t e_{text}^{b} + W_a e_{audio}^{b} + W_{ta}(e_{text}^{b} \odot e_{audio}^{b})
+\]
+4. Obtain, respectively:
+\[
+\tilde{z}_{i}^{a} = \text{CondQFormer}(Q_a, \{f_{i,t}\}_{t=1}^{T})
+\]
+\[
+\tilde{z}_{i}^{b} = \text{CondQFormer}(Q_b, \{f_{i,t}\}_{t=1}^{T})
+\]
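+5. Tube proposals are shared, but the tube selection loss is computed independently for each expression.
+6. If both expressions refer to the same GT tube, both selections must be correct. \(\tilde{z}_{i}^{a}\) and \(\tilde{z}_{i}^{b}\) are not forced to be equal, because H3 requires precisely that different expressions expose different evidence.
+7. If the two expressions refer to different targets, they serve as inter-expression hard negatives that strengthen instance discrimination within the same video.
+**Implementation note: gradient-conflict risk.**
+When two expressions require attention to different evidence on the same tube, for example one relying on audio-active frames and the other on spatial position, the shared CondQFormer parameters may receive conflicting gradients and training may oscillate. If loss oscillation, attention collapse, or a clear drop in positive-sample Selection Acc occurs, apply the following mitigations:
+1. place the forward / backward passes of different expressions in the same gradient accumulation window, but compute their gradients separately before accumulating, instead of mixing them in one merged forward pass;
+2. early in training, prefer expression pairs with small semantic differences, such as two visual expressions or two audio expressions;
+3. once training is stable, gradually add cross-modality expression pairs, such as audio-expression vs spatial-expression;
+4. log multi-expression pair types and training stability separately, so that gradient conflict is not mistaken for ineffective conditioning.
+Training records:
+- proportion of multi-expression samples in a batch;
+- number of expressions per shared proposal set;
+- distribution of expression pair types: visual-visual, audio-audio, visual-audio, spatial-audio;
+- comparison between training with and without the multi-expression strategy.
+If the dataset does not support multi-expression training, the paper must weaken the strength of its H3 claims.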
+---
+### 6.4 Null tube curriculum
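+Null tube training is unstable in the early phase, so a curriculum is used:
+| Stage | Epoch | Null weight \(w_{null}\) |
+|---|---:|---:|
+| Warmup | 0-2 | 2.0 |
+| Middle | 3-6 | 1.0 |
+| Final | 7+ | 0.5 |
+Null oversampling is used as well, and its target ratio must be stated explicitly.
+Default settings:
+- target proportion of Null samples per batch: 25%;
+- if the natural Null proportion is above 25%, no downsampling is applied and the natural distribution is used;
+- if the natural Null proportion is below 25%, oversampling makes up the difference;
+- the Null proportion in a single batch should in principle not exceed 33%, except in the dedicated sampling-ratio ablation.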
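The curriculum table and the oversampling rule above translate directly into two small helpers. This is a sketch with hypothetical function names; the epoch boundaries and the 25% target come from the text above, and the 33% cap is left out because it only matters when the natural ratio already exceeds the target.

```python
def null_weight(epoch):
    """Null weight w_null by epoch, following the curriculum table above."""
    if epoch <= 2:      # Warmup: epochs 0-2
        return 2.0
    if epoch <= 6:      # Middle: epochs 3-6
        return 1.0
    return 0.5          # Final: epoch 7 onward


def effective_null_ratio(natural_ratio, target_ratio=0.25):
    """Per-batch Null proportion after oversampling.

    Below the target, Null samples are oversampled up to the target;
    above it, the natural distribution is kept (no downsampling).
    """
    return max(natural_ratio, target_ratio)


if __name__ == "__main__":
    print([null_weight(e) for e in range(9)])
    print(effective_null_ratio(0.10), effective_null_ratio(0.40))
```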
+The effect of the Null sampling ratio on the following metrics must be reported:
+- Null FPR;
+- Positive FNR;
+- Null S;
+- Tube Selection Acc@1;
+- proportion of "GT tube Top-3 but null tube Top-1" errors.
+Null sampling ratio ablation:
+| Ratio | Purpose |
+|---:|---|
+| 0% | no oversampling baseline |
+| 12.5% | weak oversampling |
+| 25% | default setting |
+| 33% | stronger oversampling |
+| 50% | check whether it causes overly conservative null predictions |
+---
+### 6.5 Hard negative mining
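+Hard negative mining is introduced in stages to avoid tangled engineering dependencies.
+#### Milestone 2: TubeToken-Minimal stage
+Use only hard negatives that do not depend on CondQFormer:
+1. tubes with relatively high IoU to the GT that are not the target;
+2. tubes spatially close to the GT bbox / mask;
+3. tubes whose mask-pooled visual feature is similar to the GT;
+4. if category labels exist, different instances of the same category.
+#### Milestone 3: CondQFormer stage
+Add text/audio mismatch negatives:
+1. similar to the text but not matching the audio;
+2. synchronized with the audio but not matching the text;
+3. tubes highly correlated with an audio-critical expression that are not the GT;
+4. high-scoring wrong tubes among same-category distractors;
+5. when different expressions in the same video refer to different targets, the target tubes of the other expressions serve as hard negatives.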
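+## 7. Evaluation Metrics
+### 7.1 Standard Metrics
+| Metric | Description |
+|---|---|
+| Seen J / F / J&F | Segmentation quality on seen categories |
+| Unseen J / F / J&F | Generalization to unseen categories |
+| Mix J / F / J&F | Overall performance |
+| Null S | Empty-target performance on the Null subset |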
+---
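+### 7.2 TubeToken-specific metrics
+| Metric | Description |
+|---|---|
+| Recall@N | whether the proposal stage covers the GT |
+| Oracle Tube J/F | proposal upper bound |
+| Oracle Refined J/F | upper bound of proposal + bbox-only refinement |
+| Tube Selection Acc@1 | when the GT tube is covered, whether the Top-1 prediction is the matched GT tube |
+| Tube Selection Acc@3 | whether the matched GT tube is in the Top-3 |
+| GT Top-3 but Null Top-1 Rate | fraction of samples where the GT tube is in the Top-3 but the null tube ranks first |
+| Null Accuracy | whether the null tube is correctly selected |
+| Null FPR | fraction of Null videos where a non-null tube is wrongly selected |
+| Positive FNR | fraction of positive videos where the null tube is wrongly selected |
+| Existence AUC | discriminative power of \(p_{exist}=1-P(null)\) |
+| Reliability Diagram / ECE | whether the existence probability is calibrated |
+| Refinement Gain | J/F improvement from before to after SAM refinement |
+| Latency / FPS / Memory | efficiency metrics |
+| \(AC\) | whether attention mass concentrates on the GT region / GT tube |
+| \(\widehat{AC}_{tube}\) | normalized tube-level AC, defined as \(N\cdot AC_{tube}\), for comparison across different N |
+| H3 Cosine Similarity Gap | difference in \(\tilde{z}_i\) similarity between the conditioned and fixed Q-Former for the same tube under different expressions |
+**Definition of Tube Selection Acc:**
+Among samples whose GT tube is covered by the proposals, the fraction where the selector's Top-1 prediction is the matched GT tube. Proposal-miss samples are excluded from this metric but must be reported separately.
+**Null handling in Selection Acc@3:**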
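+When evaluating object-level Top-3 on positive samples, first remove the null tube from the candidate ranking, then check whether the matched GT tube is in the Top-3. Otherwise the null tube occupies one of the three slots: a case where the null tube ranks 2nd and the GT tube ranks 4th in the full ranking would be counted as an object-selection failure even though the GT tube is third among the object tubes. Cases related to null calibration are reported separately with **GT Top-3 but Null Top-1 Rate**, which is computed on the full ranking that includes the null tube.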
+If there are fewer than 200 Null samples, the Reliability Diagram serves as the primary calibration analysis and ECE only as a supplementary number.
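The ranking rules above can be sketched for a single positive sample as follows. This is an illustrative sketch only: `rank_metrics`, the convention that index 0 is the null tube, and the example scores are assumptions, not part of the plan.

```python
def rank_metrics(scores, gt_index, null_index=0, k=3):
    """Return (acc@1, object-level acc@k, gt_topk_but_null_top1) for one positive sample."""
    # Full ranking, best score first; it still contains the null tube.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    acc1 = order[0] == gt_index
    # Object-level ranking: drop the null tube before checking Top-k.
    object_order = [i for i in order if i != null_index]
    acc_k = gt_index in object_order[:k]
    # Calibration case: evaluated on the full ranking, null tube included.
    null_top1 = order[0] == null_index and gt_index in order[:k]
    return acc1, acc_k, null_top1


# Index 0 is the null tube (assumed convention); the matched GT tube is index 3.
scores = [0.30, 0.10, 0.25, 0.20, 0.15]
print(rank_metrics(scores, gt_index=3))
```

In this example the null tube ranks first and the GT tube ranks third overall, so the sample fails Acc@1, passes object-level Acc@3, and counts toward GT Top-3 but Null Top-1.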
+---
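+### 7.3 Error decomposition
+Each failed sample is assigned to one of:
+| Error type | Criterion |
+|---|---|
+| Proposal miss | none of the top-N candidate tubes has GT-visible-frame mean IoU ≥ 0.5 with the GT |
+| Selection error | the GT tube exists, and a non-null prediction wrongly selects another object tube |
+| Refinement error | the selector is correct, but the refined mask J/F is clearly low |
+| Null false positive | a non-null tube is selected in a Null video |
+| Null false negative | the null tube is selected in a positive video |
+| GT tube Top-3 but Null Top-1 | in a positive sample the matched GT tube is in the Top-3, but the null tube scores highest |
+The last type should not simply be folded into Selection error or Null FN. It indicates that the model can identify the candidate but that its existence / null calibration is off.
+**Mutually exclusive assignment priority:**
+Error decomposition must place every failed sample in exactly one category so that the shares do not overlap. The default priority is:
+1. Proposal miss;
+2. Null FN with GT Top-3, i.e. a positive sample where null ranks 1st and the matched GT tube is in the object-level Top-3;
+3. Null FN without GT Top-3;
+4. Selection error;
+5. Refinement error;
+6. Null FP.
+When reporting, categories 2 and 3 may be merged into a total Null FN, with GT Top-3 but Null Top-1 listed separately as the calibration subtype of Null FN.
+This analysis must be reported separately on the Seen, Unseen, Null, late-target, same-category distractor, and audio-critical subsets.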
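The mutually exclusive priority can be written as a single chain of checks. The sketch below assumes each sample has already been reduced to a few boolean flags; the function name, flag names, and category labels are hypothetical.

```python
def classify_failure(is_positive, proposal_hit, top1_is_null,
                     gt_in_object_top3, top1_is_gt, refined_ok):
    """Assign one sample to exactly one category, following the priority order above."""
    if is_positive:
        if not proposal_hit:
            return "proposal_miss"            # priority 1
        if top1_is_null:
            # priorities 2 and 3: Null FN, split by whether the GT tube is in the object-level Top-3
            return "null_fn_gt_top3" if gt_in_object_top3 else "null_fn_no_gt_top3"
        if not top1_is_gt:
            return "selection_error"          # priority 4
        if not refined_ok:
            return "refinement_error"         # priority 5
        return "no_failure"
    # Null video: the only possible failure is selecting a non-null tube.
    return "no_failure" if top1_is_null else "null_fp"  # priority 6


print(classify_failure(True, True, True, True, False, True))
```

Because the first matching branch returns, a sample that is both a proposal miss and a Null FN is counted only as a proposal miss, which keeps the category shares summing to 100%.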
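+## 8. Diagnostic experiments
+### 8.1 Does conditioning really work
+v3 splits the conditioning diagnosis into two levels:
+1. **Correctness level**: whether the model attends to the correct GT region / GT tube. Measured by AC and \(\widehat{AC}_{tube}\).
+2. **Expression-sensitivity level**: whether the same tube yields different evidence summaries under different referring expressions. Measured by H3 direct validation.
+The two levels must not be conflated. High AC only shows that the model attends to the correct object; it does not directly prove H3.
+#### 8.1.1 H3 direct validation: representation difference of the same tube under different expressions
+Applicable subset: 3.2.6 Multi-expression H3 subset.
+Experimental setup:
+1. generate shared candidate tubes once per video;
+2. find the matched GT tube \(o_{i^*}\);
+3. run the fixed Q-Former and the conditioned Q-Former separately on two expressions \(r_a,r_b\) of the same video;
+4. record the output representations of the same tube: \(\tilde{z}_{i^*}^{a}\) and \(\tilde{z}_{i^*}^{b}\).
+Metric: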
+\[
+\text{CosSim}_{same\ tube}
+=
+\cos(\tilde{z}_{i^*}^{a},\tilde{z}_{i^*}^{b})
+\]
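+Report:
+| Model | Same-tube cross-expression CosSim | Selection Acc@1 | H3 interpretation |
+|---|---:|---:|---|
+| Fixed Q-Former | 1.0 |  | expression-independent; deterministic identity baseline |
+| Text-conditioned |  |  | whether text differences change the tube summary |
+| Audio-conditioned |  |  | whether audio differences change the tube summary |
+| Text+Audio-conditioned |  |  | whether full conditioning yields the largest difference |
+Expected results:
+- the fixed Q-Former's cross-expression CosSim is \(\equiv 1.0\); this is a deterministic baseline, not an empirical approximation;
+- the Text+Audio-conditioned Q-Former's CosSim is significantly below 1.0;
+- the CosSim drop must not come at the cost of lower Selection Acc;
+- if CosSim shows no difference but performance still improves, the paper should say “learned compression improves selection” rather than claim “expression-conditioned evidence summarization”.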
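As an illustration of the metric, the sketch below computes the same-tube cross-expression cosine similarity with NumPy. The vectors are made-up stand-ins for \(\tilde{z}_{i^*}^{a}\) and \(\tilde{z}_{i^*}^{b}\); no real Q-Former is involved.

```python
import numpy as np


def cos_sim(u, v):
    """Cosine similarity between two 1-D representation vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Fixed Q-Former: the tube summary ignores the expression, so both outputs are identical.
z_fixed_a = np.array([0.2, -1.0, 0.5, 0.3])
z_fixed_b = z_fixed_a.copy()

# Conditioned Q-Former: hypothetical summaries of the same tube under expressions r_a and r_b.
z_cond_a = np.array([0.9, -0.1, 0.4, 0.0])
z_cond_b = np.array([0.1, -0.8, 0.2, 0.6])

print("fixed CosSim:", round(cos_sim(z_fixed_a, z_fixed_b), 6))
print("conditioned CosSim:", round(cos_sim(z_cond_a, z_cond_b), 6))
```

In practice the per-pair values would be averaged over all expression pairs in the H3 subset and reported next to Selection Acc@1.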
+---
+#### 8.1.2 Attention Concentration metric
+For patch-level or frame-level attention \(A\), define:
+\[
+AC
+=
+\frac{
+\sum_{t,l} A_{t,l} \cdot \mathbf{1}[(t,l)\in GT]
+}{
+\sum_{t,l} A_{t,l}
+}
+\]
+If attention is tube-level, the raw tube attention concentration is:
+\[
+AC_{tube}
+=
+\sum_i A_i \cdot \mathbf{1}[i=i^*]
+\]
+However, \(AC_{tube}\) depends on the number of candidates \(N\). To make values comparable across different N, v3 uses the normalized version:
+\[
+\widehat{AC}_{tube}=N\cdot AC_{tube}
+\]
+Here the random (uniform-attention) baseline is always 1.0, and the value is \(N\) when attention is fully concentrated on the GT tube.
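Both quantities are small enough to sketch directly. The code below is a minimal illustration that assumes the tube-level attention is normalized to sum to 1 before scaling by \(N\); the function names and array layout are hypothetical.

```python
import numpy as np


def attention_concentration(attn, gt_mask):
    """AC: share of attention mass on GT positions. attn and gt_mask share one shape, e.g. (T, L)."""
    attn = np.asarray(attn, dtype=np.float64)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    return float(attn[gt_mask].sum() / attn.sum())


def normalized_tube_ac(tube_attn, gt_index):
    """N * AC_tube, with tube attention renormalized to sum to 1 (assumption)."""
    tube_attn = np.asarray(tube_attn, dtype=np.float64)
    tube_attn = tube_attn / tube_attn.sum()
    return float(len(tube_attn) * tube_attn[gt_index])


uniform = np.full(4, 0.25)                  # random baseline over N=4 tubes
one_hot = np.array([0.0, 0.0, 1.0, 0.0])    # all attention on the GT tube
print(normalized_tube_ac(uniform, gt_index=2))
print(normalized_tube_ac(one_hot, gt_index=2))
```

The two printed values correspond to the random baseline and the fully concentrated case described above.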
+Compare:
+- fixed Q-Former;
+- text-conditioned;
+- audio-conditioned;
+- text+audio-conditioned.
+Report separately on the following expression types:
+1. audio-related expressions;
+2. spatial relation expressions;
+3. category-only expressions;
+4. same-category distractor samples;
+5. multi-expression H3 subset.
+---
+### 8.2 Audio robustness
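+| Experiment | Purpose |
+|---|---|
+| audio removed | test the overall contribution of the audio module |
+| audio amplitude zeroed, temporal length preserved | distinguish missing audio from all-zero audio features; check whether the model only exploits an “audio present or not” signal |
+| audio shuffled | test reliance on temporal synchronization |
+| same-category audio swapped | test reliance on fine-grained audio differences |
+| cross-category audio swapped | test whether audio semantics are used, rather than mere audio presence |
+| audio-text conflict | test whether the model degrades gracefully under conflicting conditions |
+| strict audio-critical subset | test the gain on audio-critical samples |
+**Grouping requirements for audio swapping:**
+1. Same-category swap: e.g. replace a guitar sound with another guitar clip;
+2. Cross-category swap: e.g. replace a guitar sound with a dog bark or a human voice.
+Only if the cross-category swap causes a significant drop, and zeroed audio and removed audio differ in an interpretable way, is there stronger evidence that the model really uses audio semantics.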
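Two of the perturbations in the table can be sketched on pre-extracted audio features. The sketch assumes features shaped `(T, D)` (time steps by feature dimension), which is an assumption about the feature layout rather than something stated in the plan; the function names are hypothetical.

```python
import numpy as np


def zero_audio(feats):
    """Amplitude-zeroed control: all-zero features with the temporal length preserved."""
    return np.zeros_like(feats)


def shuffle_audio(feats, rng):
    """Shuffled control: permute time steps to break audio-visual synchronization."""
    return feats[rng.permutation(feats.shape[0])]


rng = np.random.default_rng(0)
feats = np.arange(12, dtype=np.float32).reshape(4, 3)  # toy (T=4, D=3) features
zeroed = zero_audio(feats)
shuffled = shuffle_audio(feats, rng)
print(zeroed.shape, shuffled.shape)
```

The "audio removed" condition differs from `zero_audio` in that the audio input is dropped entirely instead of being replaced by zeros of the same length, which is the distinction the second table row is meant to probe.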
+---
+### 8.3 First-frame bias / temporal coverage
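+| Experiment | Purpose |
+|---|---|
+| late-target subset | whether TubeToken beats SimToken when the target appears in the second half |
+| keyframe stride ablation | analyze how keyframe coverage affects performance |
+| partial target subset | test robustness when the target appears in only some frames |
+| target disappears subset | test tracking stability |
+| GT-visible-frame IoU vs all-frame IoU | separate target localization quality from spurious-mask issues |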
+---
+### 8.4 Same-category distractor
+Report:
+- TubeToken vs SimToken;
+- w/ self-attention vs w/o self-attention;
+- hard-negative ranking loss ablation;
+- Selection Acc@1 / Acc@3;
+- error decomposition on same-category distractor samples.
+The focus is to verify whether TubeToken reduces confusion between same-category instances.
+---
+### 8.5 Null threshold sensitivity
+Although TubeToken uses a null tube and needs no hand-tuned mask-area threshold, it is still necessary to show how
+\[
+p_{exist}=1-P(null)
+\]
+behaves under different thresholds in terms of:
+- Null FPR;
+- Positive FNR;
+- J&F;
+- Null S;
+- GT tube Top-3 but Null Top-1 Rate.
+This shows whether the model is sensitive to the threshold.
+Also compare:
+1. null tube;
+2. binary existence head;
+3. mask-area threshold.
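A threshold sweep over \(p_{exist}\) for the two error rates can be sketched as below. The probabilities and labels are toy values, `null_threshold_sweep` is a hypothetical helper, and J&F and Null S are left out because they need the predicted masks.

```python
import numpy as np


def null_threshold_sweep(p_exist, is_positive, thresholds):
    """Return (threshold, Null FPR, Positive FNR) rows; both classes must be present."""
    p_exist = np.asarray(p_exist, dtype=np.float64)
    is_positive = np.asarray(is_positive, dtype=bool)
    rows = []
    for t in thresholds:
        pred_exist = p_exist >= t
        null_fpr = float(pred_exist[~is_positive].mean())     # Null videos predicted non-empty
        pos_fnr = float((~pred_exist[is_positive]).mean())    # positive videos predicted null
        rows.append((t, null_fpr, pos_fnr))
    return rows


p_exist = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]                 # toy p_exist = 1 - P(null)
is_positive = [True, True, True, False, False, False]
for t, fpr, fnr in null_threshold_sweep(p_exist, is_positive, [0.3, 0.5, 0.8]):
    print(f"t={t:.1f}  Null FPR={fpr:.3f}  Positive FNR={fnr:.3f}")
```

A flat region in both curves around the default operating point would indicate low threshold sensitivity.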
+## 9. Efficiency and fair compute comparison
+Reviewers will question whether TubeToken merely trades compute for performance, so efficiency and equal-compute controls must be reported proactively.
+### 9.1 Efficiency items to report
+| Item | Description |
+|---|---|
+| Proposal generation time | SAM2 AMG + keyframe processing, measured per video |
+| Tracking / propagation time | SAM2 memory propagation |
+| Tube selection time | conditional compression + selector, measured per expression |
+| SAM refinement time | bbox prompt refinement |
+| Total latency per video | full inference time; single-expression and multi-expression cases must be distinguished |
+| FPS | video-level speed |
+| Peak GPU memory | GPU memory usage |
+| MLLM token count | compared with SimToken |
+| Number of SAM/SAM2 calls | makes the compute cost transparent |
+| Candidate tube number | N=16/32/64/128 |
+| Keyframe stride | stride=4/8/16 |
+| Amortized proposal cost per expression | in multi-expression cases, SAM2 proposal generation runs once per video and is amortized over K expressions |
+| Per-expression incremental cost | incremental time of CondQFormer, selector, and refinement for each expression |
+---
+### 9.2 Three TubeToken configurations
+| Configuration | Default setting | Purpose |
+|---|---|---|
+| Fast | N=16, stride=16 | close to the SimToken compute budget |
+| Balanced | N=32, stride=8 | trade-off between performance and efficiency |
+| Accuracy | N=64 or 128, stride=4 | best possible performance |
+---
+### 9.3 Equal-compute comparison
+The following must be included:
+1. **SimToken + matched compute**, fixed as multiple keyframe prompting with the same number of keyframes as TubeToken-Fast;
+2. **SimToken + SAM2 proposals**;
+3. **SAM2 proposals + learned reranker (no null tube)**;
+4. **TubeToken-Fast**.
+Report the performance of these variants at comparable latency / FLOPs / number of SAM calls. The implementation of the matched compute baseline must be fixed before the experiments; it must not be picked after the fact, based on results, from candidates such as multi-scale prompting or multiple decode attempts.
+If TubeToken-Fast clearly outperforms SimToken + matched compute, this strongly answers the “just trading compute for performance” objection.
+### 9.4 Proposal amortization in multi-expression cases
+If a video has \(K\) referring expressions, TubeToken's inference cost should be decomposed as:
+\[
+C_{video}
+=
+C_{proposal}^{video}
++
+K\cdot(C_{cond}^{expr}+C_{select}^{expr}+C_{refine}^{expr})
+\]
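+where \(C_{proposal}^{video}\) is the one-off per-video cost of SAM2 AMG + propagation and must not be wrongly counted \(K\) times. The following must therefore also be reported:
+| Metric | Definition |
+|---|---|
+| Proposal cost per video | one-off cost of generating candidate tubes for a video |
+| Amortized proposal cost per expression | \(C_{proposal}^{video}/K\) |
+| Incremental expression cost | per-expression cost of CondQFormer + selector + refinement |
+| Total cost for K expressions | \(C_{proposal}^{video}+K\cdot C_{expr}\), with \(C_{expr}=C_{cond}^{expr}+C_{select}^{expr}+C_{refine}^{expr}\) |
+This prevents reviewers from assuming that TubeToken reruns SAM2 proposals for every expression, and it also shows TubeToken's potential efficiency advantage on multi-expression videos.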
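The cost decomposition is plain arithmetic and can be checked with a few lines. The timings below are invented placeholders (in seconds), not measurements, and `cost_breakdown` is a hypothetical helper.

```python
def cost_breakdown(c_proposal, c_cond, c_select, c_refine, k):
    """Split the per-video cost into the four quantities in the table above."""
    c_expr = c_cond + c_select + c_refine  # incremental cost of one expression
    return {
        "proposal_per_video": c_proposal,
        "amortized_proposal_per_expr": c_proposal / k,
        "incremental_per_expr": c_expr,
        "total_for_k": c_proposal + k * c_expr,
    }


# Placeholder timings: 12 s of proposals per video, 1 s per expression, K = 6 expressions.
costs = cost_breakdown(c_proposal=12.0, c_cond=0.25, c_select=0.25, c_refine=0.5, k=6)
print(costs)
```

With these placeholder numbers the total is 18 s for six expressions, compared with 78 s if proposals were regenerated for every expression.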
+---
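+## 10. Main table design
+### 10.1 Main comparison table
+The main table keeps only public baselines, the reproduced main baseline, and the main TubeToken configurations, so that it is not bloated with every fairness-control variant. Fairness controls go separately into 10.2.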
+| Method | Seen J&F | Unseen J&F | Mix J&F | Null S | FPS | Memory |
+|---|---:|---:|---:|---:|---:|---:|
+| EEMC |  |  |  |  |  |  |
+| TSAM |  |  |  |  |  |  |
+| SAM2-LOVE |  |  |  |  |  |  |
+| SimToken official |  |  |  |  |  |  |
+| SimToken reproduced |  |  |  |  |  |  |
+| EC-SimToken |  |  |  |  |  |  |
+| TubeToken-Balanced |  |  |  |  |  |  |
+| TubeToken-Accuracy |  |  |  |  |  |  |
+---
+### 10.2 Fairness analysis table
+This table specifically addresses fairness: whether TubeToken's gains come from SAM2 proposals, learned reranking, the null tube, or extra compute.
+| Method | Matched Proposal? | Matched Compute? | Null Modeling | Seen J&F | Unseen J&F | Mix J&F | Null S | FPS |
+|---|---|---|---|---:|---:|---:|---:|---:|
+| SimToken reproduced | No | Base | Implicit / mask output |  |  |  |  |  |
+| SimToken + SAM2 proposals zero-param rerank | Yes | No | SimToken implicit |  |  |  |  |  |
+| SAM2 proposals + learned reranker (no null tube) | Yes | Partial | threshold / calibrated score |  |  |  |  |  |
+| SimToken + matched compute (multiple keyframe prompting) | No | Yes, TubeToken-Fast budget | SimToken implicit |  |  |  |  |  |
+| TubeToken-Minimal | Yes | TubeToken-Fast/Balanced reported | learnable null tube |  |  |  |  |  |
+| TubeToken-Fast | Yes | Yes | learnable null tube |  |  |  |  |  |
+---
+### 10.3 Proposal analysis table
+| Split | Recall@16 | Recall@32 | Recall@64 | Oracle Tube J&F | Oracle Refined J&F bbox-only | Proposal Miss % |
+|---|---:|---:|---:|---:|---:|---:|
+| Seen |  |  |  |  |  |  |
+| Unseen |  |  |  |  |  |  |
+| Late-target |  |  |  |  |  |  |
+| Small/occluded |  |  |  |  |  |  |
+| Audio-critical |  |  |  |  |  |  |
+| Multi-expression H3 subset |  |  |  |  |  |  |
+---
+### 10.4 Ablation table
+| Variant | Seen J&F | Unseen J&F | Null S | Selection Acc@1 | Null FPR | GT Top-3 Null Top-1 | FPS |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| Full |  |  |  |  |  |  |  |
+| TubeToken-Minimal |  |  |  |  |  |  |  |
+| SAM2 proposals + learned reranker (no null tube) |  |  |  |  |  |  |  |
+| w/o null tube |  |  |  |  |  |  |  |
+| binary existence head |  |  |  |  |  |  |  |
+| mask-area threshold |  |  |  |  |  |  |  |
+| fixed Q-Former |  |  |  |  |  |  |  |
+| text-only cond |  |  |  |  |  |  |  |
+| audio-only cond |  |  |  |  |  |  |  |
+| text+audio cond |  |  |  |  |  |  |  |
+| w/o multi-expression training |  |  |  |  |  |  |  |
+| w/ optional \(\mathcal{L}_{cond}\) |  |  |  |  |  |  |  |
+| w/o self-attn |  |  |  |  |  |  |  |
+| independent scoring |  |  |  |  |  |  |  |
+| w/o refinement |  |  |  |  |  |  |  |
+| bbox+mask prompt |  |  |  |  |  |  |  |
+---
+### 10.5 Error decomposition table
+| Split | Proposal Miss | Selection Error | Refinement Error | Null FP | Null FN | GT Top-3 but Null Top-1 |
+|---|---:|---:|---:|---:|---:|---:|
+| Seen |  |  |  | - |  |  |
+| Unseen |  |  |  | - |  |  |
+| Null | - | - | - |  | - | - |
+| Same-category |  |  |  | - |  |  |
+| Late-target |  |  |  | - |  |  |
+| Audio-critical |  |  |  | - |  |  |
+| Multi-expression H3 subset |  |  |  | - |  |  |
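+Note: Late-target, Same-category, and Audio-critical are usually positive-sample subsets, so Null FP does not apply and is marked “-”; if a subset's definition includes Null samples, it must be split into positive / null rows.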
+---
+### 10.6 Conditioning analysis table
+| Model | Overall \(\widehat{AC}_{tube}\) | Audio-expression \(\widehat{AC}_{tube}\) | Spatial-expression \(\widehat{AC}_{tube}\) | Same-category \(\widehat{AC}_{tube}\) | Cross-expression CosSim | Selection Acc@1 |
+|---|---:|---:|---:|---:|---:|---:|
+| Fixed Q-Former |  |  |  |  |  |  |
+| Text-conditioned |  |  |  |  |  |  |
+| Audio-conditioned |  |  |  |  |  |  |
+| Text+Audio-conditioned |  |  |  |  |  |  |
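+## 11. Visualization plan
+### 11.1 Required visualizations
+1. **Tube selection visualization**
+   Show the top-5 candidate tubes, the selector scores, and the final selection.
+2. **Null case visualization**
+   Show the null tube scoring highest and an empty mask being output.
+3. **Same-category distractor**
+   Show two similar objects, with TubeToken selecting the correct target tube.
+4. **Late-target case**
+   Show that TubeToken still finds the target through tube selection when the target is absent from the first frame.
+5. **Conditional attention map**
+   For the same video under different expressions, show the compressor attending to different tubes / time segments.
+6. **Attention Concentration visualization**
+   Show the difference in attention mass between the fixed and the conditioned Q-Former.
+7. **Failure cases**
+   Show at least the three failure types: proposal miss, selection error, and refinement error.
+---
+### 11.2 Visualization standard
+Each case should include:
+- input video keyframes;
+- expression;
+- audio waveform or audio activity;
+- candidate tubes;
+- selection scores;
+- selected tube;
+- final mask;
+- GT mask;
+- the corresponding error category or diagnostic subset label.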
+---
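+## 12. Implementation order and milestones
+### Phase -1: Data audit and SimToken reproduction
+Goal: confirm whether H3 has a data basis, and establish the main reference for all Go/No-Go decisions.
+Deliverables:
+- SimToken reproduced result;
+- analysis of reproduced vs official differences;
+- multi-expression audit;
+- H3 subset construction result;
+- Null sample ratio and batch sampling plan.
+The two Phase -1 tasks can start in parallel: the SimToken reproduction establishes the main reference for all thresholds, and the multi-expression audit determines how strongly H3 can be argued.
+Go / No-Go conditions:
+| Phase -1 result | Recommendation |
+|---|---|
+| SimToken reproduction within 1.5 J&F of the official result, and mean expressions per video > 1.5 | proceed fully with Phase 0 under the v4 plan; H3 remains a P0 direct validation |
+| SimToken reproduction within 1.5 J&F of the official result, but each video has essentially only 1 expression | proceed with Phase 0, but demote H3 direct validation from P0 to P2 and use the fallback narrative |
+| SimToken reproduction differs by > 1.5 J&F | pause subsequent experiments and first locate the reproduction gap, since all Go/No-Go thresholds depend on this reference |
+At the end of Phase -1 it must be stated explicitly whether H3 falls under strong validation, weak validation, or narrative fallback.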
+---
+### Milestone 1: Data audit and proposal recall
+Goal: determine whether TubeToken is feasible.
+Deliverables:
+- data statistics table;
+- Recall@N;
+- Oracle Tube J/F;
+- Oracle Refined J/F bbox-only;
+- proposal miss analysis;
+- go / no-go decision.
+Green-light conditions:
+- Recall@32 ≥ 85%;
+- Oracle Tube J/F ≥ reproduced SimToken J/F + 5%;
+- Oracle Refined J/F ≥ Oracle Tube J/F + 3%;
+- Small / occluded subset Recall@32 ≥ 70%.
+Yellow-light conditions:
+- Recall@32 is 80%-85% but Oracle Tube J/F meets the green-light condition: proceed, but default to N=64;
+- Oracle Tube J/F is only ≥ SimToken + 2%, but Oracle Refined J/F ≥ SimToken + 5%: proceed, but shift the paper's focus to refinement.
+Red-light conditions:
+- Recall@64 < 80%;
+- Oracle Tube J/F ≤ reproduced SimToken J/F;
+- Recall@32 ≥ 85%, and the gap between Oracle Refined J/F and Oracle Tube J/F is < 1%, and Oracle Tube J/F ≤ reproduced SimToken J/F + 2%;
+- proposals have unacceptable systematic blind spots on small / occluded / unseen targets.
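The quantitative part of these conditions can be encoded as a small decision helper. The sketch below makes two assumptions: all inputs are fractions in [0, 1], and the "+5%", "+3%", "+2%" margins are absolute points (0.05, 0.03, 0.02). The qualitative blind-spot condition is not encoded, and `milestone1_light` is a hypothetical name.

```python
def milestone1_light(r32, r64, small_r32, oracle_tube, oracle_refined, simtoken):
    """Map proposal-stage numbers to 'red', 'green', 'yellow', or 'undecided'."""
    red = (
        r64 < 0.80
        or oracle_tube <= simtoken
        or (r32 >= 0.85
            and oracle_refined - oracle_tube < 0.01
            and oracle_tube <= simtoken + 0.02)
    )
    if red:
        return "red"
    green = (
        r32 >= 0.85
        and oracle_tube >= simtoken + 0.05
        and oracle_refined >= oracle_tube + 0.03
        and small_r32 >= 0.70
    )
    if green:
        return "green"
    yellow = (
        (0.80 <= r32 < 0.85 and oracle_tube >= simtoken + 0.05)
        or (oracle_tube >= simtoken + 0.02 and oracle_refined >= simtoken + 0.05)
    )
    # Combinations not covered by any listed condition need a manual decision.
    return "yellow" if yellow else "undecided"


# Hypothetical numbers, for illustration only.
print(milestone1_light(0.90, 0.95, 0.80, 0.85, 0.90, 0.75))
print(milestone1_light(0.60, 0.75, 0.40, 0.75, 0.78, 0.76))
```

Checking the red conditions first mirrors the intent that any single red-light trigger blocks progress regardless of the other numbers.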
+---
+### Milestone 2: TubeToken-Minimal + Fairness Controls
+Implement the minimal version:
+- SAM2 proposals;
+- tube construction;
+- fixed tube feature;
+- selector + null tube;
+- no conditional Q-Former;
+- no SAM refinement.
+Also implement the fairness controls:
+1. SimToken + SAM2 proposals with zero-parameter reranking;
+2. SAM2 proposals + learned reranker (no null tube);
+3. SimToken + matched compute;
+4. w/o null tube + mask-area threshold.
+Goal: verify that object tube selection outperforms the global token baseline, and rule out the explanations “SAM2 proposals are simply stronger” and “it simply uses more compute”.
+Green-light conditions:
+- TubeToken-Minimal beats reproduced SimToken by ≥ 2% on both Seen and Unseen J&F;
+- TubeToken-Minimal beats SimToken + SAM2 proposals;
+- TubeToken-Minimal's Null S ≤ SimToken Null S × 1.5;
+- Tube Selection Acc@1 ≥ 70%.
+Yellow-light conditions:
+- TubeToken-Minimal beats SimToken but not SimToken + SAM2 proposals: the proposals dominate the gain, so the selector must be strengthened or the paper narrative adjusted;
+- TubeToken-Minimal beats SimToken only on the Null subset, with Seen / Unseen on par: continue to Milestone 3, but Minimal cannot be presented as a main contribution.
+Red-light conditions:
+- TubeToken-Minimal beats neither SimToken on Seen / Unseen nor SimToken + SAM2 proposals: redesign the selector or fall back to EC-SimToken.
+---
+### Milestone 3: Add Conditional Compression
+Implement:
+- fixed Q-Former;
+- text-conditioned Q-Former;
+- audio-conditioned Q-Former;
+- text+audio-conditioned Q-Former;
+- multi-expression training;
+- H3 cosine similarity validation.
+Goal: show that conditioning itself is effective, rather than the gain coming from the extra parameters of a learnable Q-Former.
+Required deliverables:
+- conditioning ablation;
+- \(\widehat{AC}_{tube}\);
+- H3 cross-expression CosSim;
+- attention visualization;
+- audio-critical subset results;
+- audio zeroed / removed / shuffled / swapped robustness.
+Green-light conditions:
+- the Text+Audio conditioned Q-Former beats the Fixed Q-Former by ≥ 1.5% on both Seen and Unseen;
+- on audio-related expressions, conditioned \(\widehat{AC}_{tube}\) ≥ fixed × 1.3;
+- for different expressions of the same video, CondQFormer's \(\tilde{z}_i\) CosSim is clearly lower than the Fixed Q-Former's;
+- performance gain ≥ 2% on the strict audio-critical subset.
+Yellow-light conditions:
+- CondQFormer gives a clear overall gain but the \(\widehat{AC}_{tube}\) difference is not significant: reframe the paper as learned tube compression;
+- Text-only is already good enough and audio conditioning adds < 0.5%: present audio conditioning as a robustness improvement, not a main contribution.
+Red-light conditions:
+- the gap between the Fixed Q-Former and the Text+Audio conditioned Q-Former is < 0.5% and no subset benefits: conditioning is ineffective; consider CLIP visual features or fall back to the alternative paper narrative.
+---
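+### Milestone 4: Add SAM Refinement
+Implement:
+- bbox prompt refinement;
+- bbox + semantic prompt refinement;
+- bbox + mask prompt as a control.
+Goal: demonstrate the contribution of refinement and settle the default scheme.
+Green-light conditions:
+- bbox prompt refinement beats w/o refinement by ≥ 2% on J;
+- the gap between Oracle Refined J/F and the actual TubeToken-Full J/F is ≤ 10%;
+- bbox + mask prompt is not significantly better than bbox-only.
+Yellow-light conditions:
+- refinement gain < 1%: demote SAM refinement to an optional module and shift the paper's focus back to tube selection.
+Red-light conditions:
+- bbox + mask prompt is significantly better than bbox-only, and the gap comes from the mask prompt's GT-quality dependency: proposal mask quality is insufficient, so return to Milestone 1 and revise the proposals.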
+---
+### Milestone 5: Full experiments and paper analysis
+Complete:
+- main table;
+- ablations;
+- hard subsets;
+- error decomposition;
+- efficiency;
+- equal-compute comparison;
+- visualizations;
+- failure cases;
+- reliability diagram / threshold sensitivity.
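+## 13. Risks and mitigations
+| Risk | Severity | Mitigation |
+|---|---|---|
+| Ref-AVSBench lacks multi-expression structure | Critical | do not present H3 as a main contribution; fall back to learned tube compression / proposal-conditioned instance grounding |
+| SimToken reproduction deviates too much from the official numbers | High | first locate training, input, and evaluation differences; all later Go/No-Go decisions use the reproduced number |
+| Gradient conflict in multi-expression training | Medium-high | use gradient accumulation to accumulate the gradients of different expressions separately; sample expression pairs with small semantic differences early, then introduce cross-modality pairs once training is stable |
+| SimToken + matched compute implementation is challenged | High | fix it before the experiments as multiple keyframe prompting with the TubeToken-Fast keyframe budget, leaving no room for post-hoc selection |
+| Multi-expression efficiency is misread as rerunning proposals for every expression | Medium | report proposal per-video cost, amortized proposal cost per expression, and incremental expression cost |
+| Recall@32 below 80% | Critical | increase the number of proposals, introduce a detector, use a hybrid fallback |
+| Oracle Tube J/F not above reproduced SimToken | Critical | pause the TubeToken main line; change refinement, high-resolution features, or the proposal method, or fall back to EC-SimToken |
+| Oracle Refined J/F defined unfairly | High | fix it as oracle proposal bbox-only, with no GT mask prompt |
+| SimToken + SAM2 proposals control is too weak | High | use zero-parameter \(F_{seg}\) reranking and publish the formula |
+| TubeToken-Minimal beats SimToken but not SimToken + SAM2 proposals | High | proposals are the main contribution; strengthen the tube selector or adjust the paper narrative |
+| Small gap between the learned reranker and TubeToken-Minimal | Medium-high | the null tube contributes little; downgrade Null-related claims |
+| \(\mathcal{L}_{cond}\) is ill-defined | Medium-high | removed by default; if used, define it separately and run a with/without ablation |
+| Null tube is unstable | Medium-high | 25% Null oversampling + weighted CE curriculum; report sensitivity to the sampling ratio |
+| Null oversampling is too strong and positives are misjudged as Null | High | monitor Positive FNR and GT Top-3 but Null Top-1 Rate |
+| Conditioning brings only a small gain | High | strengthen the diagnostic subsets, \(\widehat{AC}_{tube}\), H3 CosSim, and the fixed Q-Former control |
+| No clear difference in H3 CosSim | High | do not emphasize expression-conditioned summarization; emphasize learned compression or the selection architecture instead |
+| TubeToken compute is too high | High | report Fast/Balanced/Accuracy and the matched-compute baseline |
+| Refinement gain is not clear | Medium | shift the focus to selection accuracy and hard cases; keep refinement as an optional module |
+| Self-attention contributes nothing | Low | remove self-attention and use a simpler selector |
+| Attention maps are not interpretable | Medium-high | re-diagnose with \(\widehat{AC}_{tube}\), query grouping, and H3 CosSim |
+| Tight coupling to SAM2 engineering | Medium | state clearly that the core contribution is tube-level text/audio selection, not proposal generation |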
+## 14. Experiment priorities
+### P0: Must complete
+1. SimToken reproduction and analysis of differences from the official results;
+2. Multi-expression audit;
+3. Proposal Recall@N;
+4. Oracle Tube J/F and bbox-only Oracle Refined J/F;
+5. TubeToken-Minimal vs SimToken;
+6. TubeToken-Minimal vs SimToken + SAM2 proposals;
+7. SAM2 proposals + learned reranker (no null tube);
+8. TubeToken-Fast vs SimToken + matched compute (fixed as multiple keyframe prompting);
+9. Null tube ablation;
+10. mask-area threshold Null baseline;
+11. Null oversampling ratio ablation;
+12. fixed Q-Former vs text+audio conditioned Q-Former;
+13. \(\widehat{AC}_{tube}\);
+14. H3 cross-expression CosSim (if supported by the multi-expression audit; otherwise demoted to P2);
+15. Error decomposition;
+16. GT Top-3 but Null Top-1 Rate;
+17. Efficiency table.
+---
+### P1: Strongly recommended
+1. late-target subset;
+2. strict audio-critical subset;
+3. same-category distractor subset;
+4. threshold sensitivity;
+5. conditioning attention visualization;
+6. H3 cross-expression visualization;
+7. self-attention ablation;
+8. Reliability Diagram;
+9. same-category vs cross-category audio swap;
+10. audio amplitude zeroed, temporal length preserved.
+---
+### P2: If time permits
+1. audio shuffled;
+2. cross-dataset validation, e.g., AVSBench / MeViS;
+3. frame-level existence;
+4. open-vocabulary detector assisted proposals;
+5. manual hard negative benchmark;
+6. hybrid fallback with EC-SimToken;
+7. optional \(\mathcal{L}_{cond}\) attention supervision.
+## 15. Expected paper narrative
+### 15.1 Normal narrative: when H3 holds
+If the multi-expression audit, multi-expression training, H3 CosSim, and \(\widehat{AC}_{tube}\) all support H3, the suggested main line of the paper is:
+> Existing Ref-AVS methods often compress multimodal evidence into a global semantic token, implicitly coupling existence judgment, instance grounding, and frame-level segmentation. We find that this implicit coupling becomes fragile in samples requiring instance-level comparison, temporal coverage, explicit null reasoning, or expression-dependent temporal evidence. We therefore formulate Ref-AVS as text-audio conditioned object-tube retrieval followed by mask refinement. Based on this view, we propose TubeToken, which constructs candidate object tubes, summarizes each tube with expression-conditioned temporal evidence, selects the referred tube through multimodal reasoning, handles Null cases via a learnable null tube, and refines the selected tube with SAM.
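+The Introduction should include data-driven motivation, for example:
+- how much SimToken drops on the same-category distractor subset;
+- how much SimToken drops on the late-target subset;
+- how much performance drops on the audio-critical subset when audio is removed;
+- whether Null false positives cluster in a particular type of sample;
+- the CosSim difference between the fixed and the conditioned Q-Former on the H3 subset.
+This upgrades the narrative from “we believe a global token is bad” to “we use diagnostic data to show that a global token has systematic weaknesses”.
+### 15.2 Fallback narrative: when H3 is weak
+If the dataset lacks sufficient multi-expression structure, or the conditioned Q-Former's H3 CosSim / \(\widehat{AC}_{tube}\) evidence is insufficient, avoid claiming “expression-conditioned evidence summarization”. The suggested alternative is: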
+> We formulate Ref-AVS as proposal-conditioned instance grounding with explicit null reasoning. TubeToken improves robustness by decomposing global segmentation into candidate object tube construction, learned tube selection, null-aware existence modeling, and optional mask refinement.
+In this case the paper's main contributions become:
+1. candidate object tube formulation;
+2. explicit null tube / existence modeling;
+3. fairness-controlled comparison with SimToken + SAM2 proposals and matched compute;
+4. diagnostic error decomposition;
+5. optional learned compression rather than a strong conditioning claim.
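+## 16. Minimum acceptable conclusion criteria
+If the final results meet the following conditions, they can support a complete paper:
+1. the SimToken reproduction is credible, and all key comparisons are based on reproduced SimToken;
+2. Recall@32 or Recall@64 is sufficiently high and Oracle Tube J/F is clearly above reproduced SimToken, showing that proposals are not an unacceptable bottleneck;
+3. Oracle Refined J/F uses a bbox-only prompt and is clearly above Oracle Tube J/F, showing that refinement has attainable gains;
+4. TubeToken is no more than 2 points below SimToken on Seen / Unseen / Mix; if it is only on par on the main sets, it must show significant gains on the Null, late-target, same-category, and audio-critical subsets, backed by a three-way argument on efficiency, robustness, and interpretability;
+5. TubeToken-Fast beats SimToken + matched compute (multiple keyframe prompting) at a comparable compute budget;
+6. TubeToken-Minimal beats SimToken + SAM2 proposals, showing that the tube selection framework itself is effective;
+7. the comparison between SAM2 proposals + learned reranker (no null tube) and TubeToken-Minimal explains the separate contributions of the selector and the null tube;
+8. the fixed Q-Former is clearly weaker than the text+audio conditioned Q-Former;
+9. if H3 is claimed, all of the following must hold: the multi-expression audit supports it, multi-expression training is effective, the Fixed Q-Former CosSim is \(\equiv 1.0\) while the conditioned CosSim is significantly below 1.0, and \(\widehat{AC}_{tube}\) improves;
+10. the null tube is clearly better than the mask-area threshold and the binary existence head;
+11. Null oversampling does not cause an unacceptable rise in Positive FNR or GT Top-3 but Null Top-1 Rate;
+12. error decomposition clearly shows whether the main failures come from proposal miss, selection error, refinement error, Null FP/FN, or Null calibration;
+13. although the compute cost may be higher, the Fast/Balanced/Accuracy settings show a reasonable compute-performance trade-off.
+If point 2 does not hold, fall back to the EC-SimToken route promptly to avoid over-investing in a low-recall TubeToken. If point 9 does not hold, keep the TubeToken framework but reduce the weight of CondQFormer / H3 in the paper.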
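+## 17. Final execution recommendations
+The recommended order is:
+1. **Start with Phase -1: SimToken reproduction + multi-expression audit.**
+   This is the prerequisite for all Go/No-Go conditions and for deciding whether the H3 narrative holds.
+2. **Then Phase 0: proposal recall + bbox-only Oracle Tube / Refined J/F.**
+   This is the hard prerequisite for TubeToken to be viable, and Oracle Refined J/F must match the actual refinement setting.
+3. **Then the Milestone 2 fairness controls.**
+   TubeToken-Minimal, SimToken + SAM2 proposals with zero-parameter reranking, SAM2 proposals + learned reranker (no null tube), and SimToken + matched compute (multiple keyframe prompting) must all be completed together.
+4. **Add CondQFormer only after the tube framework is confirmed to work.**
+   If multi-expression data is sufficient, multi-expression training and H3 CosSim must be added at the same time; if not, do not present H3 as a main contribution.
+5. **Add refinement last.**
+   Refinement is a performance enhancement and should not be the sole pillar of the paper narrative. If bbox-only refinement gains little, demote it to an optional module.
+This path minimizes risk: if proposal recall or the oracle upper bound is unsatisfactory, switch back to EC-SimToken promptly; if TubeToken-Minimal already shows a clear advantage, further investment in the full TubeToken is justified; if H3 validation is insufficient, keep the tube-level retrieval contribution while revising the CondQFormer narrative.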
+---
+## Appendix A. Reviewer suggestion checklist
+| Reviewer suggestion | Location in v3 | Status |
+|---|---|---|
+| Add direct H3 validation; AC alone is not enough | 1.2, 3.2.6, 8.1.1, 10.5, 12 | Done |
+| Check the dataset's multi-expression structure | 3.1, 3.1.1, Phase -1 | Done |
+| CondQFormer explicitly uses multi-expression training | 6.3, 12 Milestone 3 | Done |
+| Go/No-Go uses reproduced SimToken rather than numbers of unclear origin | 4.0, 4.4, 12 | Done |
+| Oracle Refined J/F uses a bbox-only prompt, not a GT mask | 4.2.1, 4.3, 10.2 | Done |
+| SimToken + SAM2 proposals uses zero-parameter reranking | 5.3.1 | Done |
+| Add SAM2 proposals + learned reranker (no null tube) | 5.1, 5.2, 5.3.2, 10.1, 10.3, 12 | Done |
+| Remove or define the dangling \(\mathcal{L}_{cond}\) | 6.2, 6.2.1 | Done |
+| Specify the Null oversampling ratio | 6.4, 14 | Done |
+| Add the GT Top-3 but Null Top-1 error type | 7.2, 7.3, 10.4 | Done |
+| Use the normalized \(\widehat{AC}_{tube}\) | 7.2, 8.1.2, 10.5 | Done |
+| Add the audio amplitude zeroed control experiment | 8.2, 14 | Done |
+| Fix the missing column for Late-target in the Error decomposition table | 10.4 | Done |
+| Add TubeToken-Minimal to the main table | 10.1 | Done |
+| Write green / yellow / red light conditions for each Milestone | 12 | Done |
+| Add a narrative fallback plan | 15.2, 16, 17 | Done |
+| Fix a single implementation for SimToken + matched compute | 5.3.3, 9.3, 10.2, 12 | Done in v4 |
+| Change the third Phase 0 red-light condition to an observable quantity | 4.4.3, 12 Milestone 1 | Done in v4 |
+| Fixed Q-Former CosSim baseline is exactly 1.0 | 1.2, 8.1.1, 10.6, 16 | Done in v4 |
+| Add the multi-expression training gradient conflict risk | 6.3, 13 | Done in v4 |
+| Slim the main table; move fairness controls to a separate table | 10.1, 10.2 | Done in v4 |
+| Add multi-expression proposal amortization efficiency | 9.1, 9.4 | Done in v4 |
+| Selection Acc@3 excludes the null tube | 7.2 | Done in v4 |
+| Error decomposition uses a mutually exclusive priority | 7.3 | Done in v4 |
+| Phase -1 Go/No-Go spells out the SimToken reproduction and H3 audit branches | 12 Phase -1 | Done in v4 |

TubeToken_Phase0_Experiment_Log.md ADDED Viewed

	@@ -0,0 +1,284 @@

+# TubeToken Phase -1 / Phase 0 Experiment Log
+This document records the actual experiment progress, observations, and next actions for the TubeToken v4 plan.
+## Phase -1 Summary
+### Data Audit
+Audit output:
+```text
+Expressions: 20459
+Videos: 3574
+Objects (vid, fid): 7461
+Splits: val 1349, train 14113, test_s 2288, TODO 25, test_u 1656, test_n 1028
+Expressions/video mean: 5.724
+Expressions/video median: 6.0
+Videos with >=2 expressions: 3521
+Expressions/object mean: 2.742
+Objects with >=2 expressions: 5836
+H3 candidate objects: 5781
+H3 candidate expressions: 18614
+Null split expressions: 1028 (5.02%)
+Audio-keyword expressions: 15890 (77.67%)
+Spatial-keyword expressions: 5924 (28.96%)
+Same-category distractor heuristic expressions: 2563 (12.53%)
+Small-target expressions: 10037
+Partial-target expressions: 33
+Area-unstable expressions: 41
+Late-target expressions: 0
+```
+Decision:
+- Multi-expression structure is strong.
+- H3 direct validation remains a P0 target.
+- Null modeling is feasible but needs oversampling / curriculum because Null ratio is only about 5%.
+- Small-target proposal recall is a major risk.
+- Late-target subset is not useful under the current GT visibility definition (the audit found 0 late-target expressions).
+### SimToken Reproduction
+Reproduced results:
+```text
+test_seen:
+  mIoU = 0.7189123889
+  F    = 0.8113823722
+  J&F  = 0.7651473806
+test_unseen:
+  mIoU = 0.6996124670
+  F    = 0.7915967433
+  J&F  = 0.7456046051
+test_n:
+  S = 0.0117917573
+```
+Paper/report result:
+```text
+Seen:   J 72.0, F 81.3, J&F 76.7
+Unseen: J 69.8, F 79.1, J&F 74.5
+Mix:    J 70.9, F 80.2, J&F 75.6
+Null S: 0.012
+```
+Decision:
+- SimToken reproduction passes Phase -1.
+- Difference from the report is at most about 0.2 J&F (76.5 vs 76.7 on Seen, 74.6 vs 74.5 on Unseen), far below the 1.5 J&F pause threshold.
+- Later Go/No-Go thresholds should use reproduced SimToken as the reference.
+Working Phase 0 reference:
+```text
+SimToken seen J&F   = 0.7651
+SimToken unseen J&F = 0.7456
+Seen/unseen average = 0.7554
+Target Oracle Tube J&F for green light ~= 0.8054
+```
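The working reference above follows from the reproduced numbers by simple arithmetic. A short sketch, assuming the green-light margin of "+5%" is applied as 0.05 absolute J&F:

```python
# Reproduced SimToken J&F on the two positive test splits (values from the log above).
seen_jf = 0.7651473806
unseen_jf = 0.7456046051

reference = (seen_jf + unseen_jf) / 2   # seen/unseen average
target = reference + 0.05               # green-light target for Oracle Tube J&F

print(f"Seen/unseen average   = {reference:.4f}")
print(f"Oracle Tube J&F target = {target:.4f}")
```

The average is unweighted over the two splits; weighting by split size would give a slightly different reference.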
+## Phase 0 Proposal Experiments
+### Implementation Notes
+Scripts added:
+```text
+tools/tubetoken/phase0_common.py
+tools/tubetoken/generate_sam2_proposals.py
+tools/tubetoken/evaluate_phase0_proposals.py
+tools/tubetoken/evaluate_oracle_refine_sam2.py
+```
+SAM2 proposal generation uses:
+- SAM2 automatic mask generation on keyframes.
+- SAM2 video propagation to form tubes.
+- Cache format: one `.npz` per video with `masks`, `scores`, `keyframes`, and `boxes_xyxy`.
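For orientation, the sketch below writes and reads a toy cache with the four keys listed above and computes a Recall@N hit using the GT-visible-frame mean IoU ≥ 0.5 criterion. The array shapes (`masks` as `(N, T, H, W)`, `boxes_xyxy` as `(N, 4)`), the helper names, and the toy data are assumptions for illustration; the real cache layout and the logic in `evaluate_phase0_proposals.py` may differ.

```python
import os
import tempfile

import numpy as np


def visible_frame_mean_iou(tube, gt):
    """Mean IoU over frames where the GT mask is non-empty. tube, gt: (T, H, W) bool."""
    visible = gt.reshape(gt.shape[0], -1).any(axis=1)
    if not visible.any():
        return 0.0
    inter = (tube & gt)[visible].reshape(int(visible.sum()), -1).sum(axis=1)
    union = (tube | gt)[visible].reshape(int(visible.sum()), -1).sum(axis=1)
    return float((inter / union).mean())


def recall_at_n(masks, scores, gt, n, iou_thr=0.5):
    """True if any of the top-n tubes (by score) reaches the IoU threshold."""
    top = np.argsort(-scores)[:n]
    return any(visible_frame_mean_iou(masks[i], gt) >= iou_thr for i in top)


# Toy data: 2 tubes, 2 frames, 4x4 pixels; the GT is visible only in frame 0.
gt = np.zeros((2, 4, 4), dtype=bool)
gt[0, :2, :2] = True
masks = np.zeros((2, 2, 4, 4), dtype=bool)
masks[0, 0, 2:, 2:] = True     # tube 0: wrong region, highest score
masks[1, 0, :2, :2] = True     # tube 1: matches the GT
scores = np.array([0.9, 0.8], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "video.npz")
    np.savez_compressed(path, masks=masks, scores=scores,
                        keyframes=np.array([0]), boxes_xyxy=np.zeros((2, 4)))
    with np.load(path) as cache:
        loaded_masks = cache["masks"].astype(bool)
        loaded_scores = cache["scores"]

hit_at_1 = recall_at_n(loaded_masks, loaded_scores, gt, n=1)
hit_at_2 = recall_at_n(loaded_masks, loaded_scores, gt, n=2)
print(hit_at_1, hit_at_2)
```

In the toy case the matching tube has the second-highest score, so it is missed at N=1 and recovered at N=2, which is the truncation effect that Recall@N is meant to expose.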
+Important implementation correction:
+- Initial unidirectional propagation was invalid for Phase 0 because proposals from later keyframes were not truly propagated backward.
+- Bidirectional propagation was added.
+- Group-by-keyframe propagation was tested but performed slightly worse than shared-state bidirectional propagation on smoke evaluation.
+### Smoke Results
+#### Unidirectional Smoke, stride=8, N=128, 5 videos
+Result:
+```text
+all:    R@16=0.800, R@32=0.900, R@64=1.000, R@128=1.000, Oracle J&F=0.9577
+small:  R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9798
+test_s: R@16=0.700, R@32=0.850, R@64=1.000, R@128=1.000, Oracle J&F=0.9743
+test_u: R@16=1.000, R@32=1.000, R@64=1.000, R@128=1.000, Oracle J&F=0.9244
+```
+Interpretation:
+- Code path worked, but the sample was too small and optimistic.
+#### Shared-state Bidirectional Smoke, stride=8, N=64, 30 videos
+Result:
+```text
+all:             n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
+audio_keyword:   n=130, R@16=0.738, R@32=0.923, R@64=0.977, Oracle J&F=0.9214, miss=2.31%
+h3_candidate:    n=163, R@16=0.718, R@32=0.883, R@64=0.951, Oracle J&F=0.9080, miss=4.91%
+small:           n=51,  R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9654, miss=0.00%
+spatial_keyword: n=14,  R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.9106, miss=0.00%
+test_s:          n=43,  R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8409, miss=18.60%
+test_u:          n=120, R@16=0.750, R@32=0.950, R@64=1.000, Oracle J&F=0.9321, miss=0.00%
+```
+Interpretation:
+- Bidirectional propagation fixed the small smoke behavior.
+- However, `test_s` remained much weaker than `test_u`.
+- Full validation was required before making a Phase 0 decision.
+#### Group-by-keyframe Bidirectional Smoke, stride=8, N=64, 30 videos
+Result:
+```text
+all:             n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
+audio_keyword:   n=130, R@16=0.738, R@32=0.877, R@64=0.931, Oracle J&F=0.9138, miss=6.92%
+h3_candidate:    n=163, R@16=0.718, R@32=0.847, R@64=0.914, Oracle J&F=0.9024, miss=8.59%
+small:           n=51,  R@16=0.647, R@32=0.882, R@64=1.000, Oracle J&F=0.9695, miss=0.00%
+spatial_keyword: n=14,  R@16=0.500, R@32=0.929, R@64=1.000, Oracle J&F=0.8945, miss=0.00%
+test_s:          n=43,  R@16=0.628, R@32=0.698, R@64=0.814, Oracle J&F=0.8416, miss=18.60%
+test_u:          n=120, R@16=0.750, R@32=0.900, R@64=0.950, Oracle J&F=0.9241, miss=5.00%
+```
+Decision:
+- Group-by-keyframe is worse than shared-state bidirectional for recall.
+- Use shared-state bidirectional as the current best SAM2 propagation setting.
+### Full Results: stride=8, N=64
+Full shared-state bidirectional result:
+```text
+all:             n=3944, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7491, miss=24.62%
+area_unstable:   n=18,   R@16=0.556, R@32=0.556, R@64=0.889, Oracle J&F=0.7114, miss=11.11%
+audio_keyword:   n=2844, R@16=0.475, R@32=0.610, R@64=0.766, Oracle J&F=0.7569, miss=23.42%
+h3_candidate:    n=3932, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7488, miss=24.64%
+partial:         n=8,    R@16=0.250, R@32=0.250, R@64=1.000, Oracle J&F=0.8123, miss=0.00%
+same_category:   n=330,  R@16=0.482, R@32=0.588, R@64=0.709, Oracle J&F=0.7261, miss=29.09%
+small:           n=1631, R@16=0.237, R@32=0.392, R@64=0.633, Oracle J&F=0.6367, miss=36.73%
+spatial_keyword: n=965,  R@16=0.331, R@32=0.476, R@64=0.658, Oracle J&F=0.6714, miss=34.20%
+test_s:          n=2288, R@16=0.326, R@32=0.483, R@64=0.657, Oracle J&F=0.6674, miss=34.27%
+test_u:          n=1656, R@16=0.665, R@32=0.755, R@64=0.887, Oracle J&F=0.8618, miss=11.29%
+```
+Decision:
+- `stride=8, N=64` is a Phase 0 red-light configuration.
+- It fails the v4 Go/No-Go criteria:
+  - Overall Recall@32 (59.7%) is below 85%.
+  - Overall Recall@64 (75.4%) is below 80%.
+  - Small-target Recall@32 (39.2%) is far below 70%.
+  - Overall Oracle Tube J&F (0.7491) is below the green-light target of reproduced SimToken + 5 points (about 0.8054).
+  - `test_s` Oracle J&F (0.6674) is far below reproduced SimToken seen J&F (0.7651).
+- Do not proceed to TubeToken-Minimal with this proposal cache.
+Main bottleneck:
+- Proposal recall, especially for `test_s`, small targets, and spatial expressions.
+- Bidirectional propagation does not fix the full-set misses, so the likely cause is candidate generation, ranking, or keyframe coverage rather than temporal direction alone.
+## Next Experiment
+### Goal
+Determine whether the red-light result is caused by top-64 truncation or by missing proposals at generation time.
+### Step 1: Export R@64 Miss Video List
+Command:
+```bash
+cd /workspace/SimToken
+conda activate simtoken
+python - <<'PY'
+import csv
+from pathlib import Path
+src = Path("runs/tubetoken_phase0/eval_stride8_n64_bidir/sample_metrics.csv")
+out = Path("runs/tubetoken_phase0/miss_videos_r64.txt")
+vids = set()
+with src.open() as f:
+    for r in csv.DictReader(f):
+        if r["recall@64"] != "True":
+            vids.add(r["vid"])
+out.write_text("\n".join(sorted(vids)) + "\n")
+print("miss videos:", len(vids))
+print("wrote:", out)
+PY
+```
+### Step 2: Test N=128 on Miss Videos
+Command:
+```bash
+mkdir -p runs/tubetoken_phase0/proposals_stride8_n128_miss
+python tools/tubetoken/generate_sam2_proposals.py \
+  --data_dir /workspace/SimToken/data \
+  --out_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride8_n128_miss \
+  --video_list /workspace/SimToken/runs/tubetoken_phase0/miss_videos_r64.txt \
+  --splits test_s,test_u \
+  --sam2_repo /workspace/sam2 \
+  --model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
+  --checkpoint /workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
+  --stride 8 \
+  --max_tubes 128 \
+  --device cuda \
+  --amp_dtype bf16 \
+  --quiet_sam2 \
+  --no_group_by_keyframe \
+  2>&1 | tee runs/tubetoken_phase0/proposals_stride8_n128_miss.log
+```
+Evaluate:
+```bash
+mkdir -p runs/tubetoken_phase0/eval_stride8_n128_miss
+python tools/tubetoken/evaluate_phase0_proposals.py \
+  --data_dir /workspace/SimToken/data \
+  --proposal_dir /workspace/SimToken/runs/tubetoken_phase0/proposals_stride8_n128_miss \
+  --out_dir /workspace/SimToken/runs/tubetoken_phase0/eval_stride8_n128_miss \
+  --audit_csv /workspace/SimToken/runs/tubetoken_phase_minus1/audit_full/audit_samples.csv \
+  --splits test_s,test_u \
+  --video_list /workspace/SimToken/runs/tubetoken_phase0/miss_videos_r64.txt \
+  --recall_ns 16,32,64,128 \
+  2>&1 | tee runs/tubetoken_phase0/eval_stride8_n128_miss.log
+```
+Report:
+```bash
+cat runs/tubetoken_phase0/eval_stride8_n128_miss/report.md
+```
+Expected decision:
+- If `R@128` on miss videos improves strongly, run full `stride=8, N=128`.
+- If `R@128` remains low, candidate count is not the main issue; next test should increase keyframe coverage with `stride=4`.
+- If `stride=4` remains low, move to detector-assisted proposals or high-resolution proposal generation before TubeToken-Minimal.
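
The branching in the Expected decision list can be written down as a small helper. This is only a sketch under stated assumptions: the function name `next_step` and the `strong_gain` cutoff are hypothetical, because the log does not say what counts as "improves strongly" or "remains low", and it does not say what to do if `stride=4` succeeds.

```python
def next_step(r128_miss, strong_gain, r128_stride4=None):
    """Map miss-video recall to the next experiment named in the log.

    r128_miss:     R@128 measured on the miss videos with stride=8.
    strong_gain:   hypothetical cutoff for "improves strongly" (not in the log).
    r128_stride4:  R@128 on the same videos with stride=4, or None if not run yet.
    """
    if r128_miss >= strong_gain:
        # Top-64 truncation was the main issue.
        return "run full stride=8, N=128"
    if r128_stride4 is None:
        # Candidate count is not the main issue; test keyframe coverage next.
        return "test stride=4"
    if r128_stride4 < strong_gain:
        return "detector-assisted or high-resolution proposals"
    # The log does not specify this branch.
    return "stride=4 helps; follow-up not specified in the log"
```

Passing the cutoff as a required argument keeps the undefined threshold visible at every call site instead of hiding a guessed default in the helper.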

__pycache__/load_model.cpython-312.pyc ADDED Viewed

Binary file (21.2 kB).

load_model.py CHANGED Viewed

@@ -208,7 +208,7 @@ def collate_fn(batch, tokenizer=None):
 import torch.multiprocessing as mp
 if __name__ == "__main__":
-    mp.set_start_method("spawn")
     set_seed(42)
     tokenizer = transformers.AutoTokenizer.from_pretrained(
         args.mllm,
@@ -224,14 +224,15 @@ if __name__ == "__main__":
     print("seg_token_idx: ", seg_token_idx)
-    val_dataset_s = REFAVS('test_s', args, tokenizer, input_type='refer')
-    # val_dataset_u = REFAVS('test_u', args, tokenizer, input_type='refer')
-    # val_dataset_n = REFAVS('test_n', args, tokenizer, input_type='refer')
-    val_dataloader_s = DataLoader(val_dataset_s, batch_size=1, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer))
-    # val_dataloader_u = DataLoader(val_dataset_u, batch_size=1, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer))
-    # val_dataloader_n = DataLoader(val_dataset_n, batch_size=2, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer))
@@ -449,24 +450,25 @@ if __name__ == "__main__":
         for batch in tqdm(dataloader, desc=f"Evaluating on Null"):
             input_dict = dict_to_cuda(batch)
-            with torch.no_grad():
-                output_dict = model.forward(images=input_dict["images"],
-                                            images_clip=input_dict["images_clip"],
-                                            audio_features=input_dict["audio_feats"],
-                                            image_features=input_dict["image_feats"],
-                                            input_ids=input_dict["input_ids"],
-                                            labels=input_dict["labels"],
-                                            attention_masks=input_dict["attention_masks"],
-                                            masks_list=input_dict["masks"],
-                                            resize_list=input_dict["resizes"],
-                                            orgsize_list=input_dict["orgsizes"],
-                                            conversation_list=input_dict["convs"],
-                                            refs_num=input_dict["refs_num"],
-                                            fids=input_dict["fids"],
-                                            vids=input_dict["vids"],
-                                            contrast=args.ct_weight,
-                                            ref_ids=input_dict["ref_ids"],
-                                            inference=True)
             pred_masks = output_dict["pred_masks"]  # list[B]:[num_seg, T, H, W]
             gt_masks = output_dict["gt_masks"]  # list[B]:[num_seg, T, H, W]
             for i in range(len(pred_masks)):
@@ -482,9 +484,9 @@ if __name__ == "__main__":
-    valuate(model, val_dataloader_s, 'test_seen')
-    # valuate(model, val_dataloader_u, 'test_unseen')
-    #
-    # valuate_Null(model, val_dataloader_u)

 import torch.multiprocessing as mp
 if __name__ == "__main__":
+    mp.set_start_method("spawn", force=True)
     set_seed(42)
     tokenizer = transformers.AutoTokenizer.from_pretrained(
         args.mllm,
     print("seg_token_idx: ", seg_token_idx)
+    eval_splits = {split.strip() for split in args.eval_splits.split(",") if split.strip()}
+    val_dataset_s = REFAVS('test_s', args, tokenizer, input_type='refer') if 'test_s' in eval_splits else None
+    val_dataset_u = REFAVS('test_u', args, tokenizer, input_type='refer') if 'test_u' in eval_splits else None
+    val_dataset_n = REFAVS('test_n', args, tokenizer, input_type='refer') if 'test_n' in eval_splits else None
+    val_dataloader_s = DataLoader(val_dataset_s, batch_size=1, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer)) if val_dataset_s is not None else None
+    val_dataloader_u = DataLoader(val_dataset_u, batch_size=1, shuffle=False, num_workers=4, collate_fn=partial(collate_fn, tokenizer=tokenizer)) if val_dataset_u is not None else None
+    val_dataloader_n = DataLoader(val_dataset_n, batch_size=1, shuffle=False, num_workers=0, collate_fn=partial(collate_fn, tokenizer=tokenizer)) if val_dataset_n is not None else None
         for batch in tqdm(dataloader, desc=f"Evaluating on Null"):
             input_dict = dict_to_cuda(batch)
+            with torch.cuda.amp.autocast(dtype=torch.bfloat16, enabled=True):
+                with torch.no_grad():
+                    output_dict = model.forward(images=input_dict["images"],
+                                                images_clip=input_dict["images_clip"],
+                                                audio_features=input_dict["audio_feats"],
+                                                image_features=input_dict["image_feats"],
+                                                input_ids=input_dict["input_ids"],
+                                                labels=input_dict["labels"],
+                                                attention_masks=input_dict["attention_masks"],
+                                                masks_list=input_dict["masks"],
+                                                resize_list=input_dict["resizes"],
+                                                orgsize_list=input_dict["orgsizes"],
+                                                conversation_list=input_dict["convs"],
+                                                refs_num=input_dict["refs_num"],
+                                                fids=input_dict["fids"],
+                                                vids=input_dict["vids"],
+                                                contrast=args.ct_weight,
+                                                ref_ids=input_dict["ref_ids"],
+                                                inference=True)
             pred_masks = output_dict["pred_masks"]  # list[B]:[num_seg, T, H, W]
             gt_masks = output_dict["gt_masks"]  # list[B]:[num_seg, T, H, W]
             for i in range(len(pred_masks)):
+    if val_dataloader_s is not None:
+        valuate(model, val_dataloader_s, 'test_seen')
+    if val_dataloader_u is not None:
+        valuate(model, val_dataloader_u, 'test_unseen')
+    if val_dataloader_n is not None:
+        valuate_Null(model, val_dataloader_n)
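
The new split selection relies on a set comprehension over `args.eval_splits`. The sketch below isolates that expression to show how it treats whitespace and trailing commas. The wrapper name `parse_eval_splits` is hypothetical, and the diff does not show how the `eval_splits` argument is declared, so the input strings here are illustrative only.

```python
def parse_eval_splits(raw):
    """Split a comma-separated string into a set of stripped, non-empty names."""
    return {split.strip() for split in raw.split(",") if split.strip()}


# Whitespace and empty items (such as a trailing comma) are dropped.
selected = parse_eval_splits("test_s, test_u,")
print(sorted(selected))

# A split that is not listed is skipped, which is what leaves its
# dataset and dataloader set to None in the script.
print("test_n" in selected)
```

Because the result is a set, repeated names collapse to one entry and the order in which splits are listed has no effect on which evaluations run.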

runs/tubetoken_phase0/eval_stride8_n64_bidir/report.md ADDED Viewed

@@ -0,0 +1,12 @@

+# TubeToken Phase 0 Proposal Evaluation
+- all: n=3944, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7491, miss=24.62%
+- area_unstable: n=18, R@16=0.556, R@32=0.556, R@64=0.889, Oracle J&F=0.7114, miss=11.11%
+- audio_keyword: n=2844, R@16=0.475, R@32=0.610, R@64=0.766, Oracle J&F=0.7569, miss=23.42%
+- h3_candidate: n=3932, R@16=0.469, R@32=0.597, R@64=0.754, Oracle J&F=0.7488, miss=24.64%
+- partial: n=8, R@16=0.250, R@32=0.250, R@64=1.000, Oracle J&F=0.8123, miss=0.00%
+- same_category: n=330, R@16=0.482, R@32=0.588, R@64=0.709, Oracle J&F=0.7261, miss=29.09%
+- small: n=1631, R@16=0.237, R@32=0.392, R@64=0.633, Oracle J&F=0.6367, miss=36.73%
+- spatial_keyword: n=965, R@16=0.331, R@32=0.476, R@64=0.658, Oracle J&F=0.6714, miss=34.20%
+- test_s: n=2288, R@16=0.326, R@32=0.483, R@64=0.657, Oracle J&F=0.6674, miss=34.27%
+- test_u: n=1656, R@16=0.665, R@32=0.755, R@64=0.887, Oracle J&F=0.8618, miss=11.29%
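
The report values can be compared against the numeric thresholds quoted in the experiment log's Decision bullets. The following is a minimal sketch, assuming those thresholds are minimums. The names `report`, `criteria`, and `failed_criteria` are hypothetical, and the two criteria that depend on the SimToken baseline are left out because that baseline is not given in this section.

```python
# Values copied from the report (stride=8, N=64, shared-state bidirectional).
report = {
    "all": {"recall@32": 0.597, "recall@64": 0.754},
    "small": {"recall@32": 0.392},
}

# (subset, metric, minimum) as quoted in the log's Decision bullets.
criteria = [
    ("all", "recall@32", 0.85),
    ("all", "recall@64", 0.80),
    ("small", "recall@32", 0.70),
]


def failed_criteria(report, criteria):
    """Return every (subset, metric, value, minimum) whose value is below the minimum."""
    return [
        (subset, metric, report[subset][metric], minimum)
        for subset, metric, minimum in criteria
        if report[subset][metric] < minimum
    ]


failures = failed_criteria(report, criteria)
for subset, metric, value, minimum in failures:
    print(f"{subset} {metric}: {value:.3f} < {minimum:.2f}")
print("red light" if failures else "green light")
```

With the numbers above, all three checks fail, which matches the red-light decision recorded in the log.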

runs/tubetoken_phase0/eval_stride8_n64_bidir/sample_metrics.csv ADDED Viewed

The diff for this file is too large to render.

runs/tubetoken_phase0/eval_stride8_n64_bidir/summary.json ADDED Viewed

@@ -0,0 +1,132 @@

+{
+  "all": {
+    "count": 3944,
+    "oracle_f": 0.7780505165622835,
+    "oracle_iou_all": 0.7200560851848016,
+    "oracle_iou_visible": 0.7204684844627691,
+    "oracle_j": 0.7200560864466854,
+    "oracle_jf": 0.749053301504484,
+    "proposal_miss": 971,
+    "proposal_miss_percent": 24.61967545638945,
+    "recall@16": 0.4685598377281947,
+    "recall@32": 0.5973630831643002,
+    "recall@64": 0.7538032454361054
+  },
+  "area_unstable": {
+    "count": 18,
+    "oracle_f": 0.7769002698555225,
+    "oracle_iou_all": 0.6459361637632052,
+    "oracle_iou_visible": 0.641666577094131,
+    "oracle_j": 0.6459361736374968,
+    "oracle_jf": 0.7114182217465094,
+    "proposal_miss": 2,
+    "proposal_miss_percent": 11.11111111111111,
+    "recall@16": 0.5555555555555556,
+    "recall@32": 0.5555555555555556,
+    "recall@64": 0.8888888888888888
+  },
+  "audio_keyword": {
+    "count": 2844,
+    "oracle_f": 0.7842819413589385,
+    "oracle_iou_all": 0.7295879604172519,
+    "oracle_iou_visible": 0.7293891077744052,
+    "oracle_j": 0.7295879614708254,
+    "oracle_jf": 0.756934951414882,
+    "proposal_miss": 666,
+    "proposal_miss_percent": 23.417721518987342,
+    "recall@16": 0.47468354430379744,
+    "recall@32": 0.610056258790436,
+    "recall@64": 0.7658227848101266
+  },
+  "h3_candidate": {
+    "count": 3932,
+    "oracle_f": 0.7777484301907281,
+    "oracle_iou_all": 0.7197934413123055,
+    "oracle_iou_visible": 0.7202070991842038,
+    "oracle_j": 0.7197934425788871,
+    "oracle_jf": 0.7487709363848074,
+    "proposal_miss": 969,
+    "proposal_miss_percent": 24.643947100712104,
+    "recall@16": 0.4687182095625636,
+    "recall@32": 0.5974059003051883,
+    "recall@64": 0.753560528992879
+  },
+  "partial": {
+    "count": 8,
+    "oracle_f": 0.8269168466522676,
+    "oracle_iou_all": 0.7977360785007477,
+    "oracle_iou_visible": 0.6766794174909592,
+    "oracle_j": 0.7977360699530978,
+    "oracle_jf": 0.8123264583026827,
+    "proposal_miss": 0,
+    "proposal_miss_percent": 0.0,
+    "recall@16": 0.25,
+    "recall@32": 0.25,
+    "recall@64": 1.0
+  },
+  "same_category": {
+    "count": 330,
+    "oracle_f": 0.7644448335532433,
+    "oracle_iou_all": 0.6878195029645943,
+    "oracle_iou_visible": 0.6874837929881668,
+    "oracle_j": 0.687819501173607,
+    "oracle_jf": 0.7261321673634256,
+    "proposal_miss": 96,
+    "proposal_miss_percent": 29.09090909090909,
+    "recall@16": 0.4818181818181818,
+    "recall@32": 0.5878787878787879,
+    "recall@64": 0.7090909090909091
+  },
+  "small": {
+    "count": 1631,
+    "oracle_f": 0.6917960314676159,
+    "oracle_iou_all": 0.581673126376625,
+    "oracle_iou_visible": 0.5810682598301485,
+    "oracle_j": 0.5816731270979948,
+    "oracle_jf": 0.6367345792828051,
+    "proposal_miss": 599,
+    "proposal_miss_percent": 36.72593500919681,
+    "recall@16": 0.23666462293071736,
+    "recall@32": 0.3917841814837523,
+    "recall@64": 0.6327406499080319
+  },
+  "spatial_keyword": {
+    "count": 965,
+    "oracle_f": 0.715782608316947,
+    "oracle_iou_all": 0.6269581168444804,
+    "oracle_iou_visible": 0.6278982803011947,
+    "oracle_j": 0.6269581183310804,
+    "oracle_jf": 0.6713703633240147,
+    "proposal_miss": 330,
+    "proposal_miss_percent": 34.196891191709845,
+    "recall@16": 0.3305699481865285,
+    "recall@32": 0.47564766839378236,
+    "recall@64": 0.6580310880829016
+  },
+  "test_s": {
+    "count": 2288,
+    "oracle_f": 0.7064674157375836,
+    "oracle_iou_all": 0.628373636981925,
+    "oracle_iou_visible": 0.6283693503971877,
+    "oracle_j": 0.6283736383630024,
+    "oracle_jf": 0.6674205270502909,
+    "proposal_miss": 784,
+    "proposal_miss_percent": 34.26573426573427,
+    "recall@16": 0.32604895104895104,
+    "recall@32": 0.4833916083916084,
+    "recall@64": 0.6573426573426573
+  },
+  "test_u": {
+    "count": 1656,
+    "oracle_f": 0.8769527718080132,
+    "oracle_iou_all": 0.8467284532332201,
+    "oracle_iou_visible": 0.8477165634132825,
+    "oracle_j": 0.8467284543304232,
+    "oracle_jf": 0.8618406130692231,
+    "proposal_miss": 187,
+    "proposal_miss_percent": 11.292270531400966,
+    "recall@16": 0.6654589371980676,
+    "recall@32": 0.7548309178743962,
+    "recall@64": 0.8870772946859904
+  }
+}
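
The fields in this summary are related by a few simple identities, which makes the file easy to sanity-check. The sketch below uses values copied from the `all`, `test_s`, and `test_u` entries. It assumes, as the numbers suggest, that `oracle_jf` is the mean of `oracle_j` and `oracle_f`, and that a proposal miss is a sample not recalled at 64 tubes. The helper name `check_entry` is hypothetical.

```python
import math

# Values copied from summary.json.
all_entry = {
    "count": 3944,
    "oracle_f": 0.7780505165622835,
    "oracle_j": 0.7200560864466854,
    "oracle_jf": 0.749053301504484,
    "proposal_miss": 971,
    "proposal_miss_percent": 24.61967545638945,
    "recall@64": 0.7538032454361054,
}
test_s = {"count": 2288, "proposal_miss": 784}
test_u = {"count": 1656, "proposal_miss": 187}


def check_entry(entry):
    """Check the internal identities of one summary entry."""
    miss_rate = entry["proposal_miss"] / entry["count"]
    return {
        # J&F is the arithmetic mean of J and F.
        "jf_is_mean": math.isclose(
            entry["oracle_jf"], (entry["oracle_j"] + entry["oracle_f"]) / 2, rel_tol=1e-9
        ),
        # The miss percentage is the miss count over the sample count.
        "miss_percent": math.isclose(
            entry["proposal_miss_percent"], 100 * miss_rate, rel_tol=1e-9
        ),
        # Recall@64 and the miss rate sum to one.
        "recall_complement": math.isclose(entry["recall@64"], 1 - miss_rate, rel_tol=1e-9),
    }


print(check_entry(all_entry))
# The two test splits partition the full set.
print(test_s["count"] + test_u["count"] == all_entry["count"])
print(test_s["proposal_miss"] + test_u["proposal_miss"] == all_entry["proposal_miss"])
```

The other subsets (`small`, `spatial_keyword`, and so on) overlap with each other, so only `test_s` and `test_u` are expected to add up to `all`.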

runs/tubetoken_phase0/proposals_stride8_n64_bidir/manifest.json ADDED Viewed

The diff for this file is too large to render.

upload.log ADDED Viewed

The diff for this file is too large to render.