yfan07 commited on
Commit 08ff7f7 · verified · 1 Parent(s): 2751ecb

Add files using upload-large-folder tool

Files changed (3):
  1. SEG_LTPO_results.md +157 -17
  2. load_model.py +272 -25
  3. seg_ltpo.py +618 -32
SEG_LTPO_results.md CHANGED
@@ -317,32 +317,172 @@ QLTPOConfig(
 )
 ```
 ---
 
- ## Next Steps
 
- ### Immediate (full-set confirmation)
 
- Run full evaluations with e0-modulated Stage 1 to confirm quick-validation trends at scale:
 
 ```bash
- # Full Null (~30 min) — expect S ≈ 0.0120 + small increase, less than +5%
- TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_n
 
- # Full Seen (~35 min) — expect mIoU gain ≥ +0.013
- TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_s
 
- # Full Unseen (~35 min) — expect mIoU gain ≥ +0.025 (from pre-e0 baseline +0.0295)
- TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_u
 ```
 
- **Decision criteria to promote e0-Stage1 to final method:**
- - Null S degradation < 5% relative (full set)
- - Seen mIoU gain ≥ +0.012
- - Unseen mIoU gain ≥ +0.022
 
- ### If full-set confirms (future work)
 
- 1. **F-score improvement (Stage 3)**: Current gain is mainly in mIoU (overlap); F-score (boundary precision/recall) lags. Candidate: boundary-oriented reward using SAM's low-res logit gradient sharpness or contour consistency across anchor frames.
- 2. **Stronger e0 suppression ablation**: Test `e0_modulation="sqrt"` (g(e0) = sqrt(e0+ε)) to further compress the Null tail. Only justified if full-set Null degradation exceeds 5%.
- 3. **Stage 2 revisit**: R_align_det hurt at scale due to noisy z_in/z_out from low-quality initial masks. Possible fix: gate the align signal by `R_iou_pred > 0.85` so it is only used when the initial mask is reliable.
+ ### Full Unseen Evaluation with e0 (1656 samples)
+
+ | Method | mIoU | F | Δ mIoU |
+ |--------|------|---|--------|
+ | Baseline | 0.6990 | 0.7926 | — |
+ | q-LTPO S1 (no e0) | 0.7285 | 0.8013 | +0.0295 (+4.22%) |
+ | **q-LTPO S1 (e0)** | **0.7240** | **0.7985** | **+0.0250 (+3.56%)** |
+
+ The e0 variant is slightly lower in mIoU than the no-e0 variant (-0.0045) but is safer on Null. The F gain tracks the mIoU gain at a roughly constant ratio (about 60%).
+
+ **Full-evaluation status (updated):**
+
+ | Split | Baseline | q-LTPO S1 (e0) | Δ | Status |
+ |-------|----------|----------------|---|--------|
+ | Unseen (full, 1656) | 0.6990 / 0.7926 | 0.7240 / 0.7985 | +3.56% mIoU | ✅ Done |
+ | Seen (full) | — | — | — | Pending |
+ | Null (full, S↓) | 0.0120 | — | — | Pending |
+
+ ---
+
+ ## Direction B: Boundary Precision Experiments (concluded; outcome: failure)
+
+ ### B-Step1: Multimask Post-Processing (complete failure)
+
+ Replace single-mask decoding with SAM's multimask output (K=3), selecting the best candidate either by iou_pred or by a Sobel edge score.
+
+ | Method | mIoU | F | ΔF vs s1 |
+ |--------|------|---|----------|
+ | s1 (single mask) | 0.6979 | 0.8024 | — |
+ | s1_mm (iou_pred selection) | 0.6979 | 0.7917 | -0.0107 |
+ | s1_mm_edge (Sobel selection) | 0.5715 | 0.6820 | -0.1204 |
+
+ **Root cause:** SAM's internal single-mask selection is already optimal; external re-selection only makes things worse. In the normalized 1024×1024 space, the Sobel score picks texture fragments rather than the semantic target, failing catastrophically.
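The post-processing tried above amounts to re-selecting among SAM's K=3 candidate masks. A minimal sketch of the iou_pred selection rule, not part of this commit (the function name and shapes are illustrative; the real decode path lives in `decode_full_video`):

```python
import torch

def select_candidate(low_res_masks: torch.Tensor, iou_preds: torch.Tensor) -> torch.Tensor:
    """Pick one of K candidate masks by SAM's predicted IoU.

    low_res_masks: [K, H, W] mask logits from multimask_output=True
    iou_preds:     [K] predicted IoU per candidate
    """
    k = int(iou_preds.argmax().item())
    return low_res_masks[k]

# Toy example: candidate 1 has the highest predicted IoU, so it is returned.
masks = torch.stack([torch.full((4, 4), v) for v in (-1.0, 2.0, 0.5)])
iou_preds = torch.tensor([0.60, 0.91, 0.75])
best = select_candidate(masks, iou_preds)
```

The table above shows why this loses: iou_pred re-selection can only match or undo SAM's internal choice, which already uses the same signal.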
+
+ ### B1: Asymmetric Area-Inflation Penalty (mechanistically ineffective)
+
+ Hypothesis: LTPO inflates the mask into non-target regions (hurting precision), so add a penalty term to suppress the inflation.
+
+ **Experimental conclusion: the hypothesis is wrong.** During LTPO the soft area actually decreases (-16%) rather than increases:
+
+ ```
+ soft area: 0.1507 → 0.1267 (-16%)  ← background logits become more negative
+ hard area: 0.0635 → 0.0650 (+2.4%) ← actual mask area grows slightly
+ ```
+
+ **The "mask sharpening" phenomenon:** driven by R_iou_pred, Adam makes the logits more bimodal (foreground more positive, background more negative); the soft area drops because the 93% of pixels that are background contribute less. The precondition for the B1 penalty (a rising soft area) never occurs:
+
+ ```
+ B1 activation rate : 0.025   ← triggered on only 2.5% of samples
+ B1 mean excess     : 0.00002 ← negligible
+ ```
+
+ **Conclusion:** Direction B failed across the board, from multimask selection to area constraints, and is abandoned. The root cause of F-score lagging mIoU is not mask precision but the quality of the reward proxy signal (see Path A).
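The soft/hard area split that explains the sharpening effect can be sketched as follows (a standalone illustration under the assumption soft area = mean sigmoid of the logits and hard area = fraction of positive logits; `soft_and_hard_area` is a hypothetical helper, not the repo's API):

```python
import torch

def soft_and_hard_area(lrm: torch.Tensor, temp: float = 1.0) -> tuple:
    """Area diagnostics over low-res mask logits [T, H, W].

    soft area: mean sigmoid(logit / temp) — every pixel contributes, so
               pushing background logits more negative lowers it.
    hard area: fraction of pixels with logit > 0 — the actual mask size.
    """
    soft = torch.sigmoid(lrm / temp).mean().item()
    hard = (lrm > 0).float().mean().item()
    return soft, hard

# Sharpening: background logits go more negative, so the soft area
# shrinks while the hard (thresholded) area stays the same.
lrm = torch.full((1, 8, 8), -1.0)
lrm[0, :2, :2] = 3.0  # small positive foreground region
soft_before, hard_before = soft_and_hard_area(lrm)
lrm_sharp = lrm.clone()
lrm_sharp[lrm_sharp < 0] = -5.0  # bimodalize: background more negative
soft_after, hard_after = soft_and_hard_area(lrm_sharp)
```

This reproduces the observed pattern: soft area falls under sharpening while the thresholded mask is untouched, so an area-inflation penalty on the soft area never fires.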
+
+ ---
+
+ ## Direction II: Frame-Adaptive Token Optimization (preliminary exploration, deferred)
+
+ ### Method Design
+
+ Extend the single shared token q into a per-video token trajectory:
+
+ ```
+ q_t = q_global + delta_t
+ ```
+
+ where q_global is the globally shared token and delta_t is a local residual for each anchor frame, initialized to 0. Jointly optimize:
+
+ ```
+ max Σ_t [λ_iou · e0_t · R_iou(q_t) - λ_area · R_area(q_t)]
+     - λ_residual · ||delta||² - λ_smooth · Σ_t ||delta_t - delta_{t+1}||² - λ_reg · ||q_global - q_init||²
+ ```
+
+ Each anchor frame uses its own e0_t (a per-frame existence prior). delta_t is hard-clipped: `||delta_t|| ≤ scale × ||q_init||`.
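The penalty terms and the hard clip above can be sketched in isolation (λ values, helper names, and the `[A, 256]` residual shape are illustrative assumptions; the reward terms are omitted since they need the SAM decoder):

```python
import torch

def fa_regularizers(q_global, delta, q_init,
                    lam_residual=0.001, lam_smooth=0.01, lam_reg=0.05):
    """Penalty side of the frame-adaptive objective:
    lam_residual * ||delta||^2                  keeps residuals small,
    lam_smooth * sum_t ||delta_t - delta_t+1||^2 keeps the trajectory smooth,
    lam_reg * ||q_global - q_init||^2           tethers the shared token.
    """
    r_residual = delta.pow(2).sum()
    r_smooth = (delta[:-1] - delta[1:]).pow(2).sum()
    r_reg = (q_global - q_init).pow(2).sum()
    return lam_residual * r_residual + lam_smooth * r_smooth + lam_reg * r_reg

def clip_delta(delta, q_init, scale=0.3):
    """Hard clip: ||delta_t|| <= scale * ||q_init|| for every anchor frame."""
    max_norm = scale * q_init.norm()
    norms = delta.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    return delta * (max_norm / norms).clamp(max=1.0)

q_init = torch.randn(1, 256)
delta = torch.randn(4, 256) * 10.0  # deliberately oversized residuals
clipped = clip_delta(delta, q_init, scale=0.3)
penalty_at_init = fa_regularizers(q_init, torch.zeros(4, 256), q_init)
```

Note the penalties all vanish at the initialization (delta = 0, q_global = q_init), so the optimizer starts from the shared-token solution and only pays for moving away from it.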
+
+ ### 200-sample Probe Results (Unseen split)
+
+ | Method | mIoU | F | reward gain p50 | delta ‖Δ‖ |
+ |--------|------|---|-----------------|-----------|
+ | baseline | 0.6745 | 0.7763 | — | — |
+ | s1 | 0.6945 | 0.7773 | +0.0053 | — |
+ | fa_base (unconstrained) | 0.6945 | 0.7711 | +0.0112 | 1.675 |
+ | fa_smooth (λ_smooth=0.01) | 0.6960 | 0.7731 | +0.0104 | 1.488 |
+ | fa_c03 (delta clip 0.3×) | 0.6959 | 0.7722 | +0.0112 | — |
+
+ ### Key Findings
+
+ **Reward–metric gap (the core problem):**
+ ```
+ reward gain p50 : s1 = +0.0053   fa_c03 = +0.0112  (fa 2.1× higher)
+ R_iou_pred gain : s1   +0.077    fa_c03   +0.114
+ actual mIoU gain: s1   +2.96%    fa_c03   +3.17%   (only 0.21% apart)
+ ```
+ fa collects far more reward, yet mIoU barely improves further and F even drops slightly.
+
+ **Conclusion:** the bottleneck is not the optimization structure but the weak task relevance of R_iou_pred itself. R_iou_pred measures how clean the mask is, not whether the mask covers the correct audio-referred target. Every architectural variant (single token / frame-adaptive) is capped by the same ceiling.
+
+ Direction II will not be tuned further under the old reward; it will be reconsidered only after Path A (the new reward) shows a positive signal.
+
 ---
 
+ ## Path A: AVT-Aware Reward Redesign
+
+ ### Motivation
+
+ In Ref-AVS the referent is not necessarily the sounding object itself (it may be the person holding the sounding object, or an object related to the sound source). A purely audio-aligned reward would push optimization toward the sound source rather than the referent the text points to. We need referent consistency defined jointly by audio + text + global visual context.
+
+ ### AVT Proxy Reward Design
+
+ **Core insight:** Fseg (= q_init) is already a multimodal fusion token of audio + video + text, so it can serve directly as a frozen AVT teacher.
+
+ ```python
+ R_avt   = mean_t cos(z_in_t, q_init)
+ R_avt_c = mean_t [cos(z_in_t, q_init) - β · cos(z_out_t, q_init)]
+ ```
+
+ - `z_in_t`: the soft-masked image features of anchor frame t (in SAM's 256-dim space)
+ - `q_init`: the frozen Fseg (AVT anchor; excluded from the optimization gradient)
+ - High R_avt → the mask region aligns with the queried referent; low R_avt → the mask points at the wrong target
+
+ Difference from Stage 2: Stage 2 aligns the current (moving) q with z_in from the current mask, creating self-confirmation bias; R_avt uses the fixed q_init as the teacher, which breaks that bias.
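The two formulas above can be written as a standalone sketch (the commit's actual implementation is `_compute_avt_proxy_reward` in seg_ltpo.py; this version only illustrates the math, with toy feature shapes):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def avt_proxy_reward(z_in, z_out, q_init, beta=0.5, contrastive=True):
    """AVT proxy reward over anchor frames.

    z_in:   [A, 256] inside-mask (soft-masked) features per anchor frame
    z_out:  [A, 256] outside-mask features
    q_init: [1, 256] frozen Fseg used as the AVT teacher
    R_avt   = mean_t cos(z_in_t, q_init)
    R_avt_c = mean_t [cos(z_in_t, q_init) - beta * cos(z_out_t, q_init)]
    """
    q = q_init.detach()  # teacher stays fixed: no self-confirmation loop
    r_in = F.cosine_similarity(z_in, q, dim=-1)
    if not contrastive:
        return r_in.mean()
    r_out = F.cosine_similarity(z_out, q, dim=-1)
    return (r_in - beta * r_out).mean()

# A mask whose features align with the teacher scores higher than one
# pointing in the opposite direction.
q_init = torch.randn(1, 256)
z_in_good = q_init.repeat(4, 1) + 0.01 * torch.randn(4, 256)
z_in_bad = -q_init.repeat(4, 1)
z_out = torch.randn(4, 256)
r_good = avt_proxy_reward(z_in_good, z_out, q_init)
r_bad = avt_proxy_reward(z_in_bad, z_out, q_init)
```

Detaching `q_init` is the key design choice: the gradient only flows through `z_in`/`z_out` (i.e. through the mask), never through the teacher.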
+
+ ### Step A0: Reward–Metric Correlation Study (next step)
+
+ **Goal:** before moving to full optimization, verify on data whether the new reward predicts real metric changes better than R_iou_pred does.
+
+ **Setup (200 samples, Unseen split):**
+ For each (video, segment) sample:
+ 1. Baseline decode → IoU_base, F_base
+ 2. q-LTPO s1 → q_best; record reward_gain, r_avt_gain, r_avt_c_gain (all computed inside q_ltpo_autograd)
+ 3. LTPO decode → IoU_ltpo, F_ltpo
+ 4. Δ = LTPO - baseline
+
+ Output a Pearson correlation table:
+
+ ```
+ Pearson r with ΔmIoU:
+   R_iou_pred_gain : +0.xxx  ← current proxy
+   R_avt_gain      : +0.xxx  ← cos(z_in, q_init)
+   R_avt_c_gain    : +0.xxx  ← contrastive variant
+
+ Wrong direction (gain>0 but Δ<0):
+   R_iou / ΔmIoU : 0.xxx
+   R_avt / ΔmIoU : 0.xxx
+ ```
+
+ **Run command:**
 ```bash
+ python load_model.py --eval_split test_u --max_eval_rows 200
+ ```
 
+ **Decision criteria:**
+ - `r(R_avt, ΔmIoU) > r(R_iou, ΔmIoU)` → the AVT proxy is better; proceed to Step A1
+ - The two are comparable → the reward itself is not the bottleneck; rethink the approach
+ - `R_avt / ΔF` wrong fraction clearly below `R_iou / ΔF` → AVT explains why F-score does not track mIoU
+
+ ### Step A1: Hybrid Reward (after Step A0 validates)
+
+ ```
+ R_task = λ1 · e0 · R_iou_pred + λ2 · R_avt_c - λ3 · R_area_soft
 ```
 
+ - R_iou_pred keeps handling mask quality (the shape-quality signal)
+ - R_avt_c handles referent correctness (the task-specific signal)
+ - Only the combination can plausibly maintain IoU while also improving F
+
+ Candidate weight combination: `λ1=0.6, λ2=0.5, λ3=0.2` (AVT as an auxiliary term, not a full replacement for R_iou).
+
+ If Step A1 shows a positive signal, then consider combining Direction II (frame-adaptive) with the new reward.
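The hybrid formula can be sketched as a scalar combination (defaults mirror the candidate weights above; the function name and scalar inputs are illustrative, not the repo's API, which would receive tensors from the decode loop):

```python
def hybrid_reward(r_iou_pred, r_avt_c, r_area_soft, e0,
                  lam1=0.6, lam2=0.5, lam3=0.2):
    """Step A1 candidate: R_task = λ1·e0·R_iou_pred + λ2·R_avt_c - λ3·R_area_soft.

    e0 gates the IoU term, so near-empty (Null) initializations are not
    pushed toward hallucinated masks; R_avt_c adds referent correctness.
    """
    return lam1 * e0 * r_iou_pred + lam2 * r_avt_c - lam3 * r_area_soft

# On a Null-like sample (e0 ≈ 0) the IoU term vanishes and the area
# penalty dominates, so inflating the mask can only lower the reward.
r_null = hybrid_reward(r_iou_pred=0.9, r_avt_c=0.0, r_area_soft=0.3, e0=0.0)
r_seen = hybrid_reward(r_iou_pred=0.9, r_avt_c=0.2, r_area_soft=0.3, e0=1.0)
```

The e0 gate is the same Null-safety mechanism as in Stage 1; λ2 simply adds the AVT term on top without removing it.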
 
 
load_model.py CHANGED
@@ -498,6 +498,8 @@ if __name__ == "__main__":
         get_sam_model, get_anchor_indices,
         QLTPOConfig, q_ltpo_autograd, check_grad_connectivity,
         reset_q_ltpo_stats, get_q_ltpo_stats,
+        q_ltpo_frame_adaptive, decode_full_video_adaptive,
+        _compute_avt_proxy_reward,
     )
 
     def print_q_ltpo_stats(name: str) -> None:
@@ -521,6 +523,14 @@ if __name__ == "__main__":
         gains = sorted(s["reward_gain"] for s in stats)
         def _pct(v, p): return v[max(0, int(len(v) * p / 100) - 1)]
         mean_e0 = sum(s["e0"] for s in stats) / n
+        mean_mask_iou = sum(s.get("mask_soft_iou", 0.0) for s in stats) / n
+        mean_iou_contrib = sum(s.get("R_iou_contrib_gain", 0.0) for s in stats) / n
+        mean_soft_area_init = sum(s.get("r_area_soft_init", 0.0) for s in stats) / n
+        mean_soft_area_best = sum(s.get("r_area_soft_best", 0.0) for s in stats) / n
+        # B1 activation diagnostics
+        b1_excesses = sorted(s.get("b1_peak_excess", 0.0) for s in stats)
+        b1_act_rate = sum(1 for v in b1_excesses if v > 1e-8) / n
+        b1_mean_excess = sum(b1_excesses) / n
         print(f"\n [q-LTPO stats | {name} | n={n}]")
         print(f" acceptance rate : {acc_rate:.3f}")
         print(f" mean e0 (exist prior): {mean_e0:.4f} ← should differ Null vs Seen")
@@ -529,20 +539,30 @@ if __name__ == "__main__":
         print(f" mean drift ‖q−q₀‖ : {mean_drift:.4f}")
         print(f" hit-clip ratio : {clip_rate:.3f}")
         print(f" R_iou_pred init→best : {mean_iou_init:.4f} → {mean_iou_best:.4f}")
+        print(f" R_iou_contrib_gain : {mean_iou_contrib:+.4f} ← λ_iou·e0·Δiou")
+        print(f" mask soft-IoU(init,best): {mean_mask_iou:.4f} ← 1.0 = mask unchanged")
         print(f" area (hard) init→best: {mean_area_init:.4f} → {mean_area_best:.4f}")
+        print(f" soft area init→best : {mean_soft_area_init:.4f} → {mean_soft_area_best:.4f}")
+        print(f" B1 activation rate : {b1_act_rate:.3f} ← frac(peak_area > e0)")
+        print(f" B1 mean excess : {b1_mean_excess:.5f} ← mean ReLU(peak_area - e0)")
+        print(f" B1 excess p10/50/90 : {_pct(b1_excesses,10):.5f} / {_pct(b1_excesses,50):.5f} / {_pct(b1_excesses,90):.5f}")
         print(f" reward↑ & area+20%↑ : {null_risk:.3f} ← Null safety indicator")
+        # Direction II: frame-adaptive delta diagnostics
+        delta_norms = [s.get("delta_norm", 0.0) for s in stats]
+        if any(v > 0 for v in delta_norms):
+            print(f" mean delta ‖Δ‖ : {sum(delta_norms)/n:.4f} ← per-anchor residual norm")
 
-    def valuate_ltpo(model, dataloader, name, ltpo_cfg, optimize_fn=None, max_rows=-1):
+    def valuate_ltpo(model, dataloader, name, ltpo_cfg, optimize_fn=None,
+                     max_rows=-1, multimask=False, use_edge=False):
         if optimize_fn is None:
             optimize_fn = ltpo_optimize
         """
-        Evaluate with SEG-LTPO-simple test-time optimisation.
+        Evaluate with SEG-LTPO test-time optimisation + optional boundary refinement.
 
-        For each sample:
-          1. Run the standard SimToken forward pass once to get initial Fseg.
-          2. Optimise Fseg on 4 anchor frames using antithetic ES (5 steps).
-          3. Decode the full video with the best Fseg found.
-          4. Fall back to the original Fseg when reward gating rejects the update.
+        decode_mode:
+          multimask=False, use_edge=False : original single-mask decode (default)
+          multimask=True, use_edge=False : 3 candidates, SAM iou_pred selection (step 1a)
+          multimask=True, use_edge=True : 3 candidates, boundary-edge score (step 1b)
         """
         model.eval()
         sam_model = get_sam_model(model)
@@ -590,6 +610,7 @@ if __name__ == "__main__":
                 image_embeds_b = input_dict["image_feats"][b]  # [T, 256, 64, 64]
                 resize_b = input_dict["resizes"][b]
                 orgsize_b = input_dict["orgsizes"][b]
+                rgb_b = input_dict["images"][b] if use_edge else None  # [T,3,H,W]
 
                 # Convert initial Fseg to float32 for stable optimisation.
                 # seg_emb_list[b]: [num_seg, 256] in bfloat16
@@ -609,6 +630,7 @@ if __name__ == "__main__":
                     pred_mask = decode_full_video(
                         best_fseg, image_embeds_b, sam_model,
                         resize_b, orgsize_b, model_dtype,
+                        rgb_frames=rgb_b, multimask=multimask,
                     )  # [T, H, W]
                     pred_masks_ltpo.append(pred_mask)
 
@@ -699,6 +721,219 @@ if __name__ == "__main__":
         print(f"\n LTPO valuate on Null: S metric: {total_metric/count:.4f}")
 
 
+    def valuate_ltpo_adaptive(model, dataloader, name, ltpo_cfg, max_rows=-1):
+        """Evaluate with Direction II frame-adaptive token optimization."""
+        model.eval()
+        sam_model = get_sam_model(model)
+        model_dtype = torch.bfloat16
+        num_frames = 10
+        anchor_indices = get_anchor_indices(num_frames, ltpo_cfg.num_anchors)
+
+        total_iou = 0
+        total_fscore = 0
+        count = 0
+
+        _total = min(max_rows, len(dataloader)) if max_rows > 0 else len(dataloader)
+        for i, batch in enumerate(tqdm(dataloader, desc=f"FA-LTPO Evaluating on {name}", total=_total)):
+            if 0 < max_rows <= i:
+                break
+            input_dict = dict_to_cuda(batch)
+
+            with torch.cuda.amp.autocast(dtype=torch.bfloat16, enabled=True):
+                with torch.no_grad():
+                    output_dict = model.forward(
+                        images=input_dict["images"],
+                        images_clip=input_dict["images_clip"],
+                        audio_features=input_dict["audio_feats"],
+                        image_features=input_dict["image_feats"],
+                        input_ids=input_dict["input_ids"],
+                        labels=input_dict["labels"],
+                        attention_masks=input_dict["attention_masks"],
+                        masks_list=input_dict["masks"],
+                        resize_list=input_dict["resizes"],
+                        orgsize_list=input_dict["orgsizes"],
+                        conversation_list=input_dict["convs"],
+                        refs_num=input_dict["refs_num"],
+                        fids=input_dict["fids"],
+                        vids=input_dict["vids"],
+                        contrast=args.ct_weight,
+                        ref_ids=input_dict["ref_ids"],
+                        inference=True,
+                    )
+
+            gt_masks = output_dict["gt_masks"]  # list[B]:[num_seg, T, H, W]
+            seg_emb_list = output_dict["seg_embeddings"]  # list[B]:[num_seg, 256]
+
+            for b in range(len(input_dict["images"])):
+                image_embeds_b = input_dict["image_feats"][b]
+                resize_b = input_dict["resizes"][b]
+                orgsize_b = input_dict["orgsizes"][b]
+                F_init_b = seg_emb_list[b].detach().float()
+
+                pred_masks_ltpo = []
+                for seg_idx in range(F_init_b.shape[0]):
+                    fseg_init = F_init_b[seg_idx : seg_idx + 1]
+
+                    q_global, delta = q_ltpo_frame_adaptive(
+                        fseg_init, image_embeds_b, anchor_indices,
+                        sam_model, model_dtype, ltpo_cfg,
+                    )
+
+                    pred_mask = decode_full_video_adaptive(
+                        q_global, delta, anchor_indices,
+                        image_embeds_b, sam_model,
+                        resize_b, orgsize_b, model_dtype,
+                    )
+                    pred_masks_ltpo.append(pred_mask)
+
+                pred_masks_b = torch.stack(pred_masks_ltpo, dim=0)
+                num_seg = pred_masks_b.shape[0]
+                T_ = pred_masks_b.shape[1]
+                iou = utility.mask_iou(pred_masks_b, gt_masks[b])
+                fscore = utility.Eval_Fmeasure(pred_masks_b, gt_masks[b], None)
+
+                total_iou += iou * num_seg * T_
+                total_fscore += fscore * num_seg * T_
+                count += num_seg * T_
+
+        print(f"\n FA-LTPO valuate on {name}: miou: {total_iou/count:.4f} fscore: {total_fscore/count:.4f}")
+
+    # ── Step A0: reward–metric correlation study ─────────────────────────
+
+    def _print_correlation_report(per_sample: list) -> None:
+        import numpy as np
+        n = len(per_sample)
+        if n == 0:
+            return
+
+        r_iou = np.array([s["reward_gain"] for s in per_sample], dtype=float)
+        r_avt = np.array([s["r_avt_gain"] for s in per_sample], dtype=float)
+        r_avt_c = np.array([s["r_avt_c_gain"] for s in per_sample], dtype=float)
+        dm = np.array([s["delta_miou"] for s in per_sample], dtype=float)
+        df = np.array([s["delta_f"] for s in per_sample], dtype=float)
+
+        def pearson(x, y):
+            x = x - x.mean(); y = y - y.mean()
+            denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
+            return float((x * y).sum() / (denom + 1e-12))
+
+        def wrong_frac(gains, deltas):
+            return sum(1 for g, d in zip(gains, deltas) if g > 0 and d < 0) / n
+
+        print(f"\n [Step A0: Reward–Metric Correlation | n={n}]")
+        print(f" mean ΔmIoU : {dm.mean():+.4f} (std {dm.std():.4f})")
+        print(f" mean ΔF : {df.mean():+.4f} (std {df.std():.4f})")
+        print(f"\n Pearson r with ΔmIoU :")
+        print(f" R_iou_pred_gain : {pearson(r_iou, dm):+.3f} ← current proxy")
+        print(f" R_avt_gain : {pearson(r_avt, dm):+.3f} ← cos(z_in, q_init)")
+        print(f" R_avt_c_gain : {pearson(r_avt_c, dm):+.3f} ← cos(z_in,q)-β·cos(z_out,q)")
+        print(f"\n Pearson r with ΔF :")
+        print(f" R_iou_pred_gain : {pearson(r_iou, df):+.3f}")
+        print(f" R_avt_gain : {pearson(r_avt, df):+.3f}")
+        print(f" R_avt_c_gain : {pearson(r_avt_c, df):+.3f}")
+        print(f"\n Wrong direction (gain>0 but Δ<0):")
+        print(f" R_iou / ΔmIoU : {wrong_frac(r_iou, dm):.3f}")
+        print(f" R_avt / ΔmIoU : {wrong_frac(r_avt, dm):.3f}")
+        print(f" R_iou / ΔF : {wrong_frac(r_iou, df):.3f}")
+        print(f" R_avt / ΔF : {wrong_frac(r_avt, df):.3f}")
+
+    def valuate_ltpo_correlation_study(model, dataloader, ltpo_cfg, max_rows=-1):
+        """Step A0: per-sample reward–metric correlation study.
+
+        For each (video, segment) sample runs:
+          1. Baseline decode (q_init → mask → IoU/F)
+          2. q-LTPO s1 (q_best → mask → IoU/F)
+        Records reward signals and ΔmIoU / ΔF per sample, then prints a
+        Pearson correlation table to identify which reward best predicts
+        actual metric improvement.
+        """
+        model.eval()
+        sam_model = get_sam_model(model)
+        model_dtype = torch.bfloat16
+        anchor_indices = get_anchor_indices(10, ltpo_cfg.num_anchors)
+
+        per_sample = []
+
+        _total = min(max_rows, len(dataloader)) if max_rows > 0 else len(dataloader)
+        for i, batch in enumerate(
+            tqdm(dataloader, desc="Correlation study (s1)", total=_total)
+        ):
+            if 0 < max_rows <= i:
+                break
+            input_dict = dict_to_cuda(batch)
+
+            with torch.cuda.amp.autocast(dtype=torch.bfloat16, enabled=True):
+                with torch.no_grad():
+                    output_dict = model.forward(
+                        images=input_dict["images"],
+                        images_clip=input_dict["images_clip"],
+                        audio_features=input_dict["audio_feats"],
+                        image_features=input_dict["image_feats"],
+                        input_ids=input_dict["input_ids"],
+                        labels=input_dict["labels"],
+                        attention_masks=input_dict["attention_masks"],
+                        masks_list=input_dict["masks"],
+                        resize_list=input_dict["resizes"],
+                        orgsize_list=input_dict["orgsizes"],
+                        conversation_list=input_dict["convs"],
+                        refs_num=input_dict["refs_num"],
+                        fids=input_dict["fids"],
+                        vids=input_dict["vids"],
+                        contrast=args.ct_weight,
+                        ref_ids=input_dict["ref_ids"],
+                        inference=True,
+                    )
+
+            gt_masks = output_dict["gt_masks"]  # list[B]:[num_seg, T, H, W]
+            seg_emb_list = output_dict["seg_embeddings"]  # list[B]:[num_seg, 256]
+
+            for b in range(len(input_dict["images"])):
+                image_embeds_b = input_dict["image_feats"][b]
+                resize_b = input_dict["resizes"][b]
+                orgsize_b = input_dict["orgsizes"][b]
+                F_init_b = seg_emb_list[b].detach().float()
+
+                for seg_idx in range(F_init_b.shape[0]):
+                    q_init = F_init_b[seg_idx : seg_idx + 1]  # [1, 256]
+                    gt_seg = gt_masks[b][seg_idx : seg_idx + 1]  # [1, T, H, W]
+
+                    # Baseline decode (q_init, no LTPO)
+                    with torch.no_grad():
+                        pred_base = decode_full_video(
+                            q_init, image_embeds_b, sam_model,
+                            resize_b, orgsize_b, model_dtype,
+                        ).unsqueeze(0)  # [1, T, H, W]
+                    iou_base = utility.mask_iou(pred_base, gt_seg)
+                    f_base = utility.Eval_Fmeasure(pred_base, gt_seg, None)
+
+                    # LTPO (s1) — also computes r_avt inside q_ltpo_autograd
+                    reset_q_ltpo_stats()
+                    q_best = q_ltpo_autograd(
+                        q_init, image_embeds_b, anchor_indices,
+                        sam_model, model_dtype, ltpo_cfg,
+                    )
+                    stat = get_q_ltpo_stats()[0]
+
+                    with torch.no_grad():
+                        pred_ltpo = decode_full_video(
+                            q_best, image_embeds_b, sam_model,
+                            resize_b, orgsize_b, model_dtype,
+                        ).unsqueeze(0)
+                    iou_ltpo = utility.mask_iou(pred_ltpo, gt_seg)
+                    f_ltpo = utility.Eval_Fmeasure(pred_ltpo, gt_seg, None)
+
+                    per_sample.append({
+                        "reward_gain": stat["reward_gain"],
+                        "r_avt_gain": stat.get("r_avt_gain", 0.0),
+                        "r_avt_c_gain": stat.get("r_avt_c_gain", 0.0),
+                        "e0": stat["e0"],
+                        "accepted": stat["accepted"],
+                        "delta_miou": float(iou_ltpo - iou_base),
+                        "delta_f": float(f_ltpo - f_base),
+                    })
+
+        _print_correlation_report(per_sample)
+
     # ── Stage 0: gradient connectivity check ─────────────────────────────
     # Loads one image_embed directly from disk — no dataloader, no gt_mask,
     # no media frames required. F_init is a unit-scale random vector that
@@ -846,32 +1081,44 @@ if __name__ == "__main__":
 
     # ── Run evaluation ────────────────────────────────────────────────────
 
-    ltpo_cfg = LTPOConfig()
-    q_ltpo_cfg_s1 = QLTPOConfig(stage=1)
-    q_ltpo_cfg_s2 = QLTPOConfig(stage=2)
-    max_rows = args.max_eval_rows  # -1 = all rows
+    ltpo_cfg = LTPOConfig()
+    q_ltpo_cfg_s1 = QLTPOConfig(stage=1)
+    q_ltpo_cfg_s2 = QLTPOConfig(stage=2)
+    q_ltpo_cfg_s21 = QLTPOConfig(stage=21)  # P1a: tether probe
+    q_ltpo_cfg_s22 = QLTPOConfig(stage=22)  # P1b: faithful ext-ref
+
+    # ── Direction B: boundary precision probes ──────────────────────────────
+    q_ltpo_cfg_b1_w03 = QLTPOConfig(stage=1, lambda_area_inc=0.3, area_inc_tau=0.0)
+    q_ltpo_cfg_b1_w10 = QLTPOConfig(stage=1, lambda_area_inc=1.0, area_inc_tau=0.0)
+
+    # ── Direction II: Frame-adaptive token optimization ─────────────────────
+    # fa_c03: delta clipped at 0.3×‖q_init‖ — moderate constraint.
+    # First probe to answer: "does constrained frame-adaptive beat shared q?"
+    # If yes → ablate tighter/looser constraints and smoothness in follow-up.
+    q_ltpo_cfg_fa_c03 = QLTPOConfig(stage=1, lambda_residual=0.001, lambda_smooth_temp=0.0, max_delta_drift_scale=0.3)
+
+    max_rows = args.max_eval_rows  # -1 = all rows
 
     # --max_eval_rows 0 → Stage 0 + bypass equivalence check, then exit
     if max_rows == 0:
         run_stage0_check()
         run_bypass_test()
     elif _split == 'test_n':
-        # Safety check: Baseline vs q-LTPO Stage 1 only.
-        # ES-LTPO / Stage 2 are omitted — ES is no longer the primary method,
-        # and Stage 2 consistently underperforms Stage 1. If Stage 1 shows
-        # notable deterioration here, add a small Best-of-2 ES subset run to
-        # distinguish "reward unsafe on Null" from "autograd more aggressive".
+        # Null safety check: baseline + Stage 1 + frame-adaptive
        valuate_Null(model, _dataloader, max_rows=max_rows)
+        for cfg_name, cfg in [("s1", q_ltpo_cfg_s1)]:
+            reset_q_ltpo_stats()
+            valuate_ltpo_null(model, _dataloader, cfg,
+                              optimize_fn=q_ltpo_autograd, max_rows=max_rows)
+            print_q_ltpo_stats(f"null_q_ltpo_{cfg_name}")
         reset_q_ltpo_stats()
-        valuate_ltpo_null(model, _dataloader, q_ltpo_cfg_s1,
-                          optimize_fn=q_ltpo_autograd, max_rows=max_rows)
-        print_q_ltpo_stats("null_q_ltpo_s1")
+        valuate_ltpo_adaptive(model, _dataloader, "null_fa_c03",
+                              q_ltpo_cfg_fa_c03, max_rows=max_rows)
+        print_q_ltpo_stats("null_fa_c03")
     else:
-        # Baseline + q-LTPO Stage 1 only. ES series omitted — q-autograd is
-        # the primary method; Stage 2 consistently underperforms Stage 1.
         valuate(model, _dataloader, _split, max_rows=max_rows)
-        reset_q_ltpo_stats()
-        valuate_ltpo(model, _dataloader, f'{_split}_q_ltpo_s1', q_ltpo_cfg_s1,
-                     optimize_fn=q_ltpo_autograd, max_rows=max_rows)
-        print_q_ltpo_stats(f'{_split}_q_ltpo_s1')
+        # Step A0: reward–metric correlation study (s1 + AVT proxy signals)
+        valuate_ltpo_correlation_study(
+            model, _dataloader, q_ltpo_cfg_s1, max_rows=max_rows
+        )
seg_ltpo.py CHANGED
@@ -283,31 +283,98 @@ def best_of_2_optimize(
283
  # Full-video decode with a given Fseg
284
  # ---------------------------------------------------------------------------
285
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
286
  def decode_full_video(
287
- fseg: torch.Tensor, # [1, 256] float32
288
- image_embeds: torch.Tensor, # [T, 256, 64, 64] model dtype on CUDA
289
  sam_model,
290
- resize: tuple, # (H_resized, W_resized) – after ResizeLongestSide
291
- orgsize: tuple, # (H_orig, W_orig)
292
  model_dtype: torch.dtype,
 
 
293
  ) -> torch.Tensor:
294
- """
295
- Decode all T frames with the given Fseg.
 
 
 
 
 
 
296
  Returns raw logit mask [T, H_orig, W_orig] (not yet sigmoid).
297
  """
298
- device = image_embeds.device
299
  dense_emb = _precompute_dense_emb(sam_model, model_dtype, device)
300
  dense_pe = sam_model.prompt_encoder.get_dense_pe().to(device)
301
  sparse_emb = fseg.to(model_dtype).unsqueeze(1) # [1, 1, 256]
302
 
303
  with torch.no_grad():
304
- low_res_masks, _ = sam_model.mask_decoder(
305
- image_embeddings=image_embeds, # [T, 256, 64, 64]
306
  image_pe=dense_pe,
307
- sparse_prompt_embeddings=sparse_emb, # [1, 1, 256]
308
- dense_prompt_embeddings=dense_emb, # [1, 256, 64, 64]
309
- multimask_output=False,
310
- ) # [T, 1, 256, 256]
 
 
 
 
 
 
 
 
 
 
 
311
 
312
  pred_mask = sam_model.postprocess_masks(
313
  low_res_masks, input_size=resize, original_size=orgsize
@@ -401,12 +468,14 @@ def ltpo_optimize(
401
 
402
  @dataclass
403
  class QLTPOConfig:
404
- """Configuration for q_ltpo_autograd (Stages 1–3).
405
 
406
  stage controls which reward terms are active:
407
- 1 R_iou + R_area_soft + reg (gradient connectivity + stability)
408
- 2 Stage 1 + R_align_det (z stopgrad) (semantic alignment)
409
- 3 Stage 2 + R_temp_feat (full reward)
 
 
410
  """
411
  stage: int = 1
412
  T: int = 5
@@ -443,12 +512,44 @@ class QLTPOConfig:
443
  e0_modulation: str = "identity"
444
  e0_eps: float = 1e-4 # epsilon for "sqrt" variant
445
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
446
  # ── Oracle Null-safety gate (analysis only; NOT for final method) ──────
447
  # Derived from test-set distribution (Null area_hard ≈ 0.01, Seen ≈ 0.05)
448
  # so must not be used in reported results. Set null_gate_delta=0 to disable.
449
  null_area_threshold: float = 0.02 # hard area fraction below which guard activates
450
  null_gate_delta: float = 0.0 # 0 = disabled; 0.05 = oracle experiment
451
 
 
 
 
 
 
 
 
 
 
 
 
 
 
452
 
453
  # ---------------------------------------------------------------------------
454
  # e0 helper
@@ -508,10 +609,32 @@ def _task_reward_stage1(
508
  optimizer sees only the area-penalty gradient and naturally tends toward
509
  smaller (more conservative) masks — the correct behavior when the initial
510
  prediction is near-empty (Null frames).
 
 
 
 
 
 
 
 
 
511
  """
512
  r_iou = iou.mean()
513
  r_area = torch.sigmoid(lrm / cfg.area_temp).mean()
514
- return cfg.lambda_iou * e0 * r_iou - cfg.lambda_area * r_area
 
 
 
 
 
 
 
 
 
 
 
 
 
515
 
516
 
517
  def _task_reward_stage2(
@@ -575,6 +698,167 @@ def _task_reward_stage3(
575
  return r_s2 + cfg.lambda_temp * r_temp
576
 
577
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
def _compute_task_reward(
    q: torch.Tensor,
    lrm: torch.Tensor,
@@ -582,12 +866,20 @@ def _compute_task_reward(
    image_embeds_anchor_fp32: torch.Tensor,
    cfg: QLTPOConfig,
    e0: float = 1.0,
) -> torch.Tensor:
    """Dispatch to the correct stage's task reward."""
    if cfg.stage == 1:
        return _task_reward_stage1(lrm, iou, cfg, e0)
    if cfg.stage == 2:
        return _task_reward_stage2(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)
    return _task_reward_stage3(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)


@@ -599,9 +891,11 @@ def _compute_full_reward(
    q_init: torch.Tensor,
    cfg: QLTPOConfig,
    e0: float = 1.0,
) -> torch.Tensor:
    """Full reward = task reward + L2 regularization (used for backward)."""
-   r_task = _compute_task_reward(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)
    r_reg = (q - q_init).pow(2).sum()
    return r_task - cfg.lambda_reg * r_reg

@@ -661,6 +955,53 @@ def check_grad_connectivity(
    }


# ---------------------------------------------------------------------------
# Stage 1–3: q-LTPO-autograd main optimizer
# ---------------------------------------------------------------------------
@@ -697,6 +1038,11 @@ def q_ltpo_autograd(
    lr = cfg.lr if cfg.lr > 0 else 0.01 * rms.item()
    max_drift = cfg.max_drift if cfg.max_drift > 0 else 0.5 * q_init_fp32.norm().item()

    # ── Baseline forward + e0 existence prior ────────────────────────────
    with torch.no_grad():
        lrm0, iou0 = _decode_on_anchors_diff(
@@ -708,7 +1054,8 @@
    e0 = _compute_e0(r_area_soft_init, cfg)

    R_init_task = _compute_task_reward(
-       q_init_fp32, lrm0, iou0, image_embeds_anchor, cfg, e0=e0
    ).item()

    # ── Optimisation setup ────────────────────────────────────────────────
@@ -720,13 +1067,17 @@
    hit_clip = False

    # ── Optimisation loop ─────────────────────────────────────────────────
    for step in range(cfg.T):
        optimizer.zero_grad()

        lrm, iou = _decode_on_anchors_diff(
            q, image_embeds_anchor, dense_emb, mask_dec, dense_pe
        )
-       R_full = _compute_full_reward(q, lrm, iou, image_embeds_anchor, q_init_fp32, cfg, e0=e0)
        R_full.backward()
        optimizer.step()

@@ -744,20 +1095,32 @@
        lrm_eval, iou_eval = _decode_on_anchors_diff(
            q.detach(), image_embeds_anchor, dense_emb, mask_dec, dense_pe
        )
        r_task = _compute_task_reward(
-           q.detach(), lrm_eval, iou_eval, image_embeds_anchor, cfg, e0=e0
        ).item()
        if r_task > best_reward:
            best_reward = r_task
            best_q = q.detach().clone()

    # ── Reward gating: clean re-eval of best_q vs q_init ─────────────────
    with torch.no_grad():
        lrm_b, iou_b = _decode_on_anchors_diff(
            best_q, image_embeds_anchor, dense_emb, mask_dec, dense_pe
        )
        R_best_task = _compute_task_reward(
-           best_q, lrm_b, iou_b, image_embeds_anchor, cfg, e0=e0
        ).item()

    area_init = (lrm0 > 0).float().mean().item()
@@ -768,19 +1131,242 @@
    )
    accepted = R_best_task > R_init_task + effective_gate

    # ── Per-sample diagnostics ────────────────────────────────────────────
    _q_ltpo_stats.append({
-       "accepted": accepted,
-       "reward_gain": R_best_task - R_init_task,
-       "drift": (best_q - q_init_fp32).norm().item(),
-       "hit_clip": hit_clip,
-       "e0": e0,
-       "R_iou_pred_init": iou0.mean().item(),
-       "R_iou_pred_best": iou_b.mean().item(),
-       "area_hard_init": area_init,
-       "area_hard_best": (lrm_b > 0).float().mean().item(),
    })

    if not accepted:
        return F_init.float()
    return best_q

# Full-video decode with a given Fseg
# ---------------------------------------------------------------------------

+ def _sobel_edge(rgb_frames: torch.Tensor) -> torch.Tensor:
+     """Compute Sobel edge magnitude from normalized RGB frames.
+
+     Args:
+         rgb_frames: [T, 3, H, W] float32 (SAM-normalized, CUDA)
+     Returns:
+         edge: [T, 1, H, W] float32, non-negative
+     """
+     gray = rgb_frames.float().mean(dim=1, keepdim=True)  # [T, 1, H, W]
+     kx = torch.tensor([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
+                       dtype=torch.float32, device=rgb_frames.device).view(1, 1, 3, 3)
+     ky = kx.transpose(2, 3)
+     gx = F.conv2d(gray, kx, padding=1)
+     gy = F.conv2d(gray, ky, padding=1)
+     return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)  # [T, 1, H, W]
+
+
+ def _boundary_edge_score(
+     low_res_masks: torch.Tensor,  # [T, K, 256, 256] logits
+     rgb_frames: torch.Tensor,     # [T, 3, H, W] float32
+     resize: tuple,                # (H_resized, W_resized)
+     area_temp: float = 5.0,
+ ) -> torch.Tensor:
+     """Score each of K mask candidates by boundary-edge alignment.
+
+     R_edge = <soft_boundary_band, Sobel_edge> / (sum(soft_boundary_band) + ε)
+     Rewards masks whose boundaries coincide with image edges.
+
+     Returns: [T, K] float32 scores (higher = better boundary alignment)
+     """
+     T, K = low_res_masks.shape[:2]
+     H_r, W_r = resize
+
+     # Upsample all candidates to resized image resolution at once
+     masks_up = F.interpolate(
+         low_res_masks.reshape(T * K, 1, 256, 256).float(),
+         size=(H_r, W_r), mode="bilinear", align_corners=False,
+     ).reshape(T, K, H_r, W_r)  # [T, K, H, W]
+
+     E = _sobel_edge(rgb_frames[:, :, :H_r, :W_r])  # [T, 1, H, W]
+
+     m = torch.sigmoid(masks_up / area_temp)  # [T, K, H, W]
+     b = 4.0 * m * (1.0 - m)                  # soft boundary band
+     num = (b * E.squeeze(1).unsqueeze(1)).sum(dim=[2, 3])  # [T, K]
+     den = b.sum(dim=[2, 3]) + 1e-6
+     return num / den  # [T, K]
+
+
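The `_boundary_edge_score` hunk above weights Sobel edge magnitude by the soft boundary band `4·m·(1−m)`, which peaks where `m ≈ 0.5`, i.e. along the mask boundary. A minimal standalone sketch of the same selection rule (toy ramp logits rather than SAM outputs; `sobel_edge` / `boundary_edge_score` are hypothetical names for illustration):

```python
import torch
import torch.nn.functional as F

def sobel_edge(rgb):  # rgb: [T, 3, H, W]
    gray = rgb.float().mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, kx.transpose(2, 3), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)  # [T, 1, H, W]

def boundary_edge_score(mask_logits, rgb, temp=5.0):
    # mask_logits: [T, K, H, W]; the band 4m(1-m) localizes at m ≈ 0.5,
    # i.e. exactly along the soft mask boundary.
    m = torch.sigmoid(mask_logits / temp)
    band = 4.0 * m * (1.0 - m)
    E = sobel_edge(rgb)  # [T, 1, H, W], broadcasts over the K candidates
    return (band * E).sum(dim=[2, 3]) / (band.sum(dim=[2, 3]) + 1e-6)

# Toy image with a vertical intensity edge at x = 16.
T, H, W = 1, 32, 32
rgb = torch.zeros(T, 3, H, W)
rgb[..., 16:] = 1.0
x = torch.arange(W, dtype=torch.float32)
good = (2.0 * (x - 16.0)).expand(H, W).reshape(T, 1, H, W)  # boundary at x=16
bad = (2.0 * (x - 8.0)).expand(H, W).reshape(T, 1, H, W)    # boundary at x=8
scores = boundary_edge_score(torch.cat([good, bad], dim=1), rgb)
assert scores[0, 0] > scores[0, 1]  # boundary sitting on the image edge wins
```

The ramp-shaped logits make the band a narrow stripe around the 0.5-crossing, so only the candidate whose stripe overlaps the Sobel response scores high.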
  def decode_full_video(
335
+ fseg: torch.Tensor, # [1, 256] float32
336
+ image_embeds: torch.Tensor, # [T, 256, 64, 64] model dtype on CUDA
337
  sam_model,
338
+ resize: tuple, # (H_resized, W_resized)
339
+ orgsize: tuple, # (H_orig, W_orig)
340
  model_dtype: torch.dtype,
341
+ rgb_frames: Optional[torch.Tensor] = None, # [T, 3, H, W]; enables edge selection
342
+ multimask: bool = False, # True = 3 candidates; False = single mask
343
  ) -> torch.Tensor:
344
+ """Decode all T frames with the given Fseg.
345
+
346
+ Selection logic (applied per-frame):
347
+ - multimask=False, rgb_frames=None : original single-mask decode (baseline)
348
+ - multimask=True, rgb_frames=None : 3 candidates, select by SAM iou_pred
349
+ - multimask=True, rgb_frames=* : 3 candidates, select by boundary-edge score
350
+ (boundary band × Sobel edge; directly rewards boundary-image alignment)
351
+
352
  Returns raw logit mask [T, H_orig, W_orig] (not yet sigmoid).
353
  """
354
+ device = image_embeds.device
355
  dense_emb = _precompute_dense_emb(sam_model, model_dtype, device)
356
  dense_pe = sam_model.prompt_encoder.get_dense_pe().to(device)
357
  sparse_emb = fseg.to(model_dtype).unsqueeze(1) # [1, 1, 256]
358
 
359
  with torch.no_grad():
360
+ low_res_masks, iou_preds = sam_model.mask_decoder(
361
+ image_embeddings=image_embeds,
362
  image_pe=dense_pe,
363
+ sparse_prompt_embeddings=sparse_emb,
364
+ dense_prompt_embeddings=dense_emb,
365
+ multimask_output=multimask,
366
+ ) # [T, K, 256, 256], [T, K] where K=1 or K=3
367
+
368
+ if multimask:
369
+ T = low_res_masks.shape[0]
370
+ if rgb_frames is not None:
371
+ # Step 1b: boundary-edge score selects best candidate
372
+ scores = _boundary_edge_score(low_res_masks, rgb_frames, resize)
373
+ else:
374
+ # Step 1a: SAM's own iou_pred selects best candidate
375
+ scores = iou_preds
376
+ best_idx = scores.argmax(dim=1) # [T]
377
+ low_res_masks = low_res_masks[torch.arange(T, device=device), best_idx].unsqueeze(1)
378
 
379
  pred_mask = sam_model.postprocess_masks(
380
  low_res_masks, input_size=resize, original_size=orgsize
 
468
 
469
  @dataclass
470
  class QLTPOConfig:
471
+ """Configuration for q_ltpo_autograd (Stages 1–3 + Stage 2-ext variants).
472
 
473
  stage controls which reward terms are active:
474
+ 1 R_iou + R_area_soft + reg (baseline autograd)
475
+ 2 Stage 1 + R_align_det (z_in/z_out stopgrad) (self-bootstrapped alignment)
476
+ 3 Stage 2 + R_temp_feat (full reward)
477
+ 21 Stage 1 + R_tether (P1a: tether probe) (frozen r_ref via q_init attn)
478
+ 22 Stage 1 + R_faithful (P1b: faithful ext-ref) (z_in/z_out vs frozen r_ref)
479
  """
480
  stage: int = 1
481
  T: int = 5
 
512
  e0_modulation: str = "identity"
513
  e0_eps: float = 1e-4 # epsilon for "sqrt" variant
514
 
515
+ # ── Stage 2-ext: external reference (stages 21 and 22) ────────────────
516
+ # r_ref = AttnPool(image_feats_anchor, q_init): frozen visual anchor derived
517
+ # from q_init's attention over SAM image features. Breaks Stage 2's
518
+ # self-confirming bias by providing a mask-independent teacher.
519
+ # r_ref_temp: softmax temperature for attention pooling (sqrt(256) = 16).
520
+ r_ref_temp: float = 16.0
521
+
522
+ # ── Direction B: boundary precision rewards ────────────────────────────
523
+ # B1: asymmetric area expansion penalty
524
+ # Only penalises growth beyond (1+τ)×e0; allows mask contraction.
525
+ # Targets the observed pattern where LTPO slightly expands masks into
526
+ # non-target regions (recall↑ but precision↓, hurting F-score).
527
+ # B2: boundary sharpness reward
528
+ # -mean(4m(1-m)) with temperature=1.0; rewards bimodal (certain)
529
+ # mask predictions, encouraging cleaner boundary predictions.
530
+ lambda_area_inc: float = 0.0 # B1 weight (0 = disabled)
531
+ area_inc_tau: float = 0.0 # B1 tolerance band: allow (1+τ)×e0
532
+ lambda_sharp: float = 0.0 # B2 weight (0 = disabled)
533
+
534
  # ── Oracle Null-safety gate (analysis only; NOT for final method) ──────
535
  # Derived from test-set distribution (Null area_hard ≈ 0.01, Seen ≈ 0.05)
536
  # so must not be used in reported results. Set null_gate_delta=0 to disable.
537
  null_area_threshold: float = 0.02 # hard area fraction below which guard activates
538
  null_gate_delta: float = 0.0 # 0 = disabled; 0.05 = oracle experiment
539
 
540
+ # ── Direction II: Frame-adaptive token optimization (stage=4) ─────────
541
+ # q_t = q_global + delta_t, where delta_t is a per-anchor residual.
542
+ # Optimizes q_global and {delta_t} jointly with Adam.
543
+ # lambda_residual: soft L2 penalty on delta_t
544
+ # lambda_smooth_temp: temporal smoothness penalty on adjacent delta differences
545
+ # max_delta_drift_scale: per-anchor hard L2 clip = scale × ‖q_init‖
546
+ # Prevents individual anchors from wandering to a completely different visual mode.
547
+ # Keep << max_drift (0.5) so delta stays a "small frame correction" to q_global.
548
+ # 0.1 is tight (delta ≤ 20% of global drift budget), 0.3 is moderate.
549
+ lambda_residual: float = 0.001
550
+ lambda_smooth_temp: float = 0.0
551
+ max_delta_drift_scale: float = 0.1 # per-anchor clip = scale × ‖q_init‖
552
+
553
 
554
  # ---------------------------------------------------------------------------
555
  # e0 helper
 
    optimizer sees only the area-penalty gradient and naturally tends toward
    smaller (more conservative) masks — the correct behavior when the initial
    prediction is near-empty (Null frames).
+
+   Optional boundary precision terms (Direction B):
+     B1 (lambda_area_inc > 0): asymmetric expansion penalty
+         -λ_inc · ReLU(r_area - (1+τ)·e0)
+       Penalises mask growth beyond the initial area (+ tolerance band τ).
+       e0 doubles as the stopgrad initial-area threshold — zero extra cost.
+     B2 (lambda_sharp > 0): boundary sharpness reward
+         -λ_sharp · mean(4m(1-m)) with m = sigmoid(lrm), temperature=1.0
+       Maximises bimodality of mask logits → cleaner boundary predictions.
    """
    r_iou = iou.mean()
    r_area = torch.sigmoid(lrm / cfg.area_temp).mean()
+   R = cfg.lambda_iou * e0 * r_iou - cfg.lambda_area * r_area
+
+   # B1: penalise expansion beyond (1+τ)×e0 (allow contraction freely)
+   if cfg.lambda_area_inc > 0.0:
+       area_ceil = (1.0 + cfg.area_inc_tau) * e0
+       R = R - cfg.lambda_area_inc * F.relu(r_area - area_ceil)
+
+   # B2: reward confident (bimodal) boundary predictions
+   if cfg.lambda_sharp > 0.0:
+       m_sharp = torch.sigmoid(lrm)  # temperature=1.0 (sharp)
+       boundary_uncertain = 4.0 * m_sharp * (1.0 - m_sharp)
+       R = R - cfg.lambda_sharp * boundary_uncertain.mean()
+
+   return R


  def _task_reward_stage2(

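The two Direction-B terms added above are easy to check in isolation. A minimal sketch (standalone helpers named for illustration, not the repo's functions) showing that B1 is one-sided and B2 prefers bimodal logits:

```python
import torch
import torch.nn.functional as F

# B1: asymmetric expansion penalty — only growth beyond (1+tau)*e0 is penalised.
def b1_penalty(r_area, e0, lam=1.0, tau=0.1):
    return lam * F.relu(r_area - (1.0 + tau) * e0)

# B2: boundary-uncertainty term — 4m(1-m) is 1 at m=0.5 and 0 at m in {0, 1}.
def b2_uncertainty(logits):
    m = torch.sigmoid(logits)
    return (4.0 * m * (1.0 - m)).mean()

e0 = 0.05
assert b1_penalty(torch.tensor(0.03), e0).item() == 0.0  # contraction: free
assert b1_penalty(torch.tensor(0.08), e0).item() > 0.0   # expansion: penalised

confident = torch.tensor([-8.0, 8.0])  # bimodal logits
fuzzy = torch.tensor([-0.2, 0.2])      # near the decision boundary
assert b2_uncertainty(confident) < b2_uncertainty(fuzzy)
```

The ReLU makes B1 exactly zero for any area at or below the `(1+τ)·e0` ceiling, which is why contraction toward Null-safe masks is never discouraged.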
    return r_s2 + cfg.lambda_temp * r_temp


+ @torch.no_grad()
+ def _compute_r_ref(
+     q_init: torch.Tensor,               # [1, 256] float32
+     image_embeds_anchor: torch.Tensor,  # [A, 256, 64, 64] float32
+     temp: float = 16.0,
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     """Frozen external visual reference via attention pooling guided by q_init.
+
+     r_ref: regions most attended by q_init (positive anchor).
+     r_neg: regions least attended by q_init (anti-attended negative).
+     Both are in the SAM 256d space — no projection needed.
+     Computed once before the optimization loop and kept fixed (stopgrad).
+     """
+     img_flat = image_embeds_anchor.flatten(2)  # [A, 256, H*W]
+     q_norm = F.normalize(q_init[0], dim=0)     # [256]
+     img_norm = F.normalize(img_flat, dim=1)    # [A, 256, H*W]
+
+     # cosine similarity between q and each spatial position
+     attn = torch.einsum('d,adp->ap', q_norm, img_norm)  # [A, H*W]
+
+     attn_w_pos = torch.softmax(attn / temp, dim=-1)   # [A, H*W]
+     attn_w_neg = torch.softmax(-attn / temp, dim=-1)  # [A, H*W] anti-attended
+
+     # soft attention pooling in the original (non-normalized) feature space
+     r_ref_frames = torch.einsum('ap,adp->ad', attn_w_pos, img_flat)  # [A, 256]
+     r_neg_frames = torch.einsum('ap,adp->ad', attn_w_neg, img_flat)  # [A, 256]
+
+     r_ref = F.normalize(r_ref_frames.mean(0), dim=0)  # [256]
+     r_neg = F.normalize(r_neg_frames.mean(0), dim=0)  # [256]
+     return r_ref, r_neg
+
+
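The pooling above can be exercised on toy tensors. A sketch under stated assumptions (tiny 8-d features instead of SAM's 256-d, and a sharper temperature than the diff's default 16 so the softmax is visibly peaked; `compute_r_ref` is an illustrative standalone copy):

```python
import torch
import torch.nn.functional as F

def compute_r_ref(q_init, feats, temp=16.0):
    # q_init: [1, D]; feats: [A, D, H, W] — frozen attention-pooled anchors.
    flat = feats.flatten(2)                                    # [A, D, P]
    q = F.normalize(q_init[0], dim=0)                          # [D]
    attn = torch.einsum('d,adp->ap', q, F.normalize(flat, dim=1))
    w_pos = torch.softmax(attn / temp, dim=-1)                 # most-attended positions
    w_neg = torch.softmax(-attn / temp, dim=-1)                # anti-attended positions
    r_ref = F.normalize(torch.einsum('ap,adp->ad', w_pos, flat).mean(0), dim=0)
    r_neg = F.normalize(torch.einsum('ap,adp->ad', w_neg, flat).mean(0), dim=0)
    return r_ref, r_neg

torch.manual_seed(0)
q = torch.randn(1, 8)
feats = torch.randn(2, 8, 4, 4)
r_ref, r_neg = compute_r_ref(q, feats, temp=0.1)
assert abs(r_ref.norm().item() - 1.0) < 1e-5       # pooled anchors are unit-norm
q_unit = F.normalize(q[0], dim=0)
assert (q_unit @ r_ref) > (q_unit @ r_neg)         # positive anchor aligns better with q
```

With cosine logits in [-1, 1], a temperature of 16 yields near-uniform weights (close to mean pooling); the toy uses 0.1 only to make the positive/negative split visible.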
+ def _task_reward_stage2_tether(
+     q: torch.Tensor,       # [1, 256] float32
+     lrm: torch.Tensor,     # [A,1,256,256] float32
+     iou: torch.Tensor,     # [A,1] float32
+     r_ref: torch.Tensor,   # [256] frozen
+     r_neg: torch.Tensor,   # [256] frozen
+     cfg: QLTPOConfig,
+     e0: float = 1.0,
+ ) -> torch.Tensor:
+     """Stage 21 (P1a tether): Stage 1 + R_tether.
+
+     R_tether = cos(q, r_ref) - beta·cos(q, r_neg)
+     q is pulled toward the frozen visual anchor without touching mask features.
+     Tests whether a fixed external reference stabilizes the optimization trajectory.
+     """
+     r_s1 = _task_reward_stage1(lrm, iou, cfg, e0)
+     q_norm = F.normalize(q[0], dim=0)
+     r_tether = q_norm @ r_ref - cfg.beta_align * (q_norm @ r_neg)
+     return r_s1 + cfg.lambda_align * r_tether
+
+
+ def _task_reward_stage2_faithful(
+     q: torch.Tensor,                         # [1, 256] float32
+     lrm: torch.Tensor,                       # [A,1,256,256] float32
+     iou: torch.Tensor,                       # [A,1] float32
+     image_embeds_anchor_fp32: torch.Tensor,  # [A, 256, 64, 64] float32
+     r_ref: torch.Tensor,                     # [256] frozen
+     cfg: QLTPOConfig,
+     e0: float = 1.0,
+ ) -> torch.Tensor:
+     """Stage 22 (P1b faithful): Stage 1 + R_faithful.
+
+     R_faithful = mean_t[ cos(z_in(q,t), r_ref) - beta·cos(z_out(q,t), r_ref) ]
+     z_in/z_out come from the *current* mask (change during optimization), but the
+     teacher r_ref is frozen — breaking Stage 2's self-confirming bias while keeping
+     the same structural form (mask-region vs. reference alignment).
+     """
+     r_s1 = _task_reward_stage1(lrm, iou, cfg, e0)
+     A = lrm.shape[0]
+     masks_64 = F.interpolate(
+         torch.sigmoid(lrm.squeeze(1) / cfg.area_temp).unsqueeze(1),
+         size=(64, 64), mode="bilinear", align_corners=False,
+     ).squeeze(1)  # [A, 64, 64]
+
+     r_align = torch.tensor(0.0, device=q.device)
+     for t in range(A):
+         m = masks_64[t].detach()  # stopgrad on mask weights only
+         img = image_embeds_anchor_fp32[t]  # [256, 64, 64]
+         z_in = F.normalize((img * m.unsqueeze(0)).sum(dim=[1, 2]) / (m.sum() + 1e-6), dim=0)
+         z_out = F.normalize((img * (1 - m).unsqueeze(0)).sum(dim=[1, 2]) / ((1 - m).sum() + 1e-6), dim=0)
+         # teacher is r_ref (frozen), not z_in itself — no confirmation bias
+         r_align = r_align + z_in @ r_ref - cfg.beta_align * (z_out @ r_ref)
+     r_align = r_align / A
+
+     return r_s1 + cfg.lambda_align * r_align
+
+
+ def _decode_on_anchors_diff_adaptive(
+     q_global: torch.Tensor,                  # [1, 256] float32, requires_grad
+     delta: torch.Tensor,                     # [A, 256] float32, requires_grad
+     image_embeds_anchor_fp32: torch.Tensor,  # [A, 256, 64, 64] float32, detached
+     dense_emb_fp32: torch.Tensor,            # [1, 256, 64, 64] float32, detached
+     mask_decoder,
+     dense_pe_fp32: torch.Tensor,             # [1, 256, 64, 64] float32, detached
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     """Frame-adaptive differentiable decode: each anchor t uses q_t = q_global + delta[t].
+
+     Loops over A anchors to preserve gradient flow through both q_global and delta.
+     Returns low_res_masks [A,1,256,256] and iou_preds [A,1], both float32.
+     """
+     A = image_embeds_anchor_fp32.shape[0]
+     lrm_list: List[torch.Tensor] = []
+     iou_list: List[torch.Tensor] = []
+     for t in range(A):
+         q_t = q_global + delta[t : t + 1]  # [1, 256]
+         sparse_emb = q_t.unsqueeze(1)      # [1, 1, 256]
+         lrm_t, iou_t = mask_decoder(
+             image_embeddings=image_embeds_anchor_fp32[t : t + 1],
+             image_pe=dense_pe_fp32,
+             sparse_prompt_embeddings=sparse_emb,
+             dense_prompt_embeddings=dense_emb_fp32,
+             multimask_output=False,
+         )  # [1,1,256,256], [1,1]
+         lrm_list.append(lrm_t)
+         iou_list.append(iou_t)
+     return torch.cat(lrm_list, dim=0), torch.cat(iou_list, dim=0)  # [A,1,256,256], [A,1]
+
+
+ def _task_reward_frame_adaptive(
+     lrm: torch.Tensor,    # [A, 1, 256, 256] float32
+     iou: torch.Tensor,    # [A, 1] float32
+     cfg: "QLTPOConfig",
+     e0_vec: List[float],  # per-anchor existence priors [A]
+ ) -> torch.Tensor:
+     """Per-anchor task reward averaged over anchors (no regularization)."""
+     A = lrm.shape[0]
+     R = torch.tensor(0.0, device=lrm.device)
+     for t in range(A):
+         r_iou_t = iou[t].mean()
+         r_area_t = torch.sigmoid(lrm[t] / cfg.area_temp).mean()
+         R = R + cfg.lambda_iou * e0_vec[t] * r_iou_t - cfg.lambda_area * r_area_t
+     return R / A
+
+
+ def _compute_full_reward_adaptive(
+     q_global: torch.Tensor,  # [1, 256]
+     delta: torch.Tensor,     # [A, 256]
+     lrm: torch.Tensor,       # [A, 1, 256, 256]
+     iou: torch.Tensor,       # [A, 1]
+     q_init: torch.Tensor,    # [1, 256] detached
+     cfg: "QLTPOConfig",
+     e0_vec: List[float],
+ ) -> torch.Tensor:
+     """Full adaptive reward = task + residual penalty + temporal smoothness + L2 reg."""
+     r_task = _task_reward_frame_adaptive(lrm, iou, cfg, e0_vec)
+     r_delta = delta.pow(2).sum()
+     r_reg = (q_global - q_init).pow(2).sum()
+     R = r_task - cfg.lambda_residual * r_delta - cfg.lambda_reg * r_reg
+
+     A = delta.shape[0]
+     if A > 1 and cfg.lambda_smooth_temp > 0.0:
+         r_smooth = torch.tensor(0.0, device=delta.device)
+         for t in range(A - 1):
+             r_smooth = r_smooth + (delta[t] - delta[t + 1]).pow(2).sum()
+         R = R - cfg.lambda_smooth_temp * r_smooth / (A - 1)
+
+     return R
+
+
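The two delta penalties in the adaptive reward behave differently: the residual term penalises magnitude regardless of pattern, while the smoothness term fires only on adjacent-anchor differences. A small sketch (vectorized adjacent diff instead of the diff's per-pair loop; `adaptive_penalties` is an illustrative name):

```python
import torch

def adaptive_penalties(delta, lam_res=0.001, lam_smooth=0.1):
    # delta: [A, D] per-anchor residuals
    pen = lam_res * delta.pow(2).sum()                 # soft L2 on every residual
    A = delta.shape[0]
    if A > 1 and lam_smooth > 0.0:
        r_smooth = (delta[:-1] - delta[1:]).pow(2).sum()  # adjacent-anchor differences
        pen = pen + lam_smooth * r_smooth / (A - 1)
    return pen

smooth = torch.ones(4, 8)      # identical residuals on every anchor: no temporal jitter
jitter = torch.ones(4, 8)
jitter[1::2] = -1.0            # alternating sign: maximal adjacent change, same magnitude
assert adaptive_penalties(smooth) < adaptive_penalties(jitter)
```

Both toy inputs have identical residual norms, so the gap comes entirely from the temporal-smoothness term.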
  def _compute_task_reward(
      q: torch.Tensor,
      lrm: torch.Tensor,

      image_embeds_anchor_fp32: torch.Tensor,
      cfg: QLTPOConfig,
      e0: float = 1.0,
+     r_ref: Optional[torch.Tensor] = None,
+     r_neg: Optional[torch.Tensor] = None,
  ) -> torch.Tensor:
      """Dispatch to the correct stage's task reward."""
      if cfg.stage == 1:
          return _task_reward_stage1(lrm, iou, cfg, e0)
      if cfg.stage == 2:
          return _task_reward_stage2(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)
+     if cfg.stage == 21:
+         assert r_ref is not None and r_neg is not None, "stage 21 requires r_ref/r_neg"
+         return _task_reward_stage2_tether(q, lrm, iou, r_ref, r_neg, cfg, e0)
+     if cfg.stage == 22:
+         assert r_ref is not None, "stage 22 requires r_ref"
+         return _task_reward_stage2_faithful(q, lrm, iou, image_embeds_anchor_fp32, r_ref, cfg, e0)
      return _task_reward_stage3(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0)


      q_init: torch.Tensor,
      cfg: QLTPOConfig,
      e0: float = 1.0,
+     r_ref: Optional[torch.Tensor] = None,
+     r_neg: Optional[torch.Tensor] = None,
  ) -> torch.Tensor:
      """Full reward = task reward + L2 regularization (used for backward)."""
+     r_task = _compute_task_reward(q, lrm, iou, image_embeds_anchor_fp32, cfg, e0, r_ref, r_neg)
      r_reg = (q - q_init).pow(2).sum()
      return r_task - cfg.lambda_reg * r_reg

  }


+ # ---------------------------------------------------------------------------
+ # AVT proxy reward (Step A0: reward–metric correlation study)
+ # ---------------------------------------------------------------------------
+
+ @torch.no_grad()
+ def _compute_avt_proxy_reward(
+     q_init_fp32: torch.Tensor,               # [1, 256] — frozen AVT anchor (= Fseg)
+     lrm: torch.Tensor,                       # [A, 1, 256, 256] float32
+     image_embeds_anchor_fp32: torch.Tensor,  # [A, 256, 64, 64] float32
+     cfg: "QLTPOConfig",
+     beta: float = 0.5,
+ ) -> Tuple[float, float]:
+     """Task-specific proxy reward using frozen q_init (Fseg) as teacher.
+
+     q_init = Fseg is already the audio+video+text fusion token produced by SimToken.
+     Using it as a frozen reference breaks Stage 2's self-confirming bias while
+     measuring whether the mask region aligns with the correct referent.
+
+     Returns:
+         R_avt   = mean_t cos(z_in_t, q_init)                               [scalar]
+         R_avt_c = mean_t [cos(z_in_t, q_init) - beta·cos(z_out_t, q_init)] [scalar]
+     """
+     A = lrm.shape[0]
+     q_norm = F.normalize(q_init_fp32[0], dim=0)  # [256]
+
+     masks_64 = F.interpolate(
+         torch.sigmoid(lrm.squeeze(1) / cfg.area_temp).unsqueeze(1),
+         size=(64, 64), mode="bilinear", align_corners=False,
+     ).squeeze(1)  # [A, 64, 64]
+
+     r_avt, r_avt_c = 0.0, 0.0
+     for t in range(A):
+         m = masks_64[t]
+         img = image_embeds_anchor_fp32[t]
+         z_in = F.normalize(
+             (img * m.unsqueeze(0)).sum(dim=[1, 2]) / (m.sum() + 1e-6), dim=0
+         )
+         z_out = F.normalize(
+             (img * (1.0 - m).unsqueeze(0)).sum(dim=[1, 2]) / ((1.0 - m).sum() + 1e-6), dim=0
+         )
+         c_in = (q_norm @ z_in).item()
+         c_out = (q_norm @ z_out).item()
+         r_avt += c_in
+         r_avt_c += c_in - beta * c_out
+     return r_avt / A, r_avt_c / A
+
+
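The in/out pooling that the AVT proxy relies on can be demonstrated on a toy feature map where the left half carries the referent feature and the right half its negation. A sketch (tiny 8-d features, hypothetical `avt_proxy` helper mirroring the masked-pooling logic):

```python
import torch
import torch.nn.functional as F

def avt_proxy(q, feats, mask, beta=0.5):
    # q: [D] frozen teacher; feats: [D, H, W]; mask: [H, W] soft mask in [0, 1]
    q = F.normalize(q, dim=0)
    z_in = F.normalize((feats * mask).sum(dim=[1, 2]) / (mask.sum() + 1e-6), dim=0)
    z_out = F.normalize((feats * (1 - mask)).sum(dim=[1, 2]) / ((1 - mask).sum() + 1e-6), dim=0)
    return (q @ z_in).item(), (q @ z_in - beta * (q @ z_out)).item()

D, H, W = 8, 4, 4
q = torch.randn(D)
feats = (-q).view(D, 1, 1).expand(D, H, W).clone()  # background: anti-referent feature
feats[:, :, :2] = q.view(D, 1, 1)                   # left half: the referent feature
mask_good = torch.zeros(H, W)
mask_good[:, :2] = 1.0                              # covers the referent region
mask_bad = 1.0 - mask_good                          # covers the background
r_good, _ = avt_proxy(q, feats, mask_good)
r_bad, _ = avt_proxy(q, feats, mask_bad)
assert r_good > r_bad  # mask on the referent aligns with the frozen teacher
```

Because the teacher `q` never changes, improving this score requires moving the mask onto referent-like features rather than redefining the reference, which is the bias the diff's docstring describes.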
  # ---------------------------------------------------------------------------
  # Stage 1–3: q-LTPO-autograd main optimizer
  # ---------------------------------------------------------------------------

      lr = cfg.lr if cfg.lr > 0 else 0.01 * rms.item()
      max_drift = cfg.max_drift if cfg.max_drift > 0 else 0.5 * q_init_fp32.norm().item()

+     # ── Precompute frozen external reference (stages 21, 22 only) ────────
+     r_ref, r_neg = None, None
+     if cfg.stage in (21, 22):
+         r_ref, r_neg = _compute_r_ref(q_init_fp32, image_embeds_anchor, cfg.r_ref_temp)
+
      # ── Baseline forward + e0 existence prior ────────────────────────────
      with torch.no_grad():
          lrm0, iou0 = _decode_on_anchors_diff(

      e0 = _compute_e0(r_area_soft_init, cfg)

      R_init_task = _compute_task_reward(
+         q_init_fp32, lrm0, iou0, image_embeds_anchor, cfg, e0=e0,
+         r_ref=r_ref, r_neg=r_neg,
      ).item()

      # ── Optimisation setup ────────────────────────────────────────────────

      hit_clip = False

      # ── Optimisation loop ─────────────────────────────────────────────────
+     # Track per-step soft area to diagnose whether B1 penalty ever activates.
+     _step_soft_areas: List[float] = []
+
      for step in range(cfg.T):
          optimizer.zero_grad()

          lrm, iou = _decode_on_anchors_diff(
              q, image_embeds_anchor, dense_emb, mask_dec, dense_pe
          )
+         R_full = _compute_full_reward(q, lrm, iou, image_embeds_anchor, q_init_fp32, cfg, e0=e0,
+                                       r_ref=r_ref, r_neg=r_neg)
          R_full.backward()
          optimizer.step()

          lrm_eval, iou_eval = _decode_on_anchors_diff(
              q.detach(), image_embeds_anchor, dense_emb, mask_dec, dense_pe
          )
+         # Record soft area at this step for B1 activation diagnosis
+         _step_soft_areas.append(
+             torch.sigmoid(lrm_eval / cfg.area_temp).mean().item()
+         )
          r_task = _compute_task_reward(
+             q.detach(), lrm_eval, iou_eval, image_embeds_anchor, cfg, e0=e0,
+             r_ref=r_ref, r_neg=r_neg,
          ).item()
          if r_task > best_reward:
              best_reward = r_task
              best_q = q.detach().clone()

+     # Peak excess: how much did soft area exceed e0 at its highest point?
+     #   b1_peak_excess > 0 ↔ B1 ReLU was non-zero at that step.
+     #   b1_peak_excess = 0 ↔ B1 never activated (area stayed below e0 throughout).
+     _max_step_area = max(_step_soft_areas) if _step_soft_areas else r_area_soft_init
+     b1_peak_excess = max(_max_step_area - e0, 0.0)
+
      # ── Reward gating: clean re-eval of best_q vs q_init ─────────────────
      with torch.no_grad():
          lrm_b, iou_b = _decode_on_anchors_diff(
              best_q, image_embeds_anchor, dense_emb, mask_dec, dense_pe
          )
          R_best_task = _compute_task_reward(
+             best_q, lrm_b, iou_b, image_embeds_anchor, cfg, e0=e0,
+             r_ref=r_ref, r_neg=r_neg,
          ).item()

      area_init = (lrm0 > 0).float().mean().item()

      )
      accepted = R_best_task > R_init_task + effective_gate

+     # ── Mask soft-IoU: how much did the mask actually change? ─────────────
+     # Answers whether q-drift translated into mask change, or fell in a
+     # flat direction of the mask decoder manifold.
+     with torch.no_grad():
+         m0 = torch.sigmoid(lrm0 / cfg.area_temp).squeeze(1)   # [A,256,256]
+         mb = torch.sigmoid(lrm_b / cfg.area_temp).squeeze(1)  # [A,256,256]
+         inter = (m0 * mb).sum(dim=[1, 2])
+         union = (m0 + mb - m0 * mb).sum(dim=[1, 2])
+         mask_soft_iou = (inter / (union + 1e-6)).mean().item()
+
+     # Soft area at best_q — tracks whether B1 asymmetric penalty worked
+     r_area_soft_best = mb.mean().item()  # sigmoid(lrm_b/area_temp).mean()
+
+     # Reward decomposition: iou contribution to reward gain
+     R_iou_contrib_gain = (
+         cfg.lambda_iou * e0 * (iou_b.mean().item() - iou0.mean().item())
+     )
+
+     # AVT proxy reward (Step A0 correlation study)
+     r_avt_init, r_avt_c_init = _compute_avt_proxy_reward(
+         q_init_fp32, lrm0, image_embeds_anchor, cfg
+     )
+     r_avt_best, r_avt_c_best = _compute_avt_proxy_reward(
+         q_init_fp32, lrm_b, image_embeds_anchor, cfg
+     )
+
      # ── Per-sample diagnostics ────────────────────────────────────────────
      _q_ltpo_stats.append({
+         "accepted": accepted,
+         "reward_gain": R_best_task - R_init_task,
+         "drift": (best_q - q_init_fp32).norm().item(),
+         "hit_clip": hit_clip,
+         "e0": e0,
+         "R_iou_pred_init": iou0.mean().item(),
+         "R_iou_pred_best": iou_b.mean().item(),
+         "area_hard_init": area_init,
+         "area_hard_best": (lrm_b > 0).float().mean().item(),
+         "r_area_soft_init": r_area_soft_init,
+         "r_area_soft_best": r_area_soft_best,
+         "b1_peak_excess": b1_peak_excess,
+         "mask_soft_iou": mask_soft_iou,
+         "R_iou_contrib_gain": R_iou_contrib_gain,
+         # AVT proxy: frozen q_init as teacher — task-specific alignment
+         "r_avt_init": r_avt_init,
+         "r_avt_best": r_avt_best,
+         "r_avt_gain": r_avt_best - r_avt_init,
+         "r_avt_c_init": r_avt_c_init,
+         "r_avt_c_best": r_avt_c_best,
+         "r_avt_c_gain": r_avt_c_best - r_avt_c_init,
      })

      if not accepted:
          return F_init.float()
      return best_q
+
+
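The `mask_soft_iou` diagnostic added above is a plain soft IoU between the init and best mask probabilities. A standalone sketch (hypothetical `soft_iou` helper; hard toy squares with strongly saturated logits so identical masks score near 1):

```python
import torch

def soft_iou(logits_a, logits_b, temp=5.0):
    # Soft IoU between two mask logit maps [A, H, W]; ≈1 for identical hard masks.
    ma = torch.sigmoid(logits_a / temp)
    mb = torch.sigmoid(logits_b / temp)
    inter = (ma * mb).sum(dim=[1, 2])
    union = (ma + mb - ma * mb).sum(dim=[1, 2])
    return (inter / (union + 1e-6)).mean().item()

a = torch.full((1, 8, 8), -50.0)
a[0, 2:6, 2:6] = 50.0                 # 4×4 square of confident foreground
b = a.clone()                         # unchanged mask
c = torch.roll(a, shifts=3, dims=2)   # square shifted 3 columns
assert soft_iou(a, b) > 0.99          # no mask change → near 1
assert soft_iou(a, c) < 0.5           # shifted mask → low overlap
```

A value near 1 with nonzero q-drift is exactly the "flat decoder direction" case the diff's comment describes: the token moved but the decoded mask did not.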
1190
+ # ===========================================================================
1191
+ # Direction II: Frame-adaptive token optimization (stage=4)
1192
+ # q_t = q_global + delta_t — shared global token + per-anchor residual
1193
+ # ===========================================================================
1194
+
1195
+ def q_ltpo_frame_adaptive(
1196
+ F_init: torch.Tensor, # [1, 256] any dtype on CUDA
1197
+ image_embeds: torch.Tensor, # [T, 256, 64, 64] any dtype on CUDA
1198
+ anchor_indices: List[int],
1199
+ sam_model,
1200
+ model_dtype: torch.dtype,
1201
+ cfg: QLTPOConfig,
1202
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
1203
+ """Frame-adaptive q-LTPO: optimize q_global and per-anchor delta jointly.
1204
+
1205
+ Each anchor frame t gets its own token q_t = q_global + delta_t.
1206
+ delta_t is initialized to zero so q_t starts equal to q_init for all frames.
1207
+ Per-frame existence priors e0_t suppress optimization on near-empty anchors.
1208
+
1209
+ Returns:
1210
+ q_global [1, 256] float32 — shared global token
1211
+ delta [A, 256] float32 — per-anchor residuals (zero if not accepted)
1212
+ """
1213
+ device = F_init.device
1214
+ A = len(anchor_indices)
1215
+
1216
+ q_init_fp32 = F_init.float().detach()
1217
+ image_embeds_anchor = image_embeds[anchor_indices].float().detach()
1218
+ dense_emb = _precompute_dense_emb(sam_model, model_dtype, device).float().detach()
1219
+ dense_pe = sam_model.prompt_encoder.get_dense_pe().to(device).float().detach()
1220
+ mask_dec = sam_model.mask_decoder
1221
+
1222
+ rms = q_init_fp32.norm() / (q_init_fp32.numel() ** 0.5)
1223
+ lr = cfg.lr if cfg.lr > 0 else 0.01 * rms.item()
1224
+ max_drift = cfg.max_drift if cfg.max_drift > 0 else 0.5 * q_init_fp32.norm().item()
1225
+ max_delta_drift = cfg.max_delta_drift_scale * q_init_fp32.norm().item()
1226
+
1227
+ # ── Baseline: per-anchor e0 existence priors ────────────────────────────
1228
+ with torch.no_grad():
1229
+ lrm0, iou0 = _decode_on_anchors_diff(
1230
+ q_init_fp32, image_embeds_anchor, dense_emb, mask_dec, dense_pe
1231
+ )
1232
+ e0_vec: List[float] = []
1233
+ for t in range(A):
1234
+ e0_t = torch.sigmoid(lrm0[t] / cfg.area_temp).mean().item()
1235
+ e0_vec.append(_compute_e0(e0_t, cfg))
1236
+ e0_global = sum(e0_vec) / A
1237
+
1238
+    R_init_task = _task_reward_frame_adaptive(lrm0, iou0, cfg, e0_vec).item()
+
+    # ── Setup optimization ───────────────────────────────────────────────────
+    q_global = torch.nn.Parameter(q_init_fp32.clone())
+    delta = torch.nn.Parameter(torch.zeros(A, 256, device=device, dtype=torch.float32))
+    optimizer = torch.optim.Adam([q_global, delta], lr=lr, maximize=True)
+
+    best_q_global = q_global.detach().clone()
+    best_delta = delta.detach().clone()
+    best_reward = R_init_task
+    hit_clip = False
+
+    # ── Optimization loop ────────────────────────────────────────────────────
+    for step in range(cfg.T):
+        optimizer.zero_grad()
+        lrm, iou = _decode_on_anchors_diff_adaptive(
+            q_global, delta, image_embeds_anchor, dense_emb, mask_dec, dense_pe
+        )
+        R_full = _compute_full_reward_adaptive(
+            q_global, delta, lrm, iou, q_init_fp32, cfg, e0_vec
+        )
+        R_full.backward()
+        optimizer.step()
+
+        # Clip q_global and each per-anchor delta within trust regions
+        with torch.no_grad():
+            diff = q_global - q_init_fp32
+            d = diff.norm()
+            if d > max_drift:
+                q_global.copy_(q_init_fp32 + diff * (max_drift / d))
+                hit_clip = True
+            for t in range(A):
+                dn = delta[t].norm()
+                if dn > max_delta_drift:
+                    delta[t].copy_(delta[t] * (max_delta_drift / dn))
+
+        # Track best (no_grad re-eval of task reward without reg)
+        with torch.no_grad():
+            lrm_eval, iou_eval = _decode_on_anchors_diff_adaptive(
+                q_global.detach(), delta.detach(),
+                image_embeds_anchor, dense_emb, mask_dec, dense_pe
+            )
+            r_task = _task_reward_frame_adaptive(lrm_eval, iou_eval, cfg, e0_vec).item()
+            if r_task > best_reward:
+                best_reward = r_task
+                best_q_global = q_global.detach().clone()
+                best_delta = delta.detach().clone()
+
+    # ── Gating ───────────────────────────────────────────────────────────────
+    with torch.no_grad():
+        lrm_b, iou_b = _decode_on_anchors_diff_adaptive(
+            best_q_global, best_delta, image_embeds_anchor, dense_emb, mask_dec, dense_pe
+        )
+        R_best_task = _task_reward_frame_adaptive(lrm_b, iou_b, cfg, e0_vec).item()
+
+    accepted = R_best_task > R_init_task + cfg.gate_delta
+
+    area_init = (lrm0 > 0).float().mean().item()
+    r_area_soft_init = sum(torch.sigmoid(lrm0[t] / cfg.area_temp).mean().item() for t in range(A)) / A
+    r_area_soft_best = sum(torch.sigmoid(lrm_b[t] / cfg.area_temp).mean().item() for t in range(A)) / A
+
+    # Actual mask soft-IoU between init and best (per anchor, averaged)
+    m0 = torch.sigmoid(lrm0 / cfg.area_temp).squeeze(1)  # [A,256,256]
+    mb = torch.sigmoid(lrm_b / cfg.area_temp).squeeze(1)  # [A,256,256]
+    inter = (m0 * mb).sum(dim=[1, 2])
+    union = (m0 + mb - m0 * mb).sum(dim=[1, 2])
+    mask_soft_iou_fa = (inter / (union + 1e-6)).mean().item()
+
+    _q_ltpo_stats.append({
+        "accepted": accepted,
+        "reward_gain": R_best_task - R_init_task,
+        "drift": (best_q_global - q_init_fp32).norm().item(),
+        "delta_norm": best_delta.norm().item(),
+        "hit_clip": hit_clip,
+        "e0": e0_global,
+        "R_iou_pred_init": iou0.mean().item(),
+        "R_iou_pred_best": iou_b.mean().item(),
+        "area_hard_init": area_init,
+        "area_hard_best": (lrm_b > 0).float().mean().item(),
+        "r_area_soft_init": r_area_soft_init,
+        "r_area_soft_best": r_area_soft_best,
+        "b1_peak_excess": 0.0,
+        "mask_soft_iou": mask_soft_iou_fa,
+        "R_iou_contrib_gain": cfg.lambda_iou * e0_global * (iou_b.mean().item() - iou0.mean().item()),
+    })
+
+    if not accepted:
+        return q_init_fp32, torch.zeros(A, 256, device=device, dtype=torch.float32)
+    return best_q_global, best_delta
+
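The trust-region clipping inside the optimization loop above projects the optimized token back into an L2 ball around its initialization. A minimal standalone sketch of that projection (the function name `project_to_l2_ball` is illustrative, not part of the module):

```python
import torch

def project_to_l2_ball(vec: torch.Tensor, center: torch.Tensor, radius: float) -> torch.Tensor:
    """Project `vec` onto the L2 ball of `radius` around `center`.

    Mirrors the in-place clipping in the loop above: if the drift
    ||vec - center|| exceeds the trust-region radius, rescale the drift so the
    result lies exactly on the ball's boundary; otherwise return vec unchanged.
    """
    diff = vec - center
    d = diff.norm()
    if d > radius:
        return center + diff * (radius / d)
    return vec

# Example: a point drifting norm-5.0 from the origin, clipped to radius 2.0
center = torch.zeros(2)
vec = torch.tensor([3.0, 4.0])  # norm 5.0
clipped = project_to_l2_ball(vec, center, radius=2.0)
# clipped has norm 2.0; the drift direction is preserved
```

The same projection is applied independently to `q_global` (radius `max_drift`) and to each per-anchor row of `delta` (radius `max_delta_drift`, with `center = 0`).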
+
+
+def decode_full_video_adaptive(
+    q_global: torch.Tensor,      # [1, 256] float32
+    delta: torch.Tensor,         # [A, 256] float32
+    anchor_indices: List[int],
+    image_embeds: torch.Tensor,  # [T, 256, 64, 64] model dtype on CUDA
+    sam_model,
+    resize: tuple,
+    orgsize: tuple,
+    model_dtype: torch.dtype,
+) -> torch.Tensor:
+    """Decode all T frames with frame-adaptive tokens.
+
+    Each frame is assigned to its nearest anchor by index distance, then decoded
+    with q_t = q_global + delta[anchor_idx].
+    Returns raw logit masks [T, H_orig, W_orig].
+    """
+    T = image_embeds.shape[0]
+    A = len(anchor_indices)
+    device = image_embeds.device
+
+    dense_emb = _precompute_dense_emb(sam_model, model_dtype, device)
+    dense_pe = sam_model.prompt_encoder.get_dense_pe().to(device)
+
+    # Nearest-anchor assignment for every frame
+    anchor_arr = torch.tensor(anchor_indices, dtype=torch.float32)
+    frame_to_anchor = [int((anchor_arr - t).abs().argmin().item()) for t in range(T)]
+
+    pred_masks: List[torch.Tensor] = []
+    with torch.no_grad():
+        for t in range(T):
+            a = frame_to_anchor[t]
+            q_t = (q_global + delta[a : a + 1]).to(model_dtype)  # [1, 256]
+            sparse_emb = q_t.unsqueeze(1)  # [1, 1, 256]
+            lrm_t, _ = sam_model.mask_decoder(
+                image_embeddings=image_embeds[t : t + 1],
+                image_pe=dense_pe,
+                sparse_prompt_embeddings=sparse_emb,
+                dense_prompt_embeddings=dense_emb,
+                multimask_output=False,
+            )  # [1, 1, 256, 256]
+            pred_t = sam_model.postprocess_masks(lrm_t, input_size=resize, original_size=orgsize)
+            pred_masks.append(pred_t.squeeze(0).squeeze(0))  # [H, W]
+
+    return torch.stack(pred_masks, dim=0)  # [T, H_orig, W_orig]
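The nearest-anchor assignment in `decode_full_video_adaptive` can be illustrated in isolation. The helper name `assign_frames_to_anchors` below is made up for the sketch; the logic matches the `frame_to_anchor` list comprehension above:

```python
import torch

def assign_frames_to_anchors(anchor_indices, num_frames):
    """Map each frame index t to the anchor minimizing |anchor - t|.

    Same nearest-anchor rule as decode_full_video_adaptive: the returned list
    gives, per frame, the row of `delta` used when decoding that frame.
    Ties are resolved by torch.argmin's tie-breaking behavior.
    """
    anchor_arr = torch.tensor(anchor_indices, dtype=torch.float32)
    return [int((anchor_arr - t).abs().argmin().item()) for t in range(num_frames)]

# 8-frame clip with anchors at frames 0, 3, 6: frames 0-1 use anchor 0,
# frames 2-4 use anchor 1, frames 5-7 use anchor 2
mapping = assign_frames_to_anchors([0, 3, 6], 8)
# mapping == [0, 0, 1, 1, 1, 2, 2, 2]
```

With a single anchor, every frame maps to it and the decode degenerates to a global token `q_global + delta[0]` for the whole clip.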