FeilongTang committed on
Commit 36a9b0a · 1 Parent(s): 14590e3

Codec patch selection demo: visualization + canvas

- Replace probe with the codec_tools-style pipeline:
  uniform sample -> smart_resize -> per-patch saliency ->
  top-K selection -> visualization video + packed canvas.
- Three viz modes: selection (kept-in-color, dropped fade-to-gray),
  heatmap (full-frame JET overlay), sbs (side-by-side).
- Saliency: gradient (Sobel), frame_diff (motion), or combined.
- Tunables: time window (start/end sec), top-K, patch size,
  max_pixels, log1p scoring, percentile normalization, fade
  strength, heatmap blend alpha.
- Designer pass on UI: indigo Soft theme, hero gradient title,
  card-grouped controls, prominent Run button, output-priority
  layout, footer credit.
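
The per-patch saliency and top-K steps listed above can be illustrated in isolation. The following is a minimal sketch, not the code from app.py: it uses NumPy only and swaps the Sobel filter for `np.gradient` central differences so that it runs without OpenCV. The helper names `patch_means` and `gradient_magnitude` and the synthetic 56x56 frame are assumptions made for illustration.

```python
import numpy as np


def patch_means(values: np.ndarray, patch: int) -> np.ndarray:
    """Average a 2-D map over non-overlapping patch x patch cells."""
    hb, wb = values.shape[0] // patch, values.shape[1] // patch
    cropped = values[: hb * patch, : wb * patch]
    return cropped.reshape(hb, patch, wb, patch).mean(axis=(1, 3))


def gradient_magnitude(gray: np.ndarray) -> np.ndarray:
    """Central-difference gradient magnitude (stand-in for the Sobel filter)."""
    gy, gx = np.gradient(gray.astype(np.float32))
    return np.sqrt(gx * gx + gy * gy)


def topk_mask(score: np.ndarray, k: int) -> np.ndarray:
    """Keep every patch whose score reaches the k-th largest score."""
    flat = score.ravel()
    if k >= flat.size:
        return np.ones_like(score, dtype=np.uint8)
    if k <= 0:
        return np.zeros_like(score, dtype=np.uint8)
    thresh = np.partition(flat, -k)[-k]
    return (score >= thresh).astype(np.uint8)


# Synthetic frame (assumption): flat gray with one textured 14x14 block.
rng = np.random.default_rng(0)
frame = np.full((56, 56), 128.0, dtype=np.float32)
frame[14:28, 28:42] = rng.uniform(0, 255, size=(14, 14))

scores = patch_means(gradient_magnitude(frame), patch=14)  # 4x4 score grid
mask = topk_mask(scores, k=1)                              # 1 = kept patch
print(mask)
```

With one textured block on a flat background, the single kept patch is the one at grid position (1, 2), which is where the texture was placed.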

Files changed (3)
  1. .gitignore +20 -0
  2. app.py +655 -61
  3. requirements.txt +5 -0
.gitignore ADDED
@@ -0,0 +1,20 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.egg-info/
+
+ # Virtualenvs (local dev only)
+ .venv/
+ venv/
+ env/
+
+ # OS
+ .DS_Store
+
+ # Editors
+ .vscode/
+ .idea/
+
+ # Local outputs
+ codec_view_outputs/
+ *.log
app.py CHANGED
@@ -1,83 +1,677 @@
  import json
  import os
  import shutil
  import subprocess

  import gradio as gr


- def probe_video(video_path):
      if not video_path:
-         return "Please upload a video.", None

-     if shutil.which("ffprobe") is None:
-         return (
-             "ffprobe not found. Add `ffmpeg` to packages.txt at the repo root.",
-             None,
          )

-     try:
-         result = subprocess.run(
-             [
-                 "ffprobe",
-                 "-v", "quiet",
-                 "-print_format", "json",
-                 "-show_format",
-                 "-show_streams",
-                 video_path,
-             ],
-             capture_output=True,
-             text=True,
-             check=True,
          )
-     except subprocess.CalledProcessError as e:
-         return f"ffprobe failed:\n{e.stderr}", None
-
-     info = json.loads(result.stdout)
-     fmt = info.get("format", {}) or {}
-     streams = info.get("streams", []) or []
-     v = next((s for s in streams if s.get("codec_type") == "video"), {})
-     a = next((s for s in streams if s.get("codec_type") == "audio"), {})
-
-     summary = {
-         "filename": os.path.basename(fmt.get("filename", "")),
-         "format": fmt.get("format_long_name") or fmt.get("format_name"),
-         "duration_sec": fmt.get("duration"),
-         "size_bytes": fmt.get("size"),
-         "overall_bitrate_bps": fmt.get("bit_rate"),
-         "video": {
-             "codec": v.get("codec_name"),
-             "profile": v.get("profile"),
-             "width": v.get("width"),
-             "height": v.get("height"),
-             "pix_fmt": v.get("pix_fmt"),
-             "frame_rate": v.get("r_frame_rate"),
-             "bitrate_bps": v.get("bit_rate"),
          },
-         "audio": {
-             "codec": a.get("codec_name"),
-             "sample_rate": a.get("sample_rate"),
-             "channels": a.get("channels"),
-             "bitrate_bps": a.get("bit_rate"),
          },
      }
-     return json.dumps(summary, indent=2, ensure_ascii=False), video_path


- with gr.Blocks(title="OneVision Encoder Codec View") as demo:
-     gr.Markdown(
-         "# OneVision Encoder Codec View\n"
-         "Upload a video to inspect its container / codec metadata via `ffprobe`."
      )
-     with gr.Row():
-         with gr.Column():
-             video_in = gr.Video(label="Input video", sources=["upload"])
-             run_btn = gr.Button("Probe", variant="primary")
-         with gr.Column():
-             video_out = gr.Video(label="Preview")
-             info_out = gr.Code(label="Metadata (JSON)", language="json")
-     run_btn.click(probe_video, inputs=video_in, outputs=[info_out, video_out])


  if __name__ == "__main__":
-     demo.launch()
 
+ """OneVision Encoder Codec View.
+
+ A simplified, dependency-light port of the codec_tools pipeline from
+ lmms-eval-ov2. The original tool relies on a bitcost-patched ffmpeg 5.1 to
+ score every macroblock by its actual encoding bit cost; we approximate that
+ saliency signal with a Sobel gradient magnitude per patch (high gradient =
+ high local complexity = roughly what the encoder would spend bits on).
+
+ Pipeline (mirrors codec_tools/pipeline/process_video_bitcost_readiness.py):
+   1. Uniformly sample N frames from the input video.
+   2. smart_resize each frame so dims are multiples of `patch` and the
+      total pixel count <= max_pixels.
+   3. Slice every frame into a patch grid; score each patch by its
+      Sobel gradient magnitude mean.
+   4. Pick the top-K highest-scoring patches per frame.
+   5. Render a "selection visualization" video: kept patches stay in
+      full color, dropped patches are faded to a gray-white wash so the
+      viewer can see exactly which patches the codec stage chose.
+   6. Pack the selected patches in time-order, raster scan, into a
+      single canvas image (the artifact LLaVA-OneVision2 consumes).
+ """
+
  import json
+ import math
  import os
  import shutil
  import subprocess
+ import tempfile
+ import time
+ from typing import List, Tuple

+ import cv2
  import gradio as gr
+ import imageio_ffmpeg
+ import numpy as np
+
+
+ PATCH_CHOICES = [14, 16, 28]
+
+
+ def smart_resize(frame: np.ndarray, max_pixels: int, factor: int) -> np.ndarray:
+     """Resize so h,w are multiples of `factor` and h*w <= max_pixels."""
+     h, w = frame.shape[:2]
+     pixels = h * w
+     if pixels > max_pixels:
+         scale = math.sqrt(max_pixels / pixels)
+         h = max(factor, int(h * scale))
+         w = max(factor, int(w * scale))
+     h = max(factor, (h // factor) * factor)
+     w = max(factor, (w // factor) * factor)
+     return cv2.resize(frame, (w, h), interpolation=cv2.INTER_AREA)
+
+
+ def sample_frame_ids(total: int, n: int) -> List[int]:
+     if total <= 0:
+         return []
+     if n >= total:
+         return list(range(total))
+     return [int(round(i)) for i in np.linspace(0, total - 1, n)]
+
+
+ def decode_frames(video_path: str, frame_ids: List[int]) -> List[np.ndarray]:
+     cap = cv2.VideoCapture(video_path)
+     if not cap.isOpened():
+         return []
+     frames: List[np.ndarray] = []
+     for fid in frame_ids:
+         cap.set(cv2.CAP_PROP_POS_FRAMES, int(fid))
+         ok, fr = cap.read()
+         if ok:
+             frames.append(fr)
+     cap.release()
+     return frames
+
+
+ def video_metadata(video_path: str) -> dict:
+     cap = cv2.VideoCapture(video_path)
+     total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+     fps = float(cap.get(cv2.CAP_PROP_FPS) or 0.0)
+     w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+     h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+     cap.release()
+     meta = {
+         "total_frames": total,
+         "fps": round(fps, 3),
+         "width": w,
+         "height": h,
+     }
+     if shutil.which("ffprobe"):
+         try:
+             r = subprocess.run(
+                 [
+                     "ffprobe", "-v", "quiet", "-select_streams", "v:0",
+                     "-show_entries", "stream=codec_name,bit_rate,pix_fmt,profile",
+                     "-of", "json", video_path,
+                 ],
+                 capture_output=True, text=True, check=True, timeout=15,
+             )
+             data = json.loads(r.stdout).get("streams", [{}])[0]
+             meta["codec"] = data.get("codec_name")
+             meta["pix_fmt"] = data.get("pix_fmt")
+             meta["profile"] = data.get("profile")
+             meta["bitrate_bps"] = data.get("bit_rate")
+         except Exception as e:
+             meta["ffprobe_error"] = str(e)
+     return meta
+
+
+ def patch_score_grid(frame_bgr: np.ndarray, patch: int) -> np.ndarray:
+     """Return [hb, wb] grid of Sobel gradient magnitude means per patch."""
+     gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
+     gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
+     gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
+     mag = np.sqrt(gx * gx + gy * gy)
+     h, w = mag.shape
+     hb, wb = h // patch, w // patch
+     mag = mag[: hb * patch, : wb * patch]
+     grid = mag.reshape(hb, patch, wb, patch).mean(axis=(1, 3))
+     return grid.astype(np.float32)
+
+
+ def patch_score_frame_diff(
+     prev_bgr: np.ndarray, cur_bgr: np.ndarray, patch: int,
+ ) -> np.ndarray:
+     """Inter-frame absdiff per patch — proxy for motion / temporal complexity."""
+     if prev_bgr is None or prev_bgr.shape != cur_bgr.shape:
+         return patch_score_grid(cur_bgr, patch)
+     diff = cv2.absdiff(prev_bgr, cur_bgr).mean(axis=2).astype(np.float32)
+     h, w = diff.shape
+     hb, wb = h // patch, w // patch
+     diff = diff[: hb * patch, : wb * patch]
+     return diff.reshape(hb, patch, wb, patch).mean(axis=(1, 3))
+
+
+ def compute_score_grids(
+     frames: List[np.ndarray], patch: int, signal: str,
+ ) -> List[np.ndarray]:
+     """Build per-frame patch score grids from one of three signals:
+     - 'gradient' — Sobel magnitude only (intra-frame complexity)
+     - 'frame_diff' — absdiff vs previous frame (temporal motion)
+     - 'combined' — 0.5 * gradient_norm + 0.5 * frame_diff_norm
+     For 'combined', each component is independently shifted to [0,1] across
+     the whole sample so they contribute on equal footing."""
+     sig = (signal or "gradient").lower()
+     if sig == "gradient":
+         return [patch_score_grid(f, patch) for f in frames]
+     if sig == "frame_diff":
+         out = []
+         prev = None
+         for f in frames:
+             out.append(patch_score_frame_diff(prev, f, patch))
+             prev = f
+         return out
+     # combined
+     g = np.stack([patch_score_grid(f, patch) for f in frames], axis=0)
+     d_list = []
+     prev = None
+     for f in frames:
+         d_list.append(patch_score_frame_diff(prev, f, patch))
+         prev = f
+     d = np.stack(d_list, axis=0)
+
+     def _norm01(a: np.ndarray) -> np.ndarray:
+         a = a.astype(np.float32) - a.min()
+         m = a.max()
+         return a / m if m > 1e-8 else a
+
+     combined = 0.5 * _norm01(g) + 0.5 * _norm01(d)
+     return [combined[i] for i in range(combined.shape[0])]
+
+
+ def topk_mask(score: np.ndarray, k: int) -> np.ndarray:
+     flat = score.flatten()
+     if k >= flat.size:
+         return np.ones_like(score, dtype=np.uint8)
+     if k <= 0:
+         return np.zeros_like(score, dtype=np.uint8)
+     thresh = np.partition(flat, -k)[-k]
+     return (score >= thresh).astype(np.uint8)
+
+
+ def faded_background(frame_bgr: np.ndarray, fade: float = 0.55) -> np.ndarray:
+     """Convert to gray-white wash: gray * (1-fade) + white * fade."""
+     gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
+     gray_bgr = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR).astype(np.float32)
+     white = np.full_like(gray_bgr, 255.0)
+     out = gray_bgr * (1.0 - fade) + white * fade
+     return out.astype(np.uint8)
+
+
+ def overlay_selection(
+     frame_bgr: np.ndarray, mask_grid: np.ndarray, patch: int,
+     outline: bool = True, fade: float = 0.55,
+ ) -> np.ndarray:
+     """Composite: kept patches keep color; dropped patches become gray-white.
+     Optionally draw a thin outline around kept patches."""
+     h, w = frame_bgr.shape[:2]
+     hb, wb = mask_grid.shape
+     pix_mask = np.kron(mask_grid, np.ones((patch, patch), dtype=np.uint8))
+     pix_mask = pix_mask[:h, :w]
+     bg = faded_background(frame_bgr, fade=float(fade))
+     keep = pix_mask.astype(bool)[..., None]
+     out = np.where(keep, frame_bgr, bg)
+     if outline:
+         for i in range(hb):
+             for j in range(wb):
+                 if mask_grid[i, j]:
+                     y0, x0 = i * patch, j * patch
+                     cv2.rectangle(
+                         out, (x0, y0), (x0 + patch - 1, y0 + patch - 1),
+                         (0, 220, 255), 1,
+                     )
+     return out
+
+
+ def _normalize_scores(grids: List[np.ndarray], pct: float = 99.0) -> np.ndarray:
+     """Stack into [N, hb, wb], shift by per-video min, divide by global pct.
+     Using the percentile (instead of max) suppresses outlier patches the same
+     way codec_tools does with bitcost_pct=99."""
+     arr = np.stack(grids, axis=0).astype(np.float32)
+     arr = arr - arr.min()
+     cap = np.percentile(arr, pct) if arr.size else 1.0
+     if cap <= 1e-8:
+         cap = float(arr.max() or 1.0)
+     arr = np.clip(arr / cap, 0.0, 1.0)
+     return arr


+ def overlay_heatmap(
+     frame_bgr: np.ndarray, score_grid: np.ndarray, patch: int,
+     alpha: float = 0.55,
+ ) -> np.ndarray:
+     """Render a continuous JET heatmap of patch scores blended over the frame.
+     Low score = blue, high score = red. `score_grid` is in [0, 1]."""
+     h, w = frame_bgr.shape[:2]
+     score = (np.clip(score_grid, 0.0, 1.0) * 255.0).astype(np.uint8)
+     pix = np.kron(score, np.ones((patch, patch), dtype=np.uint8))
+     pix = pix[:h, :w]
+     heat = cv2.applyColorMap(pix, cv2.COLORMAP_JET)
+     out = cv2.addWeighted(frame_bgr, 1.0 - alpha, heat, alpha, 0.0)
+     return out
+
+
+ def overlay_sbs(
+     frame_bgr: np.ndarray, mask_grid: np.ndarray, score_grid: np.ndarray,
+     patch: int, alpha: float = 0.55, fade: float = 0.55,
+ ) -> np.ndarray:
+     """Side-by-side: [selection | heatmap] with a thin separator."""
+     left = overlay_selection(frame_bgr, mask_grid, patch, outline=True, fade=fade)
+     right = overlay_heatmap(frame_bgr, score_grid, patch, alpha=alpha)
+     h, w = left.shape[:2]
+     sep = np.full((h, 4, 3), 30, dtype=np.uint8)
+     sbs = np.concatenate([left, sep, right], axis=1)
+     cv2.putText(sbs, "selection", (8, 22), cv2.FONT_HERSHEY_SIMPLEX,
+                 0.6, (255, 255, 255), 2, cv2.LINE_AA)
+     cv2.putText(sbs, "heatmap", (w + 12, 22), cv2.FONT_HERSHEY_SIMPLEX,
+                 0.6, (255, 255, 255), 2, cv2.LINE_AA)
+     return sbs
+
+
+ def write_mp4(frames: List[np.ndarray], path: str, fps: float) -> None:
+     """Write H.264 mp4 via imageio-ffmpeg's bundled ffmpeg (browser-friendly)."""
+     if not frames:
+         raise ValueError("no frames to write")
+     h, w = frames[0].shape[:2]
+     ff = imageio_ffmpeg.get_ffmpeg_exe()
+     cmd = [
+         ff, "-y", "-loglevel", "error",
+         "-f", "rawvideo", "-vcodec", "rawvideo",
+         "-s", f"{w}x{h}", "-pix_fmt", "bgr24",
+         "-r", f"{fps:.3f}", "-i", "-",
+         "-an", "-vcodec", "libx264", "-pix_fmt", "yuv420p",
+         "-preset", "veryfast", "-crf", "23",
+         "-movflags", "+faststart",
+         path,
+     ]
+     proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
+     try:
+         for f in frames:
+             if f.shape[0] % 2 or f.shape[1] % 2:
+                 f = f[: f.shape[0] // 2 * 2, : f.shape[1] // 2 * 2]
+             proc.stdin.write(np.ascontiguousarray(f).tobytes())
+         proc.stdin.close()
+         err = proc.stderr.read().decode("utf-8", errors="ignore")
+         rc = proc.wait()
+         if rc != 0:
+             raise RuntimeError(f"ffmpeg failed (rc={rc}): {err}")
+     finally:
+         if proc.poll() is None:
+             proc.kill()
+
+
+ def pack_canvas(
+     frames: List[np.ndarray], masks: List[np.ndarray], patch: int,
+ ) -> Tuple[np.ndarray, int]:
+     """Collect every selected patch in time-order, raster-scan, into a
+     near-square canvas image. Empty slots are white."""
+     selected: List[np.ndarray] = []
+     for f, m in zip(frames, masks):
+         hb, wb = m.shape
+         for i in range(hb):
+             for j in range(wb):
+                 if m[i, j]:
+                     selected.append(
+                         f[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch].copy()
+                     )
+     n = len(selected)
+     if n == 0:
+         return np.full((patch, patch, 3), 255, dtype=np.uint8), 0
+     cn = int(math.ceil(math.sqrt(n)))
+     canvas = np.full((cn * patch, cn * patch, 3), 255, dtype=np.uint8)
+     for k, p in enumerate(selected):
+         ci, cj = k // cn, k % cn
+         canvas[ci * patch:(ci + 1) * patch, cj * patch:(cj + 1) * patch] = p
+     return canvas, n
+
+
+ def process(
+     video_path,
+     sample_frames: int,
+     patch_size: int,
+     top_k_per_frame: int,
+     max_pixels: int,
+     viz_mode: str = "selection",
+     heatmap_alpha: float = 0.55,
+     start_sec: float = 0.0,
+     end_sec: float = 0.0,
+     saliency_signal: str = "gradient",
+     score_log_scale: bool = False,
+     bitcost_pct: float = 99.0,
+     fade_strength: float = 0.55,
+     progress=gr.Progress(track_tqdm=False),
+ ):
      if not video_path:
+         return None, None, "Please upload a video."

+     t0 = time.time()
+     progress(0.05, desc="Reading metadata")
+     meta = video_metadata(video_path)
+     total = meta.get("total_frames") or 0
+     if total <= 0:
+         return None, None, json.dumps(
+             {"error": "Could not read frame count.", "metadata": meta},
+             indent=2, ensure_ascii=False,
          )

+     progress(0.10, desc="Sampling frames")
+     fps = float(meta.get("fps") or 0.0)
+     s_sec = max(0.0, float(start_sec or 0.0))
+     e_sec = float(end_sec or 0.0)
+     if fps > 0 and (s_sec > 0 or e_sec > 0):
+         f_start = max(0, int(round(s_sec * fps)))
+         f_end = (
+             min(total - 1, int(round(e_sec * fps)) - 1)
+             if e_sec > 0 else total - 1
          )
+         if f_end <= f_start:
+             f_end = total - 1
+         window_total = f_end - f_start + 1
+         if int(sample_frames) >= window_total:
+             fids = list(range(f_start, f_end + 1))
+         else:
+             fids = [
+                 int(round(x))
+                 for x in np.linspace(f_start, f_end, int(sample_frames))
+             ]
+     else:
+         f_start, f_end = 0, total - 1
+         fids = sample_frame_ids(total, int(sample_frames))
+     raw = decode_frames(video_path, fids)
+     if not raw:
+         return None, None, json.dumps(
+             {"error": "Failed to decode frames.", "metadata": meta},
+             indent=2, ensure_ascii=False,
+         )
+
+     progress(0.25, desc="smart_resize")
+     resized = [smart_resize(f, int(max_pixels), int(patch_size)) for f in raw]
+     th, tw = resized[0].shape[:2]
+     resized = [
+         cv2.resize(f, (tw, th), interpolation=cv2.INTER_AREA)
+         if f.shape[:2] != (th, tw) else f
+         for f in resized
+     ]
+
+     progress(0.40, desc=f"Scoring patches ({saliency_signal})")
+     grids = compute_score_grids(resized, int(patch_size), saliency_signal)
+     if score_log_scale:
+         grids = [np.log1p(np.clip(g, 0.0, None)) for g in grids]
+     masks = [topk_mask(g, int(top_k_per_frame)) for g in grids]
+     norm_scores = _normalize_scores(grids, pct=float(bitcost_pct))
+
+     mode = (viz_mode or "selection").lower()
+     if mode not in ("selection", "heatmap", "sbs"):
+         mode = "selection"
+     progress(0.60, desc=f"Rendering {mode} video")
+     if mode == "heatmap":
+         vis = [
+             overlay_heatmap(f, s, int(patch_size), alpha=float(heatmap_alpha))
+             for f, s in zip(resized, norm_scores)
+         ]
+     elif mode == "sbs":
+         vis = [
+             overlay_sbs(
+                 f, m, s, int(patch_size),
+                 alpha=float(heatmap_alpha), fade=float(fade_strength),
+             )
+             for f, m, s in zip(resized, masks, norm_scores)
+         ]
+     else:
+         vis = [
+             overlay_selection(f, m, int(patch_size), fade=float(fade_strength))
+             for f, m in zip(resized, masks)
+         ]
+
+     out_dir = tempfile.mkdtemp(prefix="codec_view_")
+     vis_path = os.path.join(out_dir, f"{mode}_vis.mp4")
+     vis_fps = max(2.0, min(8.0, (meta.get("fps") or 25.0) / 4.0))
+     write_mp4(vis, vis_path, vis_fps)
+
+     progress(0.85, desc="Packing canvas")
+     canvas, n_selected = pack_canvas(resized, masks, int(patch_size))
+     canvas_path = os.path.join(out_dir, "canvas.png")
+     cv2.imwrite(canvas_path, canvas)
+
+     hb, wb = grids[0].shape
+     info = {
+         "input": meta,
+         "params": {
+             "sample_frames": int(sample_frames),
+             "patch_size": int(patch_size),
+             "top_k_per_frame": int(top_k_per_frame),
+             "max_pixels": int(max_pixels),
+             "start_sec": float(s_sec),
+             "end_sec": float(e_sec) if e_sec > 0 else None,
+             "saliency_signal": saliency_signal,
+             "score_log_scale": bool(score_log_scale),
+             "bitcost_pct": float(bitcost_pct),
+             "fade_strength": float(fade_strength),
          },
+         "frame_window": {
+             "first_decoded": int(f_start),
+             "last_decoded": int(f_end),
+             "actual_frame_ids": [int(x) for x in fids],
          },
+         "resized_frame_size": f"{tw}x{th}",
+         "patch_grid_per_frame": f"{hb}x{wb} = {hb * wb} patches",
+         "selected_per_frame": int(min(top_k_per_frame, hb * wb)),
+         "total_selected_patches": int(n_selected),
+         "canvas_resolution": f"{canvas.shape[1]}x{canvas.shape[0]}",
+         "vis_video_fps": round(vis_fps, 2),
+         "viz_mode": mode,
+         "heatmap_alpha": float(heatmap_alpha) if mode != "selection" else None,
+         "score_normalization": f"shift-min, /p{bitcost_pct:.1f}, clip"
+         + (" (log1p applied)" if score_log_scale else ""),
+         "elapsed_sec": round(time.time() - t0, 2),
      }
+     progress(1.0, desc="Done")
+     return vis_path, canvas_path, json.dumps(info, indent=2, ensure_ascii=False)
+
+
+ CUSTOM_CSS = """
+ :root, .gradio-container, .gradio-container.dark {
+   --ovc-grad: linear-gradient(135deg, #4f46e5 0%, #2563eb 50%, #06b6d4 100%);
+ }
+ .gradio-container { max-width: 1280px !important; margin: 0 auto !important; }
+ #ovc-hero {
+   text-align: center;
+   padding: 28px 16px 8px;
+   border-radius: 16px;
+   background: linear-gradient(180deg, rgba(79,70,229,0.08), rgba(6,182,212,0.04));
+   margin-bottom: 8px;
+ }
+ #ovc-hero h1 {
+   font-size: 2.1rem;
+   font-weight: 700;
+   background: var(--ovc-grad);
+   -webkit-background-clip: text;
+   background-clip: text;
+   color: transparent;
+   margin: 0 0 6px;
+   letter-spacing: -0.02em;
+ }
+ #ovc-hero p.tagline {
+   font-size: 1.02rem;
+   color: var(--body-text-color-subdued);
+   margin: 0 auto 12px;
+   max-width: 720px;
+   line-height: 1.55;
+ }
+ #ovc-hero .pills { display:flex; flex-wrap:wrap; gap:6px; justify-content:center; margin-top:6px; }
+ #ovc-hero .pill {
+   font-size: 0.78rem;
+   font-weight: 600;
+   padding: 4px 10px;
+   border-radius: 999px;
+   color: #fff;
+   background: var(--ovc-grad);
+   opacity: 0.92;
+ }
+ .ovc-card {
+   border-radius: 14px !important;
+   padding: 14px 16px !important;
+   border: 1px solid var(--border-color-primary) !important;
+   background: var(--background-fill-primary) !important;
+   box-shadow: 0 1px 2px rgba(0,0,0,0.04);
+ }
+ .ovc-card h3 {
+   font-size: 0.86rem !important;
+   font-weight: 700 !important;
+   text-transform: uppercase;
+   letter-spacing: 0.06em;
+   color: var(--body-text-color-subdued) !important;
+   margin: 0 0 8px !important;
+ }
+ #ovc-run button {
+   width: 100%;
+   height: 48px !important;
+   font-size: 1.02rem !important;
+   font-weight: 600 !important;
+   background: var(--ovc-grad) !important;
+   border: none !important;
+   color: #fff !important;
+   border-radius: 12px !important;
+   box-shadow: 0 4px 14px rgba(37, 99, 235, 0.35);
+   transition: transform 0.05s ease;
+ }
+ #ovc-run button:hover { transform: translateY(-1px); }
+ #ovc-run button:active { transform: translateY(0); }
+ #ovc-footer {
+   text-align: center;
+   color: var(--body-text-color-subdued);
+   font-size: 0.78rem;
+   padding: 18px 8px 8px;
+   margin-top: 10px;
+ }
+ """
+
+ THEME = gr.themes.Soft(
+     primary_hue="indigo",
+     secondary_hue="blue",
+     neutral_hue="slate",
+     font=[gr.themes.GoogleFont("Inter"), "system-ui", "sans-serif"],
+ ).set(
+     body_background_fill="*neutral_50",
+     block_radius="14px",
+     button_primary_background_fill="*primary_500",
+     button_primary_background_fill_hover="*primary_600",
+ )
+
+ HERO_HTML = """
+ <div id="ovc-hero">
+   <h1>OneVision Encoder Codec View</h1>
+   <p class="tagline">
+     Visualize which patches a codec-style saliency picks from your video,
+     then pack them into the canvas LLaVA-OneVision2 consumes.
+     Use it to inspect <i>where</i> the model is actually looking.
+   </p>
+   <div class="pills">
+     <span class="pill">selection · heatmap · sbs</span>
+     <span class="pill">gradient + motion signals</span>
+     <span class="pill">canvas export</span>
+   </div>
+ </div>
+ """
+
+ with gr.Blocks(title="OneVision Encoder Codec View", theme=THEME, css=CUSTOM_CSS) as demo:
+     gr.HTML(HERO_HTML)
+
+     with gr.Row(equal_height=False):
+         # ─── Controls (narrow column) ────────────────────────────────────
+         with gr.Column(scale=4, min_width=320):
+             with gr.Group(elem_classes="ovc-card"):
+                 gr.Markdown("### Input")
+                 video_in = gr.Video(label="Video", sources=["upload"], height=240)

+             with gr.Group(elem_classes="ovc-card"):
+                 gr.Markdown("### Pipeline")
+                 viz_mode = gr.Radio(
+                     ["selection", "heatmap", "sbs"], value="selection",
+                     label="Visualization mode",
+                 )
+                 sample_frames = gr.Slider(
+                     4, 64, value=16, step=1, label="Sampled frames",
+                 )
+                 top_k = gr.Slider(
+                     4, 1024, value=64, step=4, label="Top-K patches per frame",
+                 )
+                 patch_size = gr.Radio(
+                     PATCH_CHOICES, value=14, label="Patch size (px)",
+                 )
+
+             with gr.Accordion("Time window", open=False):
+                 with gr.Row():
+                     start_sec = gr.Number(value=0.0, precision=2, label="Start (s)")
+                     end_sec = gr.Number(value=0.0, precision=2, label="End (s)")
+                 gr.Markdown(
+                     "<small>Set both to 0 to use the full video.</small>",
+                 )
+
+             with gr.Accordion("Saliency", open=False):
+                 saliency_signal = gr.Radio(
+                     ["gradient", "frame_diff", "combined"], value="gradient",
+                     label="Scoring signal",
+                     info="gradient = intra-frame Sobel · "
+                     "frame_diff = inter-frame motion · "
+                     "combined = 0.5 each.",
+                 )
+                 score_log_scale = gr.Checkbox(
+                     value=False, label="Apply log1p to scores",
+                 )
+                 bitcost_pct = gr.Slider(
+                     80.0, 99.9, value=99.0, step=0.1,
+                     label="Heatmap normalization percentile",
+                 )
+
+             with gr.Accordion("Visual style", open=False):
+                 heatmap_alpha = gr.Slider(
+                     0.1, 0.9, value=0.55, step=0.05,
+                     label="Heatmap blend α",
+                 )
+                 fade_strength = gr.Slider(
+                     0.0, 0.9, value=0.55, step=0.05,
+                     label="Selection fade strength",
+                 )
+                 max_pixels = gr.Slider(
+                     40000, 400000, value=150000, step=10000,
+                     label="Max pixels per frame",
+                 )
+
+             with gr.Row(elem_id="ovc-run"):
+                 run_btn = gr.Button("Run pipeline", variant="primary")
+
+         # ─── Outputs (wide column) ───────────────────────────────────────
+         with gr.Column(scale=6, min_width=420):
+             with gr.Group(elem_classes="ovc-card"):
+                 gr.Markdown("### Patch selection visualization")
+                 vis_out = gr.Video(
+                     label="", show_label=False, autoplay=True, height=420,
+                 )
+             with gr.Row():
+                 with gr.Column(scale=1):
+                     with gr.Group(elem_classes="ovc-card"):
+                         gr.Markdown("### Packed canvas")
+                         canvas_out = gr.Image(
+                             label="", show_label=False, show_download_button=True,
+                             height=320,
+                         )
+                 with gr.Column(scale=1):
+                     with gr.Group(elem_classes="ovc-card"):
+                         gr.Markdown("### Run info")
+                         info_out = gr.Code(
+                             label="", language="json", lines=14,
+                         )
+
+     gr.HTML(
+         '<div id="ovc-footer">'
+         'Approximation of the bitcost-driven patch selection in '
+         '<code>codec_tools/</code> · Sobel + frame-diff used as a stand-in '
+         'for the ffmpeg bitcost patch.'
+         '</div>'
+     )

+     run_btn.click(
+         process,
+         inputs=[
+             video_in, sample_frames, patch_size, top_k, max_pixels,
+             viz_mode, heatmap_alpha,
+             start_sec, end_sec,
+             saliency_signal, score_log_scale, bitcost_pct, fade_strength,
+         ],
+         outputs=[vis_out, canvas_out, info_out],
      )


  if __name__ == "__main__":
+     demo.launch(server_name="0.0.0.0", server_port=int(os.environ.get("PORT", 7860)))
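
As a worked example of the size arithmetic in `smart_resize` and `pack_canvas` above, the sketch below reproduces only the integer math, without OpenCV. The 1920x1080 input is a hypothetical video size; the other values are the UI defaults from this diff (patch 14, max_pixels 150000, 16 sampled frames, top-K 64). The helper names `target_size` and `canvas_side` are illustrative and do not appear in app.py, and the canvas figure assumes exactly 64 patches are kept in each frame.

```python
import math


def target_size(h: int, w: int, max_pixels: int, factor: int) -> tuple:
    """Same rounding as smart_resize: scale down, then snap to multiples of factor."""
    pixels = h * w
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
        h = max(factor, int(h * scale))
        w = max(factor, int(w * scale))
    h = max(factor, (h // factor) * factor)
    w = max(factor, (w // factor) * factor)
    return h, w


def canvas_side(n_selected: int, patch: int) -> int:
    """Side length in pixels of the near-square canvas used by pack_canvas."""
    if n_selected == 0:
        return patch
    return math.ceil(math.sqrt(n_selected)) * patch


h, w = target_size(1080, 1920, max_pixels=150_000, factor=14)
grid = (h // 14, w // 14)
side = canvas_side(16 * 64, patch=14)

print(h, w)   # 280 504
print(grid)   # (20, 36), i.e. 720 patches per frame
print(side)   # 448, i.e. a 448x448 canvas for 1024 patches
```

Under these assumptions the default top-K of 64 keeps roughly 9% of the 720 patches in each frame.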
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ opencv-python-headless>=4.8
+ numpy>=1.24
+ imageio>=2.34
+ imageio-ffmpeg>=0.5
+ Pillow>=10.0
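
One property of the threshold-based `topk_mask` in app.py can be seen on a small example: because it keeps every patch with `score >= thresh`, ties at the threshold keep more than K patches, and a frame with uniform scores keeps all of them. The sketch below mirrors that function and contrasts it with an index-based variant. `topk_mask_exact` is a hypothetical alternative, not part of the commit, and the commit does not state whether keeping extra patches on ties is intended.

```python
import numpy as np


def topk_mask_threshold(score: np.ndarray, k: int) -> np.ndarray:
    """Mirrors topk_mask in app.py: keep everything at or above the k-th score."""
    flat = score.flatten()
    if k >= flat.size:
        return np.ones_like(score, dtype=np.uint8)
    if k <= 0:
        return np.zeros_like(score, dtype=np.uint8)
    thresh = np.partition(flat, -k)[-k]
    return (score >= thresh).astype(np.uint8)


def topk_mask_exact(score: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical variant: keep exactly min(k, size) patches via argpartition."""
    flat = score.ravel()
    k = max(0, min(int(k), flat.size))
    mask = np.zeros(flat.size, dtype=np.uint8)
    if k > 0:
        mask[np.argpartition(flat, -k)[-k:]] = 1
    return mask.reshape(score.shape)


# A perfectly flat frame gives identical (zero) gradient scores everywhere.
flat_scores = np.zeros((4, 4), dtype=np.float32)
loose = topk_mask_threshold(flat_scores, k=3)
exact = topk_mask_exact(flat_scores, k=3)

print(int(loose.sum()))  # 16: every patch ties with the threshold
print(int(exact.sum()))  # 3
```

Which of the tied patches the index-based variant keeps is arbitrary, so it trades determinism on ties for a fixed patch budget.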