Current observations
- L40S 48GB → faster than realtime. Use pace:"realtime"to avoid client over‑buffering.
- L4 24GB → slightly below realtime even with pre‑roll buffering, TF32/Autotune, smaller chunks (max_decode_frames), and the base checkpoint.
Practical guidance
- For consistent realtime, target ~40GB VRAM per active stream (e.g., A100 40GB, or MIG slices ≈ 35–40GB on newer GPUs).
- Keep client‑side overlap‑add (25–40 ms) for seamless chunk joins.
- Prefer pace:"realtime"once playback begins; use ASAP only to build a short pre‑roll if needed.
- Optional knob: max_decode_frames(default 50 ≈ 2.0 s). Reducing to 36–45 can lower per‑chunk latency/VRAM, but doesn't increase frames/sec throughput.
Concurrency
This research build is designed for one active jam per GPU. Concurrency would require GPU partitioning (MIG) or horizontal scaling with a session scheduler.
