LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Abstract
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training of Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps. Compared to previous studies that use ASR for vision-language representation learning, our method naturally fits the streaming nature of ASR, enabling the model to learn temporally aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline that processes YouTube videos and their closed captions (CC, i.e., ASR transcripts), yielding the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability: real-time video commentary. To evaluate this, we carefully design the new LiveSports-3K benchmark, which uses LLM-as-a-judge to assess free-form commentary quality. Experiments show our final LiveCC-7B-Instruct model surpasses advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality, even when running in real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources for this paper have been released at https://showlab.github.io/livecc.
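To make the interleaving idea concrete, below is a minimal Python sketch (not the released training code): it merges timestamped ASR words into a sequence of sampled frames so that each frame is followed by the words spoken during its interval. The `Word` dataclass, the `FPS` constant, and the `<frame_i>` placeholder tokens are illustrative assumptions rather than the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    start: float  # word start time in seconds, taken from the ASR transcript

FPS = 2.0  # assumed frame sampling rate; the real pipeline may use a different value

def interleave(frames: List[str], words: List[Word]) -> List[str]:
    """Densely interleave frame placeholders with ASR words by timestamp."""
    sequence, w = [], 0
    for i, frame in enumerate(frames):
        frame_end = (i + 1) / FPS            # end time of this frame's interval
        sequence.append(frame)               # e.g. a "<frame_i>" placeholder token
        # emit every ASR word that starts before this frame's interval ends
        while w < len(words) and words[w].start < frame_end:
            sequence.append(words[w].text)
            w += 1
    sequence.extend(word.text for word in words[w:])  # any trailing words
    return sequence

# Toy example: 2 seconds of video at 2 FPS with a five-word transcript
frames = [f"<frame_{i}>" for i in range(4)]
words = [Word("the", 0.1), Word("striker", 0.4), Word("shoots", 1.2),
         Word("and", 1.6), Word("scores", 2.1)]
print(interleave(frames, words))
# ['<frame_0>', 'the', 'striker', '<frame_1>', '<frame_2>', 'shoots', '<frame_3>', 'and', 'scores']
```

Because each word appears only after the frames covering its timestamp, the model is trained to emit commentary as the video unfolds, which is what enables the real-time commentary capability described in the abstract.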
Community
All open-sourced!
Project Page: https://showlab.github.io/livecc
Gradio Demo: https://huggingface.co/spaces/chenjoya/LiveCC
Training Code: https://github.com/showlab/livecc
SFT Model: https://huggingface.co/chenjoya/LiveCC-7B-Instruct
SFT Dataset: https://huggingface.co/datasets/chenjoya/Live-WhisperX-526K
Pretrain Model: https://huggingface.co/chenjoya/LiveCC-7B-Base
Pretrain Dataset: https://huggingface.co/datasets/chenjoya/Live-CC-5M
Benchmark: https://huggingface.co/datasets/stdKonjac/LiveSports-3K
Related papers recommended by the Librarian Bot (via the Semantic Scholar API):
- EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models (2025)
- Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models (2025)
- Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs (2025)
- Generative Frame Sampler for Long Video Understanding (2025)
- VideoA11y: Method and Dataset for Accessible Video Description (2025)
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding (2025)
- TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment (2025)