Model Card for videollm-online-8b-v1plus

https://showlab.github.io/videollm-online/

Model Details

  • LLM: meta-llama/Meta-Llama-3-8B-Instruct
  • Vision Strategy:
    • Frame Encoder: google/siglip-large-patch16-384
    • Frame Tokens: CLS Token + Avg Pooled 3x3 Tokens
    • Frame FPS: 2 for training, 2~10 for inference
    • Frame Resolution: max resolution 384, with zero-padding to keep aspect ratio
    • Video Length: 10 minutes
  • Training Data: Ego4D Narration Stream 113K + Ego4D GoalStep Stream 21K
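
To make the vision strategy above concrete, the sketch below shows one plausible way to derive roughly 10 tokens per frame from SigLIP with Hugging Face transformers: the attention-pooled output as the CLS-like token plus a 3x3 average-pooled patch grid. This is an assumption-level illustration, not the repository's actual preprocessing (which additionally zero-pads frames to preserve aspect ratio); frame.jpg is a hypothetical input.

# assumption-level sketch of the per-frame tokens described above:
# 1 pooled ("CLS"-like) token + 3x3 average-pooled patch tokens = 10 tokens/frame
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, SiglipVisionModel

name = "google/siglip-large-patch16-384"
encoder = SiglipVisionModel.from_pretrained(name)
processor = AutoImageProcessor.from_pretrained(name)

frame = Image.open("frame.jpg")  # hypothetical video frame
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    out = encoder(**inputs)

patches = out.last_hidden_state                      # (1, 24*24, hidden): 384/16 = 24 patches per side
b, n, d = patches.shape
grid = patches.transpose(1, 2).reshape(b, d, 24, 24)
pooled = F.adaptive_avg_pool2d(grid, 3)              # (1, hidden, 3, 3)
pooled = pooled.flatten(2).transpose(1, 2)           # (1, 9, hidden)
cls_like = out.pooler_output.unsqueeze(1)            # SigLIP attention-pooled token, (1, 1, hidden)
frame_tokens = torch.cat([cls_like, pooled], dim=1)  # (1, 10, hidden)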

Model Sources

  • Project Page: https://showlab.github.io/videollm-online/
  • Repository: https://github.com/showlab/videollm-online

Uses

  • First, clone the GitHub repository and follow the installation instructions:
git clone https://github.com/showlab/videollm-online

Ensure you have Miniconda and Python version >= 3.10 installed, then run:

conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation
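
To verify the environment (optional), a quick Python check like the one below can confirm that a CUDA-enabled PyTorch and flash-attn are importable. This snippet is only an illustrative sanity check, not part of the repository.

# optional sanity check for the installation above
import torch
import flash_attn
print(torch.__version__, torch.cuda.is_available())  # expect True on a GPU machine
print(flash_attn.__version__)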

The PyTorch installation also pulls in an ffmpeg build, but it is an old version and usually produces very low-quality preprocessing. Please install the latest static ffmpeg build instead:

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-7.0.1-amd64-static ffmpeg   # adjust the version to match the downloaded release
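
As a quick way to confirm the static build works, the hedged sketch below (assuming a hypothetical input.mp4 in the working directory) calls the unpacked binary to dump frames at the 2 FPS training rate listed above; the actual demo handles video preprocessing itself.

# hedged sketch: dump frames from a hypothetical input.mp4 at 2 FPS using the
# static ffmpeg binary unpacked above; the demo performs its own preprocessing
import os
import subprocess

os.makedirs("frames", exist_ok=True)
subprocess.run(
    ["./ffmpeg/ffmpeg", "-i", "input.mp4", "-vf", "fps=2", "-q:v", "2", "frames/%06d.jpg"],
    check=True,
)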

If you want to try our model with real-time streaming audio, please also clone ChatTTS:

pip install omegaconf vocos vector_quantize_pytorch cython
git clone https://github.com/2noise/ChatTTS
mv ChatTTS demo/rendering/
  • Launch the gradio demo locally with:
python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
  • Or launch the CLI locally with:
python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
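
The released checkpoint is a PEFT adapter on meta-llama/Meta-Llama-3-8B-Instruct. Outside the demo scripts, the adapter's language-model weights could in principle be attached with the peft library as in the hedged sketch below. This is an assumption-level illustration only: the full model (vision encoder, frame tokens, streaming logic) is constructed by the repository's own code, so the demo.app / demo.cli entry points above are the supported way to run inference.

# hedged sketch only: attach the adapter's language-model weights to the base
# LLM with peft; this does not reconstruct the vision/streaming components and
# may fail if the adapter targets modules absent from the vanilla base model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = PeftModel.from_pretrained(base, "chenjoya/videollm-online-8b-v1plus")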

Citation

@inproceedings{videollm-online,
  author       = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
  title        = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
  booktitle    = {CVPR},
  year         = {2024},
}