Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Kangaroo has been released. Please check out our paper, blog and github for details.

Abstract

We introduce Kangaroo, a powerful Multimodal Large Language Model designed for long-context video understanding. Our presented Kangaroo model shows remarkable performance across diverse video understanding tasks including video caption, QA and conversation. Generally, our key contributions in this work can be summarized as follows:

Long-context Video Input. To enhance the model's capability to comprehend longer videos, we extend the maximum frames of input videos to 160. To this end, we aggregate multiple videos with variable frame counts and aspect ratios into one sample. We further design a spatial-temporal pathify module to improve training efficiency.
Strong Performance. We evaluate our model across various video understanding benchmarks. The results indicate that our model achieves state-of-the-art performance on the majority of comprehensive benchmarks and maintain a competitive level in others. Notably, our model outperforms most larger open-source models with over 30B parameters and some proprietary models on certain benchmarks.
Video Annotation System. We develop a data curation and automatic annotation system to generate captions for open-source and internal videos. The generated large-scale dataset are utilized for video-text pre-training. For video instruction tuning stage, we construct a video instruciton tuning dataset based on public and internal datasets covering a variety of tasks.
Billingual Conversation. Our proposed model is equipped with the capability of Chinese, English and billingual conversations, and support single/multi-round conversation paradigms.

Quick Start

Installation

See our github page

Multi-round Chat with 🤗 Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("KangarooGroup/kangaroo")
model = AutoModelForCausalLM.from_pretrained(
    "KangarooGroup/kangaroo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = model.to("cuda")
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

video_path = "/path/to/video"

# Round 1
query = "Give a brief description of the video."
out, history = model.chat(video_path=video_path,
                          query=query,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9,)
print('Assitant: \n', out)

# Round 2
query = "What happend at the end of the video?"
out, history = model.chat(video_path=video_path,
                          query=query,
                          history=history,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9,)
print('Assitant: \n', out)

Citation

If you find it useful for your research , please cite related papers/blogs using this BibTeX:


@misc{kangaroogroup,
    title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
    url={https://kangaroogroup.github.io/Kangaroo.github.io/},
    author={Jiajun Liu and Yibing Wang and Hanghang Ma and Xiaoping Wu and Xiaoqi Ma and Jie Hu},
    month={July},
    year={2024}
}