Model Card

Paper | Model

Video-XL-Pro 3B is a multimodal large model designed for understanding extremely long videos, supporting inputs of up to 10,000 frames. It uses a novel Reconstructive Token Compression mechanism to make long-range temporal reasoning efficient and effective.

✨ Highlights

🚀 SOTA Performance among 3B-scale models on:
- MLVU
- VideoMME
- VNBench
- LongVideoBench

🧠 Efficient Long Video Processing:
- Handles up to 10,000 frames on a single 80 GB A100 GPU
- Achieves ~98% accuracy on the Needle-in-a-Haystack benchmark

Quickstart

Before running the code snippet below, install the necessary dependencies by following the installation guide in our official GitHub repository. The installation covers setting up the conda environment and installing PyTorch and the other required packages.

Running Scripts

import torch
import transformers
import gc
from videoxlpro.videoxlpro.demo_utils import process_video, load_image_processor, generate_response
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings

# Suppress some warnings
transformers.logging.set_verbosity_error()
warnings.filterwarnings('ignore')

# Select the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Model and video paths
model_path = "lxr2003/Video-XL-Pro-3B"
video_path = "/path/to/your/example_video.mp4"

# Load the model and tokenizer with the Auto classes
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    low_cpu_mem_usage=True, 
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # needs the flash-attn package and a supported GPU
    device_map=device,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True
)

image_processor = load_image_processor(model, tokenizer)

max_frames_num = 128

# Process the video into frame tensors and time embeddings
video_tensor, time_embed = process_video(video_path, tokenizer, image_processor, model.device, max_frames_num)

# Generation parameters
gen_kwargs = {
    "do_sample": True,
    "temperature": 0.01,
    "top_p": 0.001,
    "num_beams": 1,
    "use_cache": True,
    "max_new_tokens": 256
}

# Text prompt
prompt = "Describe this video."

text = f"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

response = generate_response(model, tokenizer, text, video_tensor, time_embed, gen_kwargs)

# Print the result
print("\n===== Generated response =====")
print(response)
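
The prompt string passed to generate_response uses ChatML-style turn markers (`<|im_start|>` and `<|im_end|>`) with an `<image>` placeholder in the user turn. To make it easier to swap in other questions, the string can be wrapped in a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, not a function from the Video-XL-Pro codebase, and it only reproduces the layout shown in the script above.

```python
def build_prompt(user_prompt: str,
                 system_prompt: str = "You are a helpful assistant.") -> str:
    """Assemble the chat-formatted prompt used in the quickstart script.

    Hypothetical helper for illustration; it mirrors the f-string in the
    script and is not part of the Video-XL-Pro codebase.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n<image>\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Same text as the hand-written f-string in the script.
text = build_prompt("Describe this video.")
print(text)
```

The returned string can be passed to generate_response in place of the hand-written `text` variable.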

Note: Replace the value of video_path ('/path/to/your/example_video.mp4') with the path to your own video.
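
The generation parameters in the script enable sampling (`do_sample=True`) but pair it with a very low temperature (0.01) and a very small `top_p` (0.001), which in practice makes decoding close to deterministic. The sketch below illustrates why, using a plain-Python version of temperature scaling and nucleus (top-p) filtering. It is a simplified illustration, not the transformers implementation, and the logits are made-up values for a four-token vocabulary.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest token set whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores, for illustration only
probs = softmax(logits, temperature=0.01)
candidates = nucleus(probs, top_p=0.001)
print(candidates)  # prints [0]: only the highest-scoring token survives
```

With a single candidate left after filtering, sampling always picks the same token, so repeated runs on the same input should produce nearly identical responses.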

License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those licenses. The content of this project itself is licensed under the Apache License 2.0.