---
license: apache-2.0
pipeline_tag: image-text-to-text
---

**TinyLLaVA-Video**

[![arXiv](https://img.shields.io/badge/Arxiv-2501.15513-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2501.15513) [![Github](https://img.shields.io/badge/Github-Github-blue.svg)](https://github.com/ZhangXJ199/TinyLLaVA-Video)

Here we introduce TinyLLaVA-Video-Phi2-16-512. For the LLM and vision tower, we use [Phi-2](https://huggingface.co/microsoft/phi-2) and [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384), respectively. The model uniformly samples 16 frames from each video and represents the video sequence with 512 tokens. An illustrative sketch of this sampling scheme follows the results table below.

### Result

| Model (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU |
| :-------------: | :----------: | :-------: | :-----: | :------------: | :--: |
| [Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512) | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| [Zhang199/TinyLLaVA-Video-Phi2-16-512](https://huggingface.co/Zhang199/TinyLLaVA-Video-Phi2-16-512) | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |
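This card does not include an inference snippet; the minimal Python sketch below only illustrates the uniform 16-frame sampling described above. The use of OpenCV, the `sample_frames` helper name, and the `example.mp4` path are assumptions for illustration, not the repository's actual preprocessing code; see the GitHub repo for the real pipeline.

```python
# Illustrative only: uniform 16-frame sampling as described above.
# OpenCV and the helper name are assumptions, not the repo's preprocessing code.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

frames = sample_frames("example.mp4")  # 16 frames, later encoded into 512 video tokens
```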