zRzRzRzRzRzRzR committed on
Commit
b478626
1 Parent(s): 45515e3
Files changed (2)
  1. README.md +39 -17
  2. README_zh.md +3 -3
README.md CHANGED
@@ -13,20 +13,41 @@ tags:

  inference: false
  ---
- # CogVLM2-Video

  [Chinese README](README_zh.md)

- CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. The following diagram
- shows the performance of CogVLM2-Video on
  the [MVBench](https://github.com/OpenGVLab/Ask-Anything), [VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT)
  and Zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA), where VCG-* refers to VideoChatGPT-Bench, ZS-*
  to the Zero-shot VideoQA datasets, and MV-* to the main categories of MVBench.

  ![Quantitative Evaluation](https://github.com/THUDM/CogVLM2/raw/main/resources/cogvlm2_video_bench.jpeg)

- ## Detailed performance
-
  Performance on VideoChatGPT-Bench and the Zero-shot VideoQA datasets:

  | Models | VCG-AVG | VCG-CI | VCG-DO | VCG-CU | VCG-TU | VCG-CO | ZS-AVG |
@@ -41,15 +62,15 @@ Performance on VideoChatGPT-Bench and Zero-shot VideoQA dataset:

  Performance on the MVBench dataset:

- | Model | AVG | AA | AC | AL | AP | AS | CO | CI | EN | ER | FA | FP | MA | MC | MD | OE | OI | OS | ST | SC | UA |
- |-----------------------|----------|----------|----------|----------|----------|----------|----------|----------|-------|----------|----------|----------|----------|----------|----------|----------|----------|------|----------|------|----------|
- | IG-VLM GPT4V | 43.7 | 72.0 | 39.0 | 40.5 | **63.5** | 55.5 | 52.0 | 11.0 | 31.0 | 59.0 | 46.5 | 47.5 | 22.5 | 12.0 | 12.0 | 18.5 | 59.0 | 29.5 | 83.5 | 45.0 | 73.5 |
- | ST-LLM | 54.9 | 84.0 | 36.5 | 31.0 | 53.5 | 66.0 | 46.5 | 58.5 | 34.5 | 41.5 | 44.0 | 44.5 | 78.5 | 56.5 | 42.5 | 80.5 | 73.5 | 38.5 | 86.5 | 43.0 | 58.5 |
- | ShareGPT4Video | 51.2 | 79.5 | 35.5 | 41.5 | 39.5 | 49.5 | 46.5 | 51.5 | 28.5 | 39.0 | 40.0 | 25.5 | 75.0 | 62.5 | 50.5 | 82.5 | 54.5 | 32.5 | 84.5 | 51.0 | 54.5 |
- | VideoGPT+ | 58.7 | 83.0 | 39.5 | 34.0 | 60.0 | **69.0** | 50.0 | 60.0 | 29.5 | 44.0 | 48.5 | 53.0 | 90.5 | 71.0 | 44.0 | **85.5** | 75.5 | 36.0 | 89.5 | 45.0 | 66.5 |
- | VideoChat2_HD_mistral | 62.3 | 79.5 | **60.0** | **87.5** | 50.0 | 68.5 | **93.5** | 71.5 | 36.5 | 45.0 | 49.5 | **87.0** | 40.0 | **76.0** | **92.0** | 53.0 | 62.0 | 45.5 | 36.0 | 44.0 | 69.5 |
- | PLLaVA-34B | 58.1 | 82.0 | 40.5 | 49.5 | 53.0 | 67.5 | 66.5 | 59.0 | l39.5 | **63.5** | 47.0 | 50.0 | 70.0 | 43.0 | 37.5 | 68.5 | 67.5 | 36.5 | **91.0** | 51.5 | **79.0** |
- | CogVLM2-Video | **62.3** | **85.5** | 41.5 | 31.5 | 65.5 | 79.5 | 58.5 | **77.0** | 28.5 | 42.5 | **54.0** | 57.0 | **91.5** | 73.0 | 48.0 | **91.0** | **78.0** | 36.0 | **91.5** | 47.0 | 68.5 |

  ## Evaluation details

@@ -77,10 +98,11 @@ our [github](https://github.com/THUDM/CogVLM2/tree/main/video_demo).

  ## License

- This model is released under the CogVLM2 [LICENSE](LICENSE). For models built with Meta Llama 3, please also adhere to
- the [LLAMA3_LICENSE](LLAMA3_LICENSE).

  ## Training details

  Please refer to our technical report for training formula and hyperparameters.
-

  inference: false
  ---
+
+ # CogVLM2-Video-Llama3-Chat

  [Chinese README](README_zh.md)

+ ## Introduction
+
+ CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. It can achieve video
+ understanding within one minute. We provide two example videos to demonstrate CogVLM2-Video's video understanding and
+ video temporal grounding capabilities.
+
+ <table>
+ <tr>
+ <td>
+ <video width="100%" controls>
+ <source src="https://github.com/THUDM/CogVLM2/raw/main/resources/videos/lion.mp4" type="video/mp4">
+ </video>
+ </td>
+ <td>
+ <video width="100%" controls>
+ <source src="https://github.com/THUDM/CogVLM2/raw/main/resources/videos/basketball.mp4" type="video/mp4">
+ </video>
+ </td>
+ </tr>
+ </table>
+
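+ If you want to try such clips yourself, below is a minimal sketch of one plausible preprocessing step: uniformly
+ sampling key frames from a video before passing them to the model. It is an illustration, not the official pipeline;
+ the 24-frame budget and the helper name `sample_key_frames` are assumptions, and the actual inference entry points
+ live in the demo code on our [github](https://github.com/THUDM/CogVLM2/tree/main/video_demo).
+
+ ```python
+ # Hypothetical sketch: uniform key-frame sampling with decord.
+ # The 24-frame budget is an assumption; consult the official video_demo
+ # code for the preprocessing the model actually expects.
+ import numpy as np
+ from decord import VideoReader, cpu
+
+
+ def sample_key_frames(video_path: str, num_frames: int = 24) -> np.ndarray:
+     """Return num_frames uniformly spaced RGB frames as a (T, H, W, C) uint8 array."""
+     vr = VideoReader(video_path, ctx=cpu(0))
+     indices = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
+     return vr.get_batch(indices).asnumpy()
+
+
+ frames = sample_key_frames("basketball.mp4")  # e.g. one of the clips above
+ print(frames.shape)  # (24, H, W, 3)
+ ```
+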
+ ## Benchmark
+
+ The following diagram shows the performance of CogVLM2-Video on
  the [MVBench](https://github.com/OpenGVLab/Ask-Anything), [VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT)
  and Zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA), where VCG-* refers to VideoChatGPT-Bench, ZS-*
  to the Zero-shot VideoQA datasets, and MV-* to the main categories of MVBench.

  ![Quantitative Evaluation](https://github.com/THUDM/CogVLM2/raw/main/resources/cogvlm2_video_bench.jpeg)

  Performance on VideoChatGPT-Bench and the Zero-shot VideoQA datasets:

  | Models | VCG-AVG | VCG-CI | VCG-DO | VCG-CU | VCG-TU | VCG-CO | ZS-AVG |
 

  Performance on the MVBench dataset:

+ | Models | AVG | AA | AC | AL | AP | AS | CO | CI | EN | ER | FA | FP | MA | MC | MD | OE | OI | OS | ST | SC | UA |
+ |-----------------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
+ | IG-VLM GPT4V | 43.7 | 72.0 | 39.0 | 40.5 | 63.5 | 55.5 | 52.0 | 11.0 | 31.0 | 59.0 | 46.5 | 47.5 | 22.5 | 12.0 | 12.0 | 18.5 | 59.0 | 29.5 | 83.5 | 45.0 | 73.5 |
+ | ST-LLM | 54.9 | 84.0 | 36.5 | 31.0 | 53.5 | 66.0 | 46.5 | 58.5 | 34.5 | 41.5 | 44.0 | 44.5 | 78.5 | 56.5 | 42.5 | 80.5 | 73.5 | 38.5 | 86.5 | 43.0 | 58.5 |
+ | ShareGPT4Video | 51.2 | 79.5 | 35.5 | 41.5 | 39.5 | 49.5 | 46.5 | 51.5 | 28.5 | 39.0 | 40.0 | 25.5 | 75.0 | 62.5 | 50.5 | 82.5 | 54.5 | 32.5 | 84.5 | 51.0 | 54.5 |
+ | VideoGPT+ | 58.7 | 83.0 | 39.5 | 34.0 | 60.0 | 69.0 | 50.0 | 60.0 | 29.5 | 44.0 | 48.5 | 53.0 | 90.5 | 71.0 | 44.0 | 85.5 | 75.5 | 36.0 | 89.5 | 45.0 | 66.5 |
+ | VideoChat2_HD_mistral | **62.3** | 79.5 | **60.0** | **87.5** | 50.0 | 68.5 | **93.5** | 71.5 | 36.5 | 45.0 | 49.5 | **87.0** | 40.0 | **76.0** | **92.0** | 53.0 | 62.0 | **45.5** | 36.0 | 44.0 | 69.5 |
+ | PLLaVA-34B | 58.1 | 82.0 | 40.5 | 49.5 | 53.0 | 67.5 | 66.5 | 59.0 | **39.5** | **63.5** | 47.0 | 50.0 | 70.0 | 43.0 | 37.5 | 68.5 | 67.5 | 36.5 | 91.0 | 51.5 | **79.0** |
+ | CogVLM2-Video | **62.3** | **85.5** | 41.5 | 31.5 | **65.5** | **79.5** | 58.5 | **77.0** | 28.5 | 42.5 | **54.0** | 57.0 | **91.5** | 73.0 | 48.0 | **91.0** | **78.0** | 36.0 | **91.5** | **47.0** | 68.5 |

  ## Evaluation details


  ## License

+ This model is released under the
+ CogVLM2 [LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0).
+ For models built with Meta Llama 3, please also adhere to
+ the [LLAMA3_LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).

  ## Training details

  Please refer to our technical report for training formula and hyperparameters.
README_zh.md CHANGED
@@ -1,4 +1,4 @@
- # CogVLM2-Video

  CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. The following diagram shows the performance of CogVLM2-Video
  on [MVBench](https://github.com/OpenGVLab/Ask-Anything), [VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT)
@@ -57,8 +57,8 @@ prompt = f"The input consists of a sequence of key frames from a video. Answer t

  ## Model License

- This model is released under the CogVLM2 [LICENSE](LICENSE). For models built with Meta Llama 3, please also adhere to the
- [LLAMA3_LICENSE](LLAMA3_LICENSE).

  ## Citation

 
+ # CogVLM2-Video-Llama3-Chat

  CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. The following diagram shows the performance of CogVLM2-Video
  on [MVBench](https://github.com/OpenGVLab/Ask-Anything), [VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT)


  ## Model License

+ This model is released under the CogVLM2 [LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0). For models built with Meta Llama 3, please also adhere to the
+ [LLAMA3_LICENSE](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).

  ## Citation
