zR committed · Commit 8f92f5a · 1 Parent(s): a3d3fa5

Update non-LFS files

Files changed (7):
  1. .mdl +0 -0
  2. .msc +0 -0
  3. .mv +0 -1
  4. LICENSE +75 -0
  5. LLAMA3_LICENSE +117 -0
  6. README.md +86 -0
  7. README_zh.md +65 -0
.mdl DELETED
Binary file (59 Bytes)
 
.msc DELETED
Binary file (1.26 kB)
 
.mv DELETED
@@ -1 +0,0 @@
- Revision:master,CreatedAt:1719926614
LICENSE ADDED
@@ -0,0 +1,75 @@
The CogVLM License

1. Definitions

“Licensor” means the CogVLM Model Team that distributes its Software.

“Software” means the CogVLM model parameters made available under this license.

2. License Grant

Under the terms and conditions of this license, the Licensor hereby grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license.
This license permits you to use all open-source models in this repository for academic research free of charge. Users who wish to use the models for commercial purposes must register [here](https://open.bigmodel.cn/mla/form).
Registered users may use the models for commercial activities free of charge, but must comply with all terms and conditions of this license.
The license notice shall be included in all copies or substantial portions of the Software.

3. Restriction

You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military or illegal purposes.

You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.

4. Disclaimer

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

5. Limitation of Liability

EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

6. Dispute Resolution

This license shall be governed and construed in accordance with the laws of the People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to the Haidian District People's Court in Beijing.

Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.

7. Llama3 and EVA-CLIP2 License

For CogVLM2 open-source models that use the Llama3 series as the base model, the Llama3 license conditions (https://llama.meta.com/llama3/license/; a copy is included in this repository) and the EVA-CLIP2 license conditions (MIT, https://github.com/baaivision/EVA/blob/master/LICENSE) apply to the model weights.

1. 定义

“许可方”是指分发其软件的 CogVLM 模型团队。

“软件”是指根据本许可提供的 CogVLM 模型参数。

2. 许可授予

根据本许可的条款和条件，许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
本许可允许您免费使用本仓库中的所有开源模型进行学术研究。对于希望将模型用于商业目的的用户，需在[这里](https://open.bigmodel.cn/mla/form)完成登记。
经过登记的用户可以免费使用本模型进行商业活动，但必须遵守本许可的所有条款和条件。
上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。

3. 限制

您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。

您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。

4. 免责声明

本软件“按原样”提供，不提供任何明示或暗示的保证，包括但不限于对适销性、特定用途的适用性和非侵权性的保证。在任何情况下，作者或版权持有人均不对任何索赔、损害或其他责任负责，无论是合同诉讼、侵权行为还是其他方面，因本软件、本软件的使用或其他交易而引起或与之相关。

5. 责任限制

除适用法律禁止的范围外，在任何情况下且根据任何法律理论，无论是基于侵权行为、疏忽、合同、责任或其他原因，任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性或间接损害，或任何其他商业损失的责任，即使许可方已被告知此类损害的可能性。

6. 争议解决

本许可受中华人民共和国法律管辖并按其解释。因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。

请注意，许可证可能会更新到更全面的版本。有关许可和版权的任何问题，请通过 license@zhipuai.cn 与我们联系。

7. Llama3 和 EVA-CLIP2 许可

针对以 Llama3 系列模型作为基座模型的 CogVLM2 开源模型，Llama3 许可条件（https://llama.meta.com/llama3/license/ ，本仓库中附有其副本）和 EVA-CLIP2 许可条件（MIT，https://github.com/baaivision/EVA/blob/master/LICENSE）适用于模型权重。
LLAMA3_LICENSE ADDED
@@ -0,0 +1,117 @@
META LLAMA 3 COMMUNITY LICENSE AGREEMENT
Meta Llama 3 Version Release Date: April 18, 2024

“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.

“Documentation” means the specifications, manuals and documentation accompanying Meta Llama 3 distributed by Meta at https://llama.meta.com/get-started/.

“Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.

“Meta Llama 3” means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at https://llama.meta.com/llama-downloads.

“Llama Materials” means, collectively, Meta’s proprietary Meta Llama 3 and Documentation (and any portion thereof) made available under this Agreement.

“Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).

By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.

1. License Rights and Redistribution.

a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.

b. Redistribution and Use.

i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service that uses any of them, including another AI model, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Meta Llama 3” on a related website, user interface, blogpost, about page, or product documentation. If you use the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama 3” at the beginning of any such AI model name.

ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.

iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a “Notice” text file distributed as a part of such copies: “Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.”

iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://llama.meta.com/llama3/use-policy), which is hereby incorporated by reference into this Agreement.

v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Meta Llama 3 or derivative works thereof).

2. Additional Commercial Terms. If, on the Meta Llama 3 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND META DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.

4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.

5. Intellectual Property.

a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials or as set forth in this Section 5(a). Meta hereby grants you a license to use “Llama 3” (the “Mark”) solely as required to comply with the last sentence of Section 1.b.i. You will comply with Meta’s brand guidelines (currently accessible at https://about.meta.com/brand/resources/meta/company-brand/). All goodwill arising out of your use of the Mark will inure to the benefit of Meta.

b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.

c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Meta Llama 3 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.

6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.

7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
README.md ADDED
@@ -0,0 +1,86 @@
---
license: other
license_name: cogvlm2
license_link: https://huggingface.co/THUDM/cogvlm2-llama3-video-19B/blob/main/LICENSE

language:
- en
pipeline_tag: text-generation
tags:
- chat
- cogvlm2
- cogvlm--video

inference: false
---
# CogVLM2-Video

[中文版本README](README_zh.md)

CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. The figure below shows the performance of CogVLM2-Video on the [MVBench](https://github.com/OpenGVLab/Ask-Anything), [VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT), and zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA). Here, VCG-* refers to VideoChatGPT-Bench, ZS-* to the zero-shot VideoQA datasets, and MV-* to the main categories in MVBench.

![Quantitative Evaluation](https://github.com/THUDM/CogVLM2/tree/main/resources/cogvlm2_video_bench.jpeg)

## Detailed performance

Performance on VideoChatGPT-Bench and the zero-shot VideoQA datasets:

| Models                | VCG-AVG  | VCG-CI   | VCG-DO   | VCG-CU   | VCG-TU   | VCG-CO   | ZS-AVG    |
|-----------------------|----------|----------|----------|----------|----------|----------|-----------|
| IG-VLM GPT4V          | 3.17     | 3.40     | 2.80     | 3.61     | 2.89     | 3.13     | 65.70     |
| ST-LLM                | 3.15     | 3.23     | 3.05     | 3.74     | 2.93     | 2.81     | 62.90     |
| ShareGPT4Video        | N/A      | N/A      | N/A      | N/A      | N/A      | N/A      | 46.50     |
| VideoGPT+             | 3.28     | 3.27     | 3.18     | 3.74     | 2.83     | **3.39** | 61.20     |
| VideoChat2_HD_mistral | 3.10     | 3.40     | 2.91     | 3.72     | 2.65     | 2.84     | 57.70     |
| PLLaVA-34B            | 3.32     | **3.60** | 3.20     | **3.90** | 2.67     | 3.25     | **68.10** |
| CogVLM2-Video         | **3.41** | 3.49     | **3.46** | 3.87     | **2.98** | 3.23     | 66.60     |

Performance on the MVBench dataset:

| Model                 | AVG      | AA       | AC       | AL       | AP       | AS       | CO       | CI       | EN    | ER       | FA       | FP       | MA       | MC       | MD       | OE       | OI       | OS   | ST       | SC   | UA       |
|-----------------------|----------|----------|----------|----------|----------|----------|----------|----------|-------|----------|----------|----------|----------|----------|----------|----------|----------|------|----------|------|----------|
| IG-VLM GPT4V          | 43.7     | 72.0     | 39.0     | 40.5     | **63.5** | 55.5     | 52.0     | 11.0     | 31.0  | 59.0     | 46.5     | 47.5     | 22.5     | 12.0     | 12.0     | 18.5     | 59.0     | 29.5 | 83.5     | 45.0 | 73.5     |
| ST-LLM                | 54.9     | 84.0     | 36.5     | 31.0     | 53.5     | 66.0     | 46.5     | 58.5     | 34.5  | 41.5     | 44.0     | 44.5     | 78.5     | 56.5     | 42.5     | 80.5     | 73.5     | 38.5 | 86.5     | 43.0 | 58.5     |
| ShareGPT4Video        | 51.2     | 79.5     | 35.5     | 41.5     | 39.5     | 49.5     | 46.5     | 51.5     | 28.5  | 39.0     | 40.0     | 25.5     | 75.0     | 62.5     | 50.5     | 82.5     | 54.5     | 32.5 | 84.5     | 51.0 | 54.5     |
| VideoGPT+             | 58.7     | 83.0     | 39.5     | 34.0     | 60.0     | **69.0** | 50.0     | 60.0     | 29.5  | 44.0     | 48.5     | 53.0     | 90.5     | 71.0     | 44.0     | **85.5** | 75.5     | 36.0 | 89.5     | 45.0 | 66.5     |
| VideoChat2_HD_mistral | 62.3     | 79.5     | **60.0** | **87.5** | 50.0     | 68.5     | **93.5** | 71.5     | 36.5  | 45.0     | 49.5     | **87.0** | 40.0     | **76.0** | **92.0** | 53.0     | 62.0     | 45.5 | 36.0     | 44.0 | 69.5     |
| PLLaVA-34B            | 58.1     | 82.0     | 40.5     | 49.5     | 53.0     | 67.5     | 66.5     | 59.0     | 39.5  | **63.5** | 47.0     | 50.0     | 70.0     | 43.0     | 37.5     | 68.5     | 67.5     | 36.5 | **91.0** | 51.5 | **79.0** |
| CogVLM2-Video         | **62.3** | **85.5** | 41.5     | 31.5     | 65.5     | 79.5     | 58.5     | **77.0** | 28.5  | 42.5     | **54.0** | 57.0     | **91.5** | 73.0     | 48.0     | **91.0** | **78.0** | 36.0 | **91.5** | 47.0 | 68.5     |

## Evaluation details

We follow prior work to evaluate the performance of our model, crafting a task-specific prompt for each benchmark:

```python
# For MVBench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Short Answer:"
# For VideoChatGPT-Bench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, comprehensively answer the following question. Your answer should be long and cover all the related aspects\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
# For Zero-shot VideoQA
prompt = f"The input consists of a sequence of key frames from a video. Answer the question comprehensively including all the possible verbs and nouns that can discribe the events, followed by significant events, characters, or objects that appear throughout the frames.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
```
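The three prompts above share the same structure: a benchmark-specific instruction, the question with any trailing "Short Answer." cue stripped, and an answer cue. As a minimal sketch of that structure (the `build_prompt` helper and the `TEMPLATES` table are our own illustration, not part of the released evaluation code):

```python
# Hypothetical helper illustrating the shared structure of the three
# benchmark prompts above; not part of the released evaluation code.

# (instruction, answer cue) per benchmark, taken verbatim from the prompts above
TEMPLATES = {
    "mvbench": (
        "Carefully watch the video and pay attention to the cause and sequence of events, "
        "the detail and movement of objects, and the action and pose of persons. "
        "Based on your observations, select the best option that accurately addresses the question.",
        "Short Answer:",
    ),
    "videochatgpt": (
        "Carefully watch the video and pay attention to the cause and sequence of events, "
        "the detail and movement of objects, and the action and pose of persons. "
        "Based on your observations, comprehensively answer the following question. "
        "Your answer should be long and cover all the related aspects",
        "Answer:",
    ),
    "zs_videoqa": (
        "The input consists of a sequence of key frames from a video. "
        "Answer the question comprehensively including all the possible verbs and nouns "
        "that can discribe the events, followed by significant events, characters, "
        "or objects that appear throughout the frames.",
        "Answer:",
    ),
}

def build_prompt(benchmark: str, question: str) -> str:
    """Assemble a benchmark prompt: instruction, cleaned question, answer cue."""
    instruction, cue = TEMPLATES[benchmark]
    # Strip the cue that some datasets append to the raw question,
    # exactly as the original snippets do with str.replace.
    question = question.replace("Short Answer.", "")
    return f"{instruction}\n {question}\n{cue}"
```

For example, `build_prompt("mvbench", "What is the person doing? Short Answer.")` yields the MVBench prompt ending in `Short Answer:` with the duplicate cue removed from the question.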

For evaluation code, please refer to the [evaluation script](https://github.com/magic-research/PLLaVA/blob/main/README.md) in PLLaVA.

## Using This Model

This repository contains a `base` version of the model and does not support chat.

You can quickly install the Python package dependencies and run model inference following our [github](https://github.com/THUDM/CogVLM2/tree/main/video_demo).

## License

This model is released under the CogVLM2 [LICENSE](LICENSE). For models built with Meta Llama 3, please also adhere to the [LLAMA3_LICENSE](LLAMA3_LICENSE).

## Training details

Please refer to our technical report for the training recipe and hyperparameters.

README_zh.md ADDED
@@ -0,0 +1,65 @@
# CogVLM2-Video

CogVLM2-Video 在多个视频问答任务上实现了最先进的性能。下图显示了 CogVLM2-Video 在 [MVBench](https://github.com/OpenGVLab/Ask-Anything)、[VideoChatGPT-Bench](https://github.com/mbzuai-oryx/Video-ChatGPT) 和 Zero-shot VideoQA 数据集（MSVD-QA、MSRVTT-QA、ActivityNet-QA）上的性能。

![Quantitative Evaluation](https://github.com/THUDM/CogVLM2/tree/main/resources/cogvlm2_video_bench.jpeg)

其中 VCG-* 指 VideoChatGPT-Bench，ZS-* 指零样本 VideoQA 数据集，MV-* 指 MVBench 中的主要类别。

## 评估结果

具体榜单测试数据如下：

| Models                | VCG-AVG  | VCG-CI   | VCG-DO   | VCG-CU   | VCG-TU   | VCG-CO   | ZS-AVG    |
|-----------------------|----------|----------|----------|----------|----------|----------|-----------|
| IG-VLM GPT4V          | 3.17     | 3.40     | 2.80     | 3.61     | 2.89     | 3.13     | 65.70     |
| ST-LLM                | 3.15     | 3.23     | 3.05     | 3.74     | 2.93     | 2.81     | 62.90     |
| ShareGPT4Video        | N/A      | N/A      | N/A      | N/A      | N/A      | N/A      | 46.50     |
| VideoGPT+             | 3.28     | 3.27     | 3.18     | 3.74     | 2.83     | **3.39** | 61.20     |
| VideoChat2_HD_mistral | 3.10     | 3.40     | 2.91     | 3.72     | 2.65     | 2.84     | 57.70     |
| PLLaVA-34B            | 3.32     | **3.60** | 3.20     | **3.90** | 2.67     | 3.25     | **68.10** |
| CogVLM2-Video         | **3.41** | 3.49     | **3.46** | 3.87     | **2.98** | 3.23     | 66.60     |

CogVLM2-Video 在 MVBench 数据集上的表现：

| Model                 | AVG      | AA       | AC       | AL       | AP       | AS       | CO       | CI       | EN    | ER       | FA       | FP       | MA       | MC       | MD       | OE       | OI       | OS   | ST       | SC   | UA       |
|-----------------------|----------|----------|----------|----------|----------|----------|----------|----------|-------|----------|----------|----------|----------|----------|----------|----------|----------|------|----------|------|----------|
| IG-VLM GPT4V          | 43.7     | 72.0     | 39.0     | 40.5     | **63.5** | 55.5     | 52.0     | 11.0     | 31.0  | 59.0     | 46.5     | 47.5     | 22.5     | 12.0     | 12.0     | 18.5     | 59.0     | 29.5 | 83.5     | 45.0 | 73.5     |
| ST-LLM                | 54.9     | 84.0     | 36.5     | 31.0     | 53.5     | 66.0     | 46.5     | 58.5     | 34.5  | 41.5     | 44.0     | 44.5     | 78.5     | 56.5     | 42.5     | 80.5     | 73.5     | 38.5 | 86.5     | 43.0 | 58.5     |
| ShareGPT4Video        | 51.2     | 79.5     | 35.5     | 41.5     | 39.5     | 49.5     | 46.5     | 51.5     | 28.5  | 39.0     | 40.0     | 25.5     | 75.0     | 62.5     | 50.5     | 82.5     | 54.5     | 32.5 | 84.5     | 51.0 | 54.5     |
| VideoGPT+             | 58.7     | 83.0     | 39.5     | 34.0     | 60.0     | **69.0** | 50.0     | 60.0     | 29.5  | 44.0     | 48.5     | 53.0     | 90.5     | 71.0     | 44.0     | **85.5** | 75.5     | 36.0 | 89.5     | 45.0 | 66.5     |
| VideoChat2_HD_mistral | 62.3     | 79.5     | **60.0** | **87.5** | 50.0     | 68.5     | **93.5** | 71.5     | 36.5  | 45.0     | 49.5     | **87.0** | 40.0     | **76.0** | **92.0** | 53.0     | 62.0     | 45.5 | 36.0     | 44.0 | 69.5     |
| PLLaVA-34B            | 58.1     | 82.0     | 40.5     | 49.5     | 53.0     | 67.5     | 66.5     | 59.0     | 39.5  | **63.5** | 47.0     | 50.0     | 70.0     | 43.0     | 37.5     | 68.5     | 67.5     | 36.5 | **91.0** | 51.5 | **79.0** |
| CogVLM2-Video         | **62.3** | **85.5** | 41.5     | 31.5     | 65.5     | 79.5     | 58.5     | **77.0** | 28.5  | 42.5     | **54.0** | 57.0     | **91.5** | 73.0     | 48.0     | **91.0** | **78.0** | 36.0 | **91.5** | 47.0 | 68.5     |

## 评估和复现

我们遵循以往工作的做法来评估模型性能，并为每个基准测试构造了特定于任务的提示词：

```python
# For MVBench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Short Answer:"
# For VideoChatGPT-Bench
prompt = f"Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, comprehensively answer the following question. Your answer should be long and cover all the related aspects\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
# For Zero-shot VideoQA
prompt = f"The input consists of a sequence of key frames from a video. Answer the question comprehensively including all the possible verbs and nouns that can discribe the events, followed by significant events, characters, or objects that appear throughout the frames.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
```
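上述三个提示词共享相同的结构：基准特定的任务指令、去掉尾部 "Short Answer." 提示的问题，以及作答引导语。下面是该结构的最小示意（`build_prompt` 函数与 `TEMPLATES` 表为本文档的假设性示例，并非官方评估代码的一部分）：

```python
# 假设性示例：抽取上述三个基准提示词的公共结构，并非官方评估代码。

# 每个基准对应 (任务指令, 作答引导语)，文本逐字取自上面的提示词
TEMPLATES = {
    "mvbench": (
        "Carefully watch the video and pay attention to the cause and sequence of events, "
        "the detail and movement of objects, and the action and pose of persons. "
        "Based on your observations, select the best option that accurately addresses the question.",
        "Short Answer:",
    ),
    "videochatgpt": (
        "Carefully watch the video and pay attention to the cause and sequence of events, "
        "the detail and movement of objects, and the action and pose of persons. "
        "Based on your observations, comprehensively answer the following question. "
        "Your answer should be long and cover all the related aspects",
        "Answer:",
    ),
    "zs_videoqa": (
        "The input consists of a sequence of key frames from a video. "
        "Answer the question comprehensively including all the possible verbs and nouns "
        "that can discribe the events, followed by significant events, characters, "
        "or objects that appear throughout the frames.",
        "Answer:",
    ),
}

def build_prompt(benchmark: str, question: str) -> str:
    """拼装基准提示词：任务指令 + 清理后的问题 + 作答引导语。"""
    instruction, cue = TEMPLATES[benchmark]
    # 与原始代码一致，用 str.replace 去掉部分数据集附带的引导语
    question = question.replace("Short Answer.", "")
    return f"{instruction}\n {question}\n{cue}"
```

例如 `build_prompt("mvbench", "What is the person doing? Short Answer.")` 会返回以 `Short Answer:` 结尾、且问题中重复引导语已被去除的 MVBench 提示词。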

有关评估代码，请参阅 PLLaVA 中的[评估脚本](https://github.com/magic-research/PLLaVA/blob/main/README.md)。

## 快速调用

本仓库为 `base` 版本模型，不支持对话。

您可以在我们的 [github](https://github.com/THUDM/CogVLM2/tree/main/video_demo) 中快速安装 Python 包依赖并运行模型推理。

## 模型协议

此模型根据 CogVLM2 [LICENSE](LICENSE) 发布。对于使用 Meta Llama 3 构建的模型，还请遵守 [LLAMA3_LICENSE](LLAMA3_LICENSE)。

## 引用

我们即将发布技术报告，敬请期待。