Upload 15 files
constants.py  CHANGED  +2 -2
@@ -34,7 +34,7 @@ LEADERBORAD_INTRODUCTION = """# SEED-Bench Leaderboard
 SEED-Bench-1 consists of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both the spatial and temporal understanding.
 Please refer to [SEED-Bench-1 paper](https://arxiv.org/abs/2307.16125) for more details.

-SEED-Bench-2 comprises 24K multiple-choice questions with accurate human
+SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation.
 Please refer to [SEED-Bench-2 paper](https://arxiv.org/abs/2311.17092) for more details.
 """

@@ -104,7 +104,7 @@ TABLE_INTRODUCTION = """In the table below, we summarize each task performance o
 LEADERBORAD_INFO = """
 Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.
 [SEED-Bench-1](https://arxiv.org/abs/2307.16125) consists of 19K multiple choice questions with accurate human annotations (x6 larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality.
-[SEED-Bench-2](https://arxiv.org/abs/2311.17092) comprises 24K multiple-choice questions with accurate human
+[SEED-Bench-2](https://arxiv.org/abs/2311.17092) comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation.
 We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes.
 Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation.
 By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research.
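The LEADERBORAD_INFO text above explains why ground-truth options matter: each multiple-choice question can be scored by exact match against its annotated answer, so no human or GPT judge is needed during evaluation. Below is a minimal sketch of that kind of option-matching scoring with per-dimension accuracy; the record fields, function name, and dimension labels are illustrative assumptions, not the leaderboard's actual schema or code.

```python
# Sketch of objective multiple-choice scoring: exact match of the predicted
# option letter against the annotated ground-truth option, aggregated per
# evaluation dimension. All field names here are hypothetical.
from collections import defaultdict

def score(results):
    """results: iterable of dicts with 'prediction', 'answer', 'dimension' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["dimension"]] += 1
        # Objective check: no judge model, just string equality on the option letter.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["dimension"]] += 1
    per_dimension = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return per_dimension, overall

if __name__ == "__main__":
    demo = [
        {"dimension": "Scene Understanding", "prediction": "A", "answer": "A"},
        {"dimension": "Scene Understanding", "prediction": "C", "answer": "B"},
        {"dimension": "Action Recognition", "prediction": "D", "answer": "D"},
    ]
    print(score(demo))
```

Per-dimension accuracies of this form are what a leaderboard table can report directly; how SEED-Bench weights them into an overall score is not shown in this diff.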