tttoaster committed
Commit d0b41f3
Parent: 04a0fb2

Upload 15 files

Files changed (1):
constants.py (+2 -2)
constants.py CHANGED
@@ -34,7 +34,7 @@ LEADERBORAD_INTRODUCTION = """# SEED-Bench Leaderboard
 SEED-Bench-1 consists of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both the spatial and temporal understanding.
 Please refer to [SEED-Bench-1 paper](https://arxiv.org/abs/2307.16125) for more details.
 
-SEED-Bench-2 comprises 24K multiple-choice questions with accurate human anno- tations, which spans 27 dimensions, including the evalu- ation of both text and image generation.
+SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation.
 Please refer to [SEED-Bench-2 paper](https://arxiv.org/abs/2311.17092) for more details.
 """
 
@@ -104,7 +104,7 @@ TABLE_INTRODUCTION = """In the table below, we summarize each task performance o
 LEADERBORAD_INFO = """
 Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.
 [SEED-Bench-1](https://arxiv.org/abs/2307.16125) consists of 19K multiple choice questions with accurate human annotations (x6 larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality.
-[SEED-Bench-2](https://arxiv.org/abs/2311.17092) comprises 24K multiple-choice questions with accurate human anno- tations, which spans 27 dimensions, including the evalu- ation of both text and image generation.
+[SEED-Bench-2](https://arxiv.org/abs/2311.17092) comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation.
 We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes.
 Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation.
 By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research.
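
The leaderboard text above notes that multiple-choice questions with ground-truth options allow objective, automated scoring, with no human or GPT judge in the loop. Below is a minimal sketch of that style of evaluation; the file format, the field names (`question_id`, `answer`, `prediction`, `dimension`), and the function name are illustrative assumptions, not the actual SEED-Bench evaluation code.

```python
# Sketch of objective multiple-choice scoring as described above.
# The JSON schema and field names here are assumptions for illustration only.
import json
from collections import defaultdict


def score_predictions(gt_path: str, pred_path: str) -> dict:
    """Compare predicted option letters against ground-truth answers and
    report accuracy per evaluation dimension."""
    with open(gt_path) as f:
        ground_truth = {q["question_id"]: q for q in json.load(f)}
    with open(pred_path) as f:
        predictions = {p["question_id"]: p["prediction"] for p in json.load(f)}

    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, q in ground_truth.items():
        dim = q.get("dimension", "overall")
        total[dim] += 1
        # A prediction is correct iff the chosen option letter matches the annotation.
        if predictions.get(qid, "").strip().upper() == q["answer"].strip().upper():
            correct[dim] += 1

    return {dim: correct[dim] / total[dim] for dim in total}
```

Because scoring reduces to exact matching of option letters against human-annotated answers, the per-dimension accuracies can be computed deterministically and aggregated directly into the leaderboard table.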