kennymckormick committed
Commit 64d336c
Parent: a6e43e6

add OCRBench

Files changed (2)
  1. gen_table.py +5 -0
  2. meta_data.py +8 -1
gen_table.py CHANGED
@@ -78,6 +78,8 @@ def BUILD_L1_DF(results, fields):
             res[d].append(item[d]['Overall'])
             if d == 'MME':
                 scores.append(item[d]['Overall'] / 28)
+            elif d == 'OCRBench':
+                scores.append(item[d]['Final Score'] / 10)
             else:
                 scores.append(item[d]['Overall'])
             ranks.append(nth_large(item[d]['Overall'], [x[d]['Overall'] for x in results.values()]))
@@ -106,6 +108,9 @@ def BUILD_L2_DF(results, dataset):
     if dataset == 'MME':
         non_overall_fields = [x for x in non_overall_fields if not listinstr(['Perception', 'Cognition'], x)]
         overall_fields = overall_fields + ['Perception', 'Cognition']
+    if dataset == 'OCRBench':
+        non_overall_fields = [x for x in non_overall_fields if not listinstr(['Final Score'], x)]
+        overall_fields = ['Final Score']
 
     for m in results:
         item = results[m]
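The two branches added above fold differently-scaled benchmarks onto one 0-100 leaderboard scale. Below is a minimal sketch of that normalization, assuming MME's Overall has a maximum of 2800 and OCRBench's Final Score a maximum of 1000 (so dividing by 28 and 10 yields percentages); normalize_score is an illustrative helper, not a function in this repository:

# Illustrative helper (not part of the repo): map each dataset's headline
# score onto a common 0-100 scale before ranking.
def normalize_score(dataset: str, item: dict) -> float:
    if dataset == 'MME':
        # Assumption: MME 'Overall' sums sub-scores up to 2800.
        return item['Overall'] / 28
    elif dataset == 'OCRBench':
        # Assumption: OCRBench 'Final Score' is out of 1000.
        return item['Final Score'] / 10
    # Other benchmarks already report an accuracy in [0, 100].
    return item['Overall']

print(normalize_score('MME', {'Overall': 1960}))          # 70.0
print(normalize_score('OCRBench', {'Final Score': 750}))  # 75.0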
meta_data.py CHANGED
@@ -124,4 +124,11 @@ LEADERBOARD_MD['ScienceQA_VAL'] = """
 - During evaluation, we use `GPT-3.5-Turbo-0613` as the choice extractor for all VLMs if the choice can not be extracted via heuristic matching. **Zero-shot** inference is adopted.
 """
 
-LEADERBOARD_MD['ScienceQA_TEST'] = LEADERBOARD_MD['ScienceQA_VAL']
+LEADERBOARD_MD['ScienceQA_TEST'] = LEADERBOARD_MD['ScienceQA_VAL']
+
+LEADERBOARD_MD['OCRBench'] = """
+## OCRBench Evaluation Results
+
+- The evaluation of OCRBench is implemented by the official team: https://github.com/Yuliang-Liu/MultimodalOCR.
+- The performance of GPT4V might be underestimated: due to OpenAI's safety policy, GPT4V refuses to answer about 12% of the questions, returning "Your input image may contain content that is not allowed by our safety system."
+"""