nitzanguetta committed
Commit • 6c7c7f6
Parent(s): 4dd5ec2
Upload Visual-Riddles-Leaderboard.tsv
Visual-Riddles-Leaderboard.tsv CHANGED (+11 -11)
@@ -1,14 +1,14 @@
 Model	Open Ended VQA: % Human Rating	Multiple Choice VQA: % Accuracy	Hints-Multiple Choice VQA: % Accuracy	Attributions-Multiple Choice VQA: % Accuracy	Reference Based-Automatic Evaluation: Accuracy of Judge Prediction Compared to Human Ratings	Reference Free-Automatic Evaluation: Accuracy of Judge Prediction Compared to Human Ratings	Automatic Evaluation: % Auto-Rater Ratings	Hints-Automatic Evaluation: % Auto-Rater Ratings	Attributions-Automatic Evaluation: % Auto-Rater Ratings
-Humans	82
+Humans	82	*	*	*	*	*	78	*	*
 Gemini Pro 1.5	40	38	66	72	87	52	53	62	29
-Gemini Pro Vision	30	41	62
+Gemini Pro Vision	30	41	62	*	75	38	34	47
 GPT4	34	45	69	82	86	51	38	61	25
-LLaVA-1.6-34B	15	24	30
+LLaVA-1.6-34B	15	24	30	*	76	43	21	16	*
-LLaVA-1.5-7B	13	17	29
+LLaVA-1.5-7B	13	17	29	*	70	35	19	30	*
-InstructBlip	13
+InstructBlip	13	*	*	*	*	*	20	28	*
-Gemini Pro 1.5 Caption _ Gemini Pro 1.5	23
+Gemini Pro 1.5 Caption _ Gemini Pro 1.5	23	*	*	*	*	*	*	*	*
-Human (Oracle) Caption _ Gemini Pro 1.5	50
+Human (Oracle) Caption _ Gemini Pro 1.5	50	*	*	*	*	*	*	*	*
-Claude 3.5 Sonnet
+Claude 3.5 Sonnet	*	46	45	*	*	*	39	*	*
-GPT4o
+GPT4o	*	55	83	*	*	*	50	*	*
-Qwen-VL-Max
+Qwen-VL-Max	*	35	53	*	*	*	26	*	*
-Molmo-7B
+Molmo-7B	*	34	42	*	*	*	36	*	*
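For anyone consuming the updated file, here is a minimal sketch of loading it with Python's stdlib `csv` module, treating the `*` placeholder cells as missing values. The inline `SAMPLE` string is a hypothetical two-row excerpt for illustration; in practice you would open `Visual-Riddles-Leaderboard.tsv` itself, which this sketch assumes is tab-separated as the extension suggests.

```python
import csv
import io

# Hypothetical excerpt mirroring the updated rows; the real data lives in
# Visual-Riddles-Leaderboard.tsv.
SAMPLE = (
    "Model\tOpen Ended VQA: % Human Rating\tMultiple Choice VQA: % Accuracy\n"
    "Humans\t82\t*\n"
    "GPT4o\t*\t55\n"
)

def load_leaderboard(text):
    """Parse leaderboard TSV rows, mapping '*' (not evaluated) to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        parsed = {}
        for key, value in row.items():
            if key == "Model":
                parsed[key] = value          # model name stays a string
            else:
                parsed[key] = None if value == "*" else float(value)
        rows.append(parsed)
    return rows

rows = load_leaderboard(SAMPLE)
print(rows[0]["Model"], rows[0]["Open Ended VQA: % Human Rating"])  # Humans 82.0
```

Mapping `*` to `None` (rather than keeping the literal string) makes downstream aggregation straightforward, since numeric columns then contain only floats and `None`.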