nitzanguetta committed on
Commit
6c7c7f6
1 Parent(s): 4dd5ec2

Upload Visual-Riddles-Leaderboard.tsv

Files changed (1)
  1. Visual-Riddles-Leaderboard.tsv +11 -11
Visual-Riddles-Leaderboard.tsv CHANGED
@@ -1,14 +1,14 @@
  Model Open Ended VQA: % Human Rating Multiple Choice VQA: % Accuracy Hints-Multiple Choice VQA: % Accuracy Attributions-Multiple Choice VQA: % Accuracy Refernce Based-Automatic Evaluation: Accuracy of Judge Prediction Compared to Human Ratings Refernce Free-Automatic Evaluation: Accuracy of Judge Prediction Compared to Human Ratings Automatic Evaluation: % Auto-Rater Ratings Hints-Automatic Evaluation: % Auto-Rater Ratings Attributions-Automatic Evaluation: % Auto-Rater Ratings
- Humans 82 78
+ Humans 82 * * * * * 78 * *
  Gemini Pro 1.5 40 38 66 72 87 52 53 62 29
- Gemini Pro Vision 30 41 62 75 38 34 47
+ Gemini Pro Vision 30 41 62 * 75 38 34 47
  GPT4 34 45 69 82 86 51 38 61 25
- LlaVA-1.6-34B 15 24 30 76 43 21 16
- LlaVA-1.5-7B 13 17 29 70 35 19 30
- InstructBlip 13 20 28
- Gemini Pro 1.5 Caption _ Gemini Pro 1.5 23
- Human (Oracle) Caption _ Gemini Pro 1.5 50
- Claude 3.5 Sonnet 46 45 39
- GPT4o 55 83 50
- Qwen-VL-Max 35 53 26
- Molmo-7B 34 42 36
+ LlaVA-1.6-34B 15 24 30 * 76 43 21 16 *
+ LlaVA-1.5-7B 13 17 29 * 70 35 19 30 *
+ InstructBlip 13 * * * * * 20 28 *
+ Gemini Pro 1.5 Caption _ Gemini Pro 1.5 23 * * * * * * * *
+ Human (Oracle) Caption _ Gemini Pro 1.5 50 * * * * * * * *
+ Claude 3.5 Sonnet * 46 45 * * * 39 * *
+ GPT4o * 55 83 * * * 50 * *
+ Qwen-VL-Max * 35 53 * * * 26 * *
+ Molmo-7B * 34 42 * * * 36 * *
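
The changed rows replace empty cells with a `*` placeholder, so anything reading the file presumably needs to treat `*` as "no score" rather than as a number. Below is a minimal sketch of one way to do that with Python's standard `csv` module. It assumes the file is tab-separated (the extracted diff shows only spaces), and the short column names in `SAMPLE` are hypothetical stand-ins for the long headers in the real file; `load_leaderboard` is an illustrative helper, not part of the repository.

```python
import csv
import io

# Two rows copied from the updated file. Tab separators are an assumption,
# and the column names are shortened, hypothetical stand-ins for the real headers.
SAMPLE = (
    "Model\topen_ended\tmc\tmc_hints\tmc_attr\tref_based\tref_free\tauto\tauto_hints\tauto_attr\n"
    "Humans\t82\t*\t*\t*\t*\t*\t78\t*\t*\n"
    "GPT4o\t*\t55\t83\t*\t*\t*\t50\t*\t*\n"
)


def load_leaderboard(text, missing=("*", "")):
    """Parse TSV text into a list of dicts, mapping placeholder cells to None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = {"Model": record.pop("Model")}
        for column, value in record.items():
            if column is None:
                # Cells beyond the header width are ignored in this sketch.
                continue
            # Short rows yield None for absent trailing cells; treat them as missing too.
            if value is None or value.strip() in missing:
                row[column] = None
            else:
                row[column] = float(value)
        rows.append(row)
    return rows


rows = load_leaderboard(SAMPLE)
```

Treating both `*` and the empty string as missing means the same loader should work on the file before and after this commit, provided the pre-commit version kept its empty cells as consecutive tabs.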