Spaces: Running on CPU Upgrade
Junming Yang committed on
Commit • 50f8568
Parent(s): 63e7f75

update meta_data POPE

meta_data.py +1 -0

meta_data.py CHANGED
@@ -191,5 +191,6 @@ LEADERBOARD_MD['POPE'] = """
 - POPE is a benchmark for object hallucination evaluation. It includes three tracks of object hallucination: random, popular, and adversarial.
 - Note that the official POPE dataset contains approximately 8910 cases. POPE includes three tracks, and there are some overlapping samples among the three tracks. To reduce the data file size, we have kept only a single copy of the overlapping samples (about 5127 examples). However, the final accuracy will be calculated on the ~9k samples.
+- Some API models, due to safety policies, refuse to answer certain questions, so their actual capabilities may be higher than the reported scores.
 - We report the average F1 score across the three types of data as the overall score. Accuracy, precision, and recall are also shown in the table. F1 score = 2 * (precision * recall) / (precision + recall).
 """
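
The overall score described in the leaderboard text (the average F1 over the random, popular, and adversarial tracks) can be sketched as follows. This is a minimal illustration, not code from this repo; the per-track precision/recall values are hypothetical placeholders.

```python
# Sketch of the POPE overall-score rule: F1 per track, then the mean
# over the three tracks. The numbers below are hypothetical placeholders.

def f1_score(precision: float, recall: float) -> float:
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-track (precision, recall) pairs:
tracks = {
    "random": (0.90, 0.85),
    "popular": (0.88, 0.80),
    "adversarial": (0.82, 0.78),
}

overall = sum(f1_score(p, r) for p, r in tracks.values()) / len(tracks)
print(round(overall, 4))  # → 0.8373
```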
|