from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str  # task key in the results JSON
    metric: str     # metric key in the results JSON
    col_name: str   # column name displayed on the leaderboard


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard 
    task0 = Task("random_accuracy", "random_accuracy", "Accuracy (Random)")
    task2 = Task("popular_accuracy", "popular_accuracy", "Accuracy (Popular)")
    task1 = Task("adversarial_accuracy", "adversarial_accuracy", "Accuracy (Adversarial)")
    
    task7 = Task("random_precision", "random_precision", "Precision (Random)")
    task3 = Task("popular_precision", "popular_precision", "Precision (Popular)")
    task11 = Task("adversarial_precision", "adversarial_precision", "Precision (Adversarial)")
    
    task8 = Task("random_recall", "random_recall", "Recall (Random)")
    task4 = Task("popular_recall", "popular_recall", "Recall (Popular)")
    task12 = Task("adversarial_recall", "adversarial_recall", "Recall (Adversarial)")
    
    task9 = Task("random_f1_score", "random_f1_score", "F1 Score (Random)")
    task5 = Task("popular_f1_score", "popular_f1_score", "F1 Score (Popular)")
    task13 = Task("adversarial_f1_score", "adversarial_f1_score", "F1 Score (Adversarial)")

    task10 = Task("random_yes_percentage", "random_yes_percentage", "Yes Percent (Random)")
    task6 = Task("popular_yes_percentage", "popular_yes_percentage", "Yes Percent (Popular)")
    task14 = Task("adversarial_yes_percentage", "adversarial_yes_percentage", "Yes Percent (Adversarial)")
    

NUM_FEWSHOT = 0  # Change to match your few-shot setting
# ---------------------------------------------------



# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">πŸ πŸ’¬ 3D-POPE Leaderboard πŸ…</h1>
<p><center>
<a href="https://3d-grand.github.io/" target="_blank">[Project Page]</a>
<a href="https://www.dropbox.com/scl/fo/5p9nb4kalnz407sbqgemg/AG1KcxeIS_SUoJ1hoLPzv84?rlkey=weunabtbiz17jitfv3f4jpmm1&dl=0" target="_blank">[3D-GRAND Data]</a>
<a href="https://www.dropbox.com/scl/fo/inemjtgqt2nkckymn65rp/AGi2KSYU9AHbnpuj7TWYihs?rlkey=ldbn36b1z6nqj74yv5ph6cqwc&dl=0" target="_blank">[3D-POPE Data]</a>
</center></p>
"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
#### This is the official leaderboard for the 3D Polling-based Object Probing Evaluation (3D-POPE) benchmark. 3D-POPE evaluates object hallucination in 3D-LLMs and was introduced in [3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination](https://3d-grand.github.io/).
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
# 3D-POPE: A Benchmark for Evaluating Hallucination in 3D-LLMs
### To systematically evaluate the hallucination behavior of 3D-LLMs, we introduce the 3D Polling-based Object Probing Evaluation (3D-POPE) benchmark. 3D-POPE is designed to assess a model's ability to accurately identify the presence or absence of objects in a given 3D scene.

## Dataset
To facilitate the 3D-POPE benchmark, we curate a dedicated dataset from the [ScanNet](https://arxiv.org/abs/1702.04405) dataset, utilizing the semantic classes from [ScanNet200](https://arxiv.org/abs/2204.07761). Specifically, we use the ScanNet validation set as the foundation for evaluating 3D-LLMs on the 3D-POPE benchmark.

**Benchmark design.** 3D-POPE consists of a set of triples, each comprising a 3D scene, a posed question, and a binary answer ("Yes" or "No") indicating the presence or absence of an object. To ensure a balanced dataset, we maintain a 1:1 ratio of existent to nonexistent objects when constructing these triples. For the selection of negative samples (i.e., nonexistent objects), we employ three distinct sampling strategies:

- **Random Sampling:** Nonexistent objects are randomly selected from the set of objects not present in the 3D scene.
- **Popular Sampling:** We select the top-k most frequent objects not present in the 3D scene, where k equals the number of objects currently in the scene.
- **Adversarial Sampling:** For each positively identified object in the scene, we rank objects that are not present and have not been used as negative samples by their frequency of co-occurrence with the positive object in the training dataset. The highest-ranking co-occurring object is then selected as the adversarial sample. This approach differs from the original [POPE](https://arxiv.org/abs/2305.10355) to avoid adversarial samples mirroring popular samples, as indoor scenes often contain similar objects.

These sampling strategies are designed to challenge the model's robustness and assess its susceptibility to different levels of object hallucination.
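
For concreteness, the Python sketch below illustrates the three strategies. The data structures (`scene_objects`, `class_frequency`, `cooccurrence`) and the function itself are illustrative assumptions, not the actual benchmark-construction code:

```python
import random

def sample_negatives(scene_objects, all_classes, class_frequency, cooccurrence, strategy):
    # scene_objects   -- collection of class names present in the scene
    # all_classes     -- all ScanNet200 class names
    # class_frequency -- Counter of how often each class occurs across training scenes
    # cooccurrence    -- dict mapping a class to a Counter of co-occurring classes
    absent = [c for c in all_classes if c not in scene_objects]
    k = len(scene_objects)  # 1:1 ratio of existent to nonexistent objects

    if strategy == "random":
        return random.sample(absent, k)
    if strategy == "popular":
        # Top-k most frequent classes that are absent from this scene.
        return sorted(absent, key=lambda c: class_frequency[c], reverse=True)[:k]
    if strategy == "adversarial":
        negatives, used = [], set(scene_objects)
        for obj in scene_objects:
            # Most frequently co-occurring class that is absent and not yet used.
            candidates = [c for c in absent if c not in used]
            best = max(candidates, key=lambda c: cooccurrence[obj][c])
            negatives.append(best)
            used.add(best)
        return negatives
    raise ValueError("unknown strategy: " + str(strategy))
```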

## Metrics
To evaluate a model's performance on the 3D-POPE benchmark, we report Precision, Recall, F1 Score, Accuracy, and Yes (%). Precision measures how often the model's "Yes" answers are correct, while Recall measures how many of the objects actually present the model affirms. Precision is particularly important for hallucination, since a low Precision means the model frequently claims that non-existing objects are present. The F1 Score combines Precision and Recall into a balanced summary and serves as the primary evaluation metric. Accuracy measures the proportion of correctly answered questions across both "Yes" and "No" cases. Finally, the Yes (%) metric reports the proportion of questions the model answers "Yes"; because the benchmark is balanced 1:1, a value well above 50% indicates a bias toward affirming (and thus hallucinating) objects.
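
The sketch below shows how these metrics can be computed from paired ground-truth and predicted answers. It is an illustrative example, not the exact evaluation script behind this leaderboard:

```python
def compute_pope_metrics(pairs):
    # pairs: list of (ground_truth, prediction) tuples, each "yes" or "no".
    tp = sum(1 for gt, pred in pairs if gt == "yes" and pred == "yes")
    fp = sum(1 for gt, pred in pairs if gt == "no" and pred == "yes")
    tn = sum(1 for gt, pred in pairs if gt == "no" and pred == "no")
    fn = sum(1 for gt, pred in pairs if gt == "yes" and pred == "no")

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    yes_ratio = (tp + fp) / len(pairs)

    return dict(accuracy=accuracy, precision=precision, recall=recall,
                f1_score=f1, yes_percentage=100.0 * yes_ratio)
```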

## Leaderboard 
We establish a public leaderboard for the 3D-POPE benchmark, allowing researchers to submit their 3D-LLM results and compare their performance against other state-of-the-art models. The leaderboard reports the evaluation metrics for each model under the three sampling strategies, providing a transparent and standardized way to assess the hallucination performance of 3D-LLMs.
"""

EVALUATION_QUEUE_TEXT = """
# Submitting results from your own model

Read the instructions below **carefully** to ensure your submission is properly formatted and complete.

You should submit a total of 12 JSON files, each containing outputs generated by your model on [our dataset](https://www.dropbox.com/scl/fo/inemjtgqt2nkckymn65rp/AGi2KSYU9AHbnpuj7TWYihs?rlkey=ldbn36b1z6nqj74yv5ph6cqwc&dl=0). 
These files should be packaged in a single ZIP file with the following folder structure:
```bash
❯ tree LEO/json-outputs
LEO/json-outputs
├── adversarial_template_1.json
├── adversarial_template_2.json
├── adversarial_template_3.json
├── adversarial_template_4.json
├── popular_template_1.json
├── popular_template_2.json
├── popular_template_3.json
├── popular_template_4.json
├── random_template_1.json
├── random_template_2.json
├── random_template_3.json
└── random_template_4.json
```

Each JSON file should contain data structured exactly like the following example from `random_template_1.json`:
```json
{
  "random_template_1": [
    {
      "source": "scannet",
      "scene_id": "scene0011_00",
      "question": "Are there any people in the room?",
      "ground_truth_answer": "no",
      "predicted_answer": "yes",
      "template": "template_1",
      "question_type": "random"
    },
    ...
  ]
}
```
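
As a convenience, the illustrative helper below (not part of the official submission pipeline) checks that all 12 expected files are present, does a light sanity check on each top-level key, and packages everything into a ZIP. The directory name follows the example above:

```python
import json
import zipfile
from pathlib import Path

def package_submission(output_dir, zip_path="submission.zip"):
    # One file per (question_type, template) combination: 3 x 4 = 12 files.
    expected = [
        f"{qtype}_template_{i}.json"
        for qtype in ("adversarial", "popular", "random")
        for i in range(1, 5)
    ]
    output_dir = Path(output_dir)
    missing = [name for name in expected if not (output_dir / name).exists()]
    if missing:
        raise FileNotFoundError("missing files: " + ", ".join(missing))

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in expected:
            # Sanity check: the top-level key should match the file name (minus .json).
            data = json.loads((output_dir / name).read_text())
            assert name[:-5] in data, name + " is missing its top-level key"
            # Files are written at the ZIP root here; adjust arcname if your
            # submission needs to keep the enclosing folder.
            zf.write(output_dir / name, arcname=name)
    return zip_path

# Example usage (path is hypothetical):
# package_submission("LEO/json-outputs")
```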

Submit your properly formatted ZIP file below, and your model's results will be added to the leaderboard.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite our work"
CITATION_BUTTON_TEXT = r"""
@misc{yang20243dgrand,
    title={3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs},
    author={Jianing Yang and Xuweiyi Chen and Nikhil Madaan and Madhavan Iyengar and Shengyi Qian and David F. Fouhey and Joyce Chai},
    year={2024},
    eprint={2406.05132},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
"""