from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Init: to update with your specific keys
class Tasks(Enum):
    # Task(benchmark key in the results JSON, metric key in the results JSON, column name to display in the leaderboard)
    task0 = Task("SAS", "weighted_accuracy", "SAS")
    task1 = Task("SPP", "weighted_accuracy", "SPP")
    task2 = Task("EDP", "weighted_accuracy", "EDP")
    task3 = Task("TSP_D", "weighted_accuracy", "TSP_D")
    task4 = Task("GCP_D", "weighted_accuracy", "GCP_D")
    task5 = Task("KSP", "weighted_accuracy", "KSP")
    task6 = Task("TSP", "weighted_accuracy", "TSP")
    task7 = Task("GCP", "weighted_accuracy", "GCP")
    task8 = Task("MSP", "weighted_accuracy", "MSP")

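# Illustrative sketch only (an assumption for documentation, not the leaderboard's
# actual loader): if each model's results JSON has the shape
#   {"results": {"SAS": {"weighted_accuracy": 0.62}, ...}},
# then the Tasks enum above maps those raw entries onto leaderboard columns.
def load_task_scores(results: dict) -> dict[str, float]:
    """Return {display column name: score} for every task defined in Tasks."""
    return {
        task.value.col_name: results["results"][task.value.benchmark][task.value.metric]
        for task in Tasks
    }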

# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">NPHardEval leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
<div align="center">
    <img 
        src="https://raw.githubusercontent.com/casmlab/NPHardEval/main/figure/NPHardEval_text_right.png"
        style="width: 80%;"
        alt="Selected problems and the Euler diagram of computational complexity classes"
    >
</div>
NPHardEval serves as a comprehensive benchmark for assessing the reasoning abilities of large language models (LLMs) through the lens of computational complexity classes.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
The paramount importance of complex reasoning in Large Language Models (LLMs) is well-recognized,
especially in their application to intricate decision-making tasks. This underscores the necessity
of thoroughly investigating LLMs' reasoning capabilities. To this end, various benchmarks have been
developed to evaluate these capabilities. However, existing benchmarks fall short in providing a
comprehensive assessment of LLMs' potential in reasoning. Additionally, there is a risk of overfitting,
as these benchmarks are static and publicly accessible, allowing models to tailor responses to specific
metrics, thus artificially boosting their performance.

In response, our research introduces 'NPHardEval,' a novel benchmark meticulously designed to
comprehensively evaluate LLMs' reasoning abilities. It comprises a diverse array of 900 algorithmic
questions, spanning the spectrum up to NP-Hard complexity. These questions are strategically selected
to cover a vast range of complexities, ensuring a thorough evaluation of LLMs' reasoning power. This
benchmark not only offers insights into the current state of reasoning in LLMs but also establishes
a benchmark for comparing LLMs' performance across various complexity classes.

[Our repository](https://github.com/casmlab/NPHardEval) contains datasets, data generation scripts, and experimental procedures designed to evaluate LLMs in various reasoning tasks.
In particular, we use three complexity classes to define task complexity in the benchmark: P (polynomial time), NP-complete (nondeterministic polynomial-time complete),
and NP-hard, in increasing order of both intrinsic difficulty and the resources required to solve them. The nine selected problems are:
  1) P problems: Sorted Array Search (SAS), Edit Distance Problem (EDP), Shortest Path Problem (SPP);
  2) NP-complete problems: Traveling Salesman Problem Decision Version (TSP-D), Graph Coloring Problem Decision Version (GCP-D), and Knapsack Problem (KSP);
  3) NP-hard problems:  Traveling Salesman Problem Optimization Version (TSP), Graph Coloring Problem Optimization Version (GCP), and Meeting Scheduling Problem (MSP).

The following Euler diagram shows how these problems relate in terms of computational complexity.

<div align="center">
    <img 
        src="https://raw.githubusercontent.com/casmlab/NPHardEval/main/figure/NP-hard.jpg"
        style="width: 50%;"
        alt="Selected problems and the Euler diagram of computational complexity classes"
    >
</div>

Our benchmark offers several advantages compared with current benchmarks:
- Data construction grounded in the established computational complexity hierarchy
- Automatic checking mechanisms
- Automatic generation of datapoints
- A complete focus on reasoning, excluding numerical computation

Our study marks a significant contribution to understanding LLMs' current reasoning capabilities
and paves the way for future enhancements. Furthermore, NPHardEval features a dynamic update mechanism,
refreshing data points monthly. This approach is crucial in reducing the risk of model overfitting,
leading to a more accurate and dependable evaluation of LLMs' reasoning skills. The benchmark dataset
and the associated code are accessible at the [NPHardEval GitHub repository](https://github.com/casmlab/NPHardEval).

## Quick Start
### Environment setup
```bash
conda create --name llm_reason python=3.10
conda activate llm_reason
git clone https://github.com/casmlab/NPHardEval.git
cd NPHardEval
pip install -r requirements.txt
```

### Set-up API keys
Please set up your API keys in `secrets.txt`. **Please don't directly upload your keys to any public repository.**

### Example Commands
Let's use the GPT-4 Turbo model (`gpt-4-1106-preview`) and the Graph Coloring Problem (GCP) as an example.

For its zeroshot experiment, you can use:
```bash
cd run
cd run_close_zeroshot
python run_hard_GCP.py gpt-4-1106-preview
```

For its fewshot experiment, you can use:
```bash
cd run
cd run_close_fewshot
python run_hard_GCP.py gpt-4-1106-preview self
```
"""

EVALUATION_QUEUE_TEXT = """
Currently, we don't support the submission of new evaluations.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results."

CITATION_BUTTON_TEXT = r"""
@misc{fan2023nphardeval,
      title={NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes}, 
      author={Lizhou Fan and Wenyue Hua and Lingyao Li and Haoyang Ling and Yongfeng Zhang and Libby Hemphill},
      year={2023},
      eprint={2312.14890},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
"""

COMMENT_BUTTON_TEXT = """
We refer to the following sources for the model parameter counts. Contact us if you find any issues.
- https://grabon.com/blog/claude-users-statistics/
- https://medium.com/@seanbetts/peering-inside-gpt-4-understanding-its-mixture-of-experts-moe-architecture-2a42eb8bdcb3
- https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
"""