from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Init: to update with your specific keys
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("SAS", "weighted_accuracy", "SAS")
    task1 = Task("SPP", "weighted_accuracy", "SPP")
    task2 = Task("EDP", "weighted_accuracy", "EDP")
    task3 = Task("TSP_D", "weighted_accuracy", "TSP_D")
    task4 = Task("GCP_D", "weighted_accuracy", "GCP_D")
    task5 = Task("KSP", "weighted_accuracy", "KSP")
    task6 = Task("TSP", "weighted_accuracy", "TSP")
    task7 = Task("GCP", "weighted_accuracy", "GCP")
    task8 = Task("MSP", "weighted_accuracy", "MSP")
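
# Illustrative sketch (not part of the upstream leaderboard code): one way the app
# could read a model's per-task scores using the Task definitions above, assuming
# each results JSON maps the task key to a dict of metrics. The function name and
# the results schema are assumptions made for illustration only.
def extract_task_scores(results: dict) -> dict:
    # Map each task's display column to its weighted accuracy, if reported.
    scores = {}
    for task in Tasks:
        t = task.value
        scores[t.col_name] = results.get(t.benchmark, {}).get(t.metric)
    return scores
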
# Your leaderboard name
TITLE = """
NPHardEval leaderboard
"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
NPHardEval serves as a comprehensive benchmark for assessing the reasoning abilities of large language models (LLMs) through the lens of computational complexity classes.
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
The paramount importance of complex reasoning in Large Language Models (LLMs) is well-recognized,
especially in their application to intricate decision-making tasks. This underscores the necessity
of thoroughly investigating LLMs' reasoning capabilities. To this end, various benchmarks have been
developed to evaluate these capabilities. However, existing benchmarks fall short in providing a
comprehensive assessment of LLMs' potential in reasoning. Additionally, there is a risk of overfitting,
as these benchmarks are static and publicly accessible, allowing models to tailor responses to specific
metrics, thus artificially boosting their performance.
In response, our research introduces 'NPHardEval,' a novel benchmark meticulously designed to
comprehensively evaluate LLMs' reasoning abilities. It comprises a diverse array of 900 algorithmic
questions, spanning the spectrum up to NP-Hard complexity. These questions are strategically selected
to cover a vast range of complexities, ensuring a thorough evaluation of LLMs' reasoning power. This
benchmark not only offers insights into the current state of reasoning in LLMs but also establishes
a baseline for comparing LLMs' performance across various complexity classes.
[Our repository](https://github.com/casmlab/NPHardEval) contains datasets, data generation scripts, and experimental procedures designed to evaluate LLMs in various reasoning tasks.
In particular, we use three complexity classes to define task complexity in the benchmark: P (polynomial time), NP-complete (nondeterministic polynomial-time complete),
and NP-hard, in increasing order of both intrinsic difficulty and the resources needed to solve them. The nine selected problems are:
1) P problems: Sorted Array Search (SAS), Edit Distance Problem (EDP), Shortest Path Problem (SPP);
2) NP-complete problems: Traveling Salesman Problem Decision Version (TSP-D), Graph Coloring Problem Decision Version (GCP-D), and Knapsack Problem (KSP);
3) NP-hard problems: Traveling Salesman Problem Optimization Version (TSP), Graph Coloring Problem Optimization Version (GCP), and Meeting Scheduling Problem (MSP).
An Euler diagram in our repository illustrates how these complexity classes relate to one another.
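To make the difference between the decision and optimization versions concrete, here is a small illustrative brute-force sketch (not the benchmark's generator, solver, or checker; `tour_cost`, `tsp_optimal`, and `tsp_decision` are hypothetical names):
```python
from itertools import permutations

# Hypothetical brute force on a tiny 4-city instance (symmetric distances)
def tour_cost(dist, order):
    # Cost of visiting cities in `order` and returning to the start
    return sum(dist[order[i]][order[(i + 1) % len(order)]] for i in range(len(order)))

def tsp_optimal(dist):
    # TSP (optimization): cost of the cheapest tour, found by exhaustive search
    n = len(dist)
    return min(tour_cost(dist, (0,) + p) for p in permutations(range(1, n)))

def tsp_decision(dist, budget):
    # TSP-D (decision): does a tour with cost <= budget exist?
    return tsp_optimal(dist) <= budget

dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(tsp_optimal(dist))       # 18
print(tsp_decision(dist, 20))  # True
```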
Our benchmark offers several advantages compared with current benchmarks:
- Data construction grounded in the established computational complexity hierarchy
- Automatic checking mechanisms (see the sketch below)
- Automatic generation of datapoints
- Complete focus on reasoning, excluding numerical computation
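For the automatic checking mentioned above, here is a minimal illustrative sketch (not the repository's actual checker; `is_valid_coloring` is a hypothetical helper) of what validating a proposed graph coloring could look like:
```python
# Hypothetical check: does `coloring` use at most k colors and assign
# different colors to the endpoints of every edge?
def is_valid_coloring(edges, coloring, k):
    # edges: list of (u, v) pairs; coloring[v] is the color of vertex v
    if any(c < 0 or c >= k for c in coloring):
        return False
    return all(coloring[u] != coloring[v] for u, v in edges)

# A triangle cannot be colored with 2 colors, but 3 colors suffice.
triangle = [(0, 1), (1, 2), (0, 2)]
print(is_valid_coloring(triangle, [0, 1, 0], 2))  # False
print(is_valid_coloring(triangle, [0, 1, 2], 3))  # True
```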
Our study marks a significant contribution to understanding LLMs' current reasoning capabilities
and paves the way for future enhancements. Furthermore, NPHardEval features a dynamic update mechanism,
refreshing data points monthly. This approach is crucial in reducing the risk of model overfitting,
leading to a more accurate and dependable evaluation of LLMs' reasoning skills. The benchmark dataset
and the associated code are accessible at the [NPHardEval GitHub Repository](https://github.com/casmlab/NPHardEval).
## Quick Start
### Environment setup
```bash
conda create --name llm_reason python=3.10
conda activate llm_reason
git clone https://github.com/casmlab/NPHardEval.git
cd NPHardEval
pip install -r requirements.txt
```
### Set up API keys
Please set up your API keys in `secrets.txt`. **Please don't directly upload your keys to any public repository.**
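If your working copy lives in a public fork, a simple precaution (general git hygiene, not a step required by NPHardEval) is to make sure the key file can never be committed:
```bash
# General precaution: keep the key file out of version control
echo "secrets.txt" >> .gitignore
```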
### Example Commands
Let's use the GPT-4 Turbo model (`gpt-4-1106-preview`) and the Graph Coloring Problem (GCP) as an example.
For its zeroshot experiment, you can use:
```bash
cd run
cd run_close_zeroshot
python run_hard_GCP.py gpt-4-1106-preview
```
For its fewshot experiment, you can use:
```bash
cd run
cd run_close_fewshot
python run_hard_GCP.py gpt-4-1106-preview self
```
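To sweep several models over the same task, a plain shell loop is enough, assuming each run script takes the model name as its positional argument as in the commands above (the second model name below is just a placeholder):
```bash
# Hypothetical sweep over two models for the zeroshot GCP experiment
cd run/run_close_zeroshot
for model in gpt-4-1106-preview gpt-3.5-turbo-1106; do
    python run_hard_GCP.py "$model"
done
```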
"""
EVALUATION_QUEUE_TEXT = """
Currently, we don't support the submission of new evaluations.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{fan2023nphardeval,
    title={NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes},
    author={Lizhou Fan and Wenyue Hua and Lingyao Li and Haoyang Ling and Yongfeng Zhang and Libby Hemphill},
    year={2023},
    eprint={2312.14890},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
"""