add paper
text_content.py  +2 -2
text_content.py
CHANGED
@@ -1,7 +1,7 @@
 HEAD_TEXT = """
 This is the official leaderboard for the 🏅StructEval benchmark. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust, and consistent evaluation for LLMs.

-Please refer to the 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper]() for experimental analysis.
+Please refer to the 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper](https://arxiv.org/abs/2408.03281) for experimental analysis.

 🚀 **_Latest News_**
 * [2024.8.6] We released the first version of the StructEval leaderboard, which includes 22 open-source language models; more datasets and models are coming soon🔥🔥🔥.
@@ -37,6 +37,6 @@ Inspired from the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/Hugg

 NOTES_TEXT = """
 * Base benchmark refers to the original dataset, while struct benchmarks refer to the benchmarks constructed by StructEval using these base benchmarks as seed data.
-* For most models on base MMLU, we collected the results from their official technical reports. For models whose results have not been reported, we use opencompass for evaluation.
+* For most models on base MMLU, we collected the results from their official technical reports. For models whose results have not been reported, we use [opencompass](https://opencompass.org.cn/home) for evaluation.
 * For the other 2 base benchmarks and all 3 structured benchmarks: for chat models, we evaluate under the 0-shot setting; for completion models, we evaluate under the 0-shot setting with PPL. We keep the prompt format consistent across all benchmarks.
 """
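For reference, the "0-shot setting with PPL" mentioned in the notes scores each answer option by the likelihood a completion model assigns to it and picks the most likely one. Below is a minimal sketch of that idea, not the StructEval evaluation harness: the model name, prompt text, and `option_nll` helper are placeholders for illustration only.

```python
# Sketch of 0-shot PPL-style multiple-choice scoring for a completion model.
# Assumptions: generic Hugging Face causal LM; prompt/option format is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not one of the leaderboard models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_nll(question: str, option: str) -> float:
    """Average negative log-likelihood of the option tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position t predict token t+1, so shift targets by one
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the log-likelihoods of the option tokens (after the prompt)
    option_ll = token_ll[:, prompt_ids.shape[1] - 1:]
    return -option_ll.mean().item()

question = "The capital of France is"
options = ["Paris.", "London.", "Berlin.", "Madrid."]
# the option with the lowest average NLL (lowest perplexity) is the prediction
prediction = min(options, key=lambda o: option_nll(question, o))
print(prediction)
```

Chat models, by contrast, can be prompted to output the chosen option directly, which is why the notes distinguish the two settings.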