Commit 831e1f2 by xuanricheng (1 parent: 175efb2): update about

src/display/about.py CHANGED (+18 -11)
@@ -3,17 +3,25 @@ from src.display.utils import ModelType
 TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""
 
 INTRODUCTION_TEXT = """
-The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by FlagEval
+The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). The FlagEval platform provides the compute and runtime environment for this leaderboard.
 The evaluation datasets are all Chinese-language datasets, so that Chinese ability is what is assessed; for details, see the 'About' page.
-For a more comprehensive evaluation of a model, you can log in to the FlagEval platform to experience its more complete model-evaluation features.
+For a more comprehensive evaluation of a model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) platform to experience its more complete model-evaluation features.
 
-The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the
+The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the FlagEval platform, which provides the compute resources and runtime environment.
 The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
 For a more comprehensive evaluation of the model, you can log in to [FlagEval](https://flageval.baai.ac.cn/) to experience more complete model-evaluation features.
 
 """
 
 LLM_BENCHMARKS_TEXT = f"""
+
+# The Goal of Open CN-LLM Leaderboard
+
+Thank you for taking part in the evaluation. Going forward, we will keep improving the Open Chinese Leaderboard and keep its ecosystem open; developers are welcome to join the discussion of evaluation methods, tools, and datasets so that together we can build a more scientific and fair leaderboard.
+
+Thank you for actively participating in the evaluation. In the future, we will continue to enhance the Open Chinese Leaderboard, maintaining an open ecosystem.
+We welcome developers to engage in discussions regarding evaluation methods, tools, and datasets, aiming to collectively build a more scientific and fair leaderboard.
+
 # Context
 The Open Chinese LLM Leaderboard is a leaderboard for Chinese large language models. We hope to foster a more open ecosystem and to bring Chinese LLM developers into the effort of advancing Chinese large language models.
 To ensure fairness, all models are evaluated on the FlagEval platform using standardized GPUs and a unified environment.
@@ -23,7 +31,7 @@ In pursuit of fairness, all models undergo evaluation on the FlagEval platform u
 
 ## How it works
 
-
+We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
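The hunk above names the harness that drives these benchmark runs. As a rough sketch of what a single run looks like through the harness's Python entry point: the model name is a placeholder, and `simple_evaluate` with this signature is an assumption based on harness releases contemporary with the revision pinned below, not code from this commit.

```python
# Hypothetical single-benchmark run via the Eleuther AI harness Python API.
# "your-org/your-model" is a placeholder; simple_evaluate and its arguments
# are assumed from harness versions of this era.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=your-org/your-model,use_accelerate=True",
    tasks=["arc_challenge"],  # ARC Challenge, from the benchmark list above
    num_fewshot=25,           # the 25-shot setting stated for ARC Challenge
    batch_size=1,
)
print(results["results"])
```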
@@ -38,16 +46,15 @@ We chose these benchmarks as they test a variety of reasoning and general knowle
 
 ## Details and logs
 You can find:
-- detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/results
-
-- community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/requests
+- detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/results
+- community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/requests
 
 ## Reproducibility
 To reproduce our results, here are the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
 `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
 ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
 
-The total batch size we get for models which fit on one
+The total batch size we get for models which fit on one A800 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit.
 *You can expect results to vary slightly for different batch sizes because of padding.*
 
 The tasks and few shots parameters are:
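To poke at the raw files behind the `results` and `requests` links in this hunk, one plausible route (an assumption, not something the commit prescribes) is to snapshot the dataset repos with `huggingface_hub`:

```python
# Pull the leaderboard's result files locally for inspection.
# The repo id comes from the links above; the download route is an assumption.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="open-cn-llm-leaderboard/results",
    repo_type="dataset",  # these links point at dataset repos
)
print(f"results snapshot at {local_dir}")
```

The same call with `repo_id="open-cn-llm-leaderboard/requests"` would fetch the queue files.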
@@ -136,9 +143,9 @@ I have an issue about accessing the leaderboard through the Gradio API
 
 
 EVALUATION_QUEUE_TEXT = """
-# Evaluation Queue for
+# Evaluation Queue for the Open Chinese LLM Leaderboard
 
-Models added here will be automatically evaluated on the
+Models added here will be automatically evaluated on the FlagEval cluster.
 
 ## First steps before submitting a model
 
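Submission checklists of this kind usually begin by asking that the checkpoint load with plain transformers AutoClasses. A minimal self-check along those lines might look like the sketch below; the model id and revision are placeholders, and this is an illustration rather than the leaderboard's own validation code.

```python
# Pre-submission sanity check: confirm the checkpoint loads with AutoClasses.
# "your-org/your-model" and the revision are placeholders.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

name, revision = "your-org/your-model", "main"
config = AutoConfig.from_pretrained(name, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(name, revision=revision)
model = AutoModelForCausalLM.from_pretrained(name, revision=revision)
print(f"loaded {name}: {model.num_parameters():,} parameters")
```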
@@ -158,7 +165,7 @@ Note: if your model needs `use_remote_code=True`, we do not support this option
 It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
 
 ### 3) Make sure your model has an open license!
-This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model
+This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model
 
 ### 4) Fill up your model card
 When we add extra information about models to the leaderboard, it will be automatically taken from the model card
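The weights-format note at the top of this hunk refers to safetensors; re-saving a standard transformers checkpoint with safe serialization is one way to produce it. A minimal sketch, assuming the checkpoint loads with AutoClasses and using placeholder paths:

```python
# Re-save an existing checkpoint in safetensors format (paths are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer

src, dst = "path/to/your-model", "path/to/your-model-safetensors"
model = AutoModelForCausalLM.from_pretrained(src)
tokenizer = AutoTokenizer.from_pretrained(src)

model.save_pretrained(dst, safe_serialization=True)  # writes model.safetensors
tokenizer.save_pretrained(dst)
```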