apsys committed
Commit be62b41
1 Parent(s): 0db2af0

Update src/display/about.py

Files changed (1)
  1. src/display/about.py +27 -60
src/display/about.py CHANGED
@@ -13,81 +13,48 @@ icons = f"""
  - {ModelType.merges.to_str(" : ")} model: merges or MoErges, models which have been merged or fused without additional fine-tuning.
  """
  LLM_BENCHMARKS_TEXT = """
- ## ABOUT
- With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art.
-
- 🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
- The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details below!
-
- ### Tasks
- 📈 We evaluate models on 6 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
-
- - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
- - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
-
- For all these evaluations, a higher score is better.
- We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
-
- ### Results
- You can find:
- - detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/results
- - details on the input/outputs for the models in the `details` of each model, which you can access by clicking the 📄 emoji after the model name
- - community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/requests
-
- If a model's name contains "Flagged", this indicates it has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
-
- ---------------------------
-
- ## REPRODUCIBILITY
- To reproduce our results, here are the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
- `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
- ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
-
- ```
- python main.py --model=hf-causal-experimental \
- --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" \
- --tasks=<task_list> \
- --num_fewshot=<n_few_shot> \
- --batch_size=1 \
- --output_path=<output_path>
- ```
-
- **Note:** We evaluate all models on a single node of 8 H100s, so the global batch size is 8 for each evaluation. If you don't use parallelism, adapt your batch size to fit.
- *You can expect results to vary slightly for different batch sizes because of padding.*
-
- The tasks and few-shot parameters are:
- - ARC: 25-shot, *arc-challenge* (`acc_norm`)
- - HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
- - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
- - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
- - Winogrande: 5-shot, *winogrande* (`acc`)
- - GSM8k: 5-shot, *gsm8k* (`acc`)
-
- Side note on the baseline scores:
- - for log-likelihood evaluation, we select the random baseline
- - for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
-
- ---------------------------
-
- ## RESOURCES
-
- ### Quantization
- To get more information about quantization, see:
- - 8 bits: [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), [paper](https://arxiv.org/abs/2208.07339)
- - 4 bits: [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), [paper](https://arxiv.org/abs/2305.14314)
-
- ### Useful links
- - [Community resources](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/174)
- - [Collection of best models](https://huggingface.co/collections/open-llm-leaderboard/llm-leaderboard-best-models-652d6c7965a4619fb5c27a03)
-
- ### Other cool leaderboards:
- - [LLM safety](https://huggingface.co/spaces/AI-Secure/llm-trustworthy-leaderboard)
- - [LLM performance](https://huggingface.co/spaces/optimum/llm-perf-leaderboard)
+ Маленький Шлепа is a benchmark for LLMs with multiple-choice (multichoice) tasks on the following topics:
+ - Laws of the Russian Federation (lawmc)
+ - Popular music (musicmc)
+ - Books (bookmc)
+ - Movies (moviemc)
+
+ Each task contains 12 answer options.
+
+ ## Usage instructions
+
+ ### Installation
+
+ To install the required library, run the following command:
+
+ ```bash
+ pip install git+https://github.com/VikhrModels/lm_eval_mc.git --upgrade --force-reinstall --no-deps
+ ```
+
+ ### Running
+
+ To run the benchmark, use the following command (the `--tasks` list must not be changed; we do not accept partial submissions):
+
+ ```bash
+ lm_eval \
+ --model hf \
+ --model_args pretrained={your model, in transformers format},dtype=float16 \
+ --device 0 \
+ --batch_size 4 \
+ --tasks musicmc,moviemc,bookmc,lawmc \
+ --output_path output/{results folder}
+ ```
+
+ ### Results
+
+ After running the command above, a JSON file will be created in the `output` directory; attach this file to your submission. It contains the task results and a description of the session, and it **must not be modified**.
+
+ ## Anti-cheating policy
+
+ If we detect cheating or attempts to modify the output file, we reserve the right to remove your submission.
+
+ Thank you for participating!
  """
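The run instructions added above end with a results JSON that must be attached unmodified. As a pre-submission sanity check, a minimal sketch along the following lines can locate that file and confirm that all four tasks were actually run. The `results` key and the `acc*` metric names are assumptions based on the standard lm-evaluation-harness v0.4 output layout that the VikhrModels fork appears to build on; adjust them if the fork's format differs.

```python
# Minimal pre-submission check (not part of the commit): locate the lm_eval results
# JSON under output/ and print the per-task scores. The "results" key and the
# "acc*" metric names are assumptions based on lm-evaluation-harness v0.4 output;
# the VikhrModels fork may use slightly different names.
import glob
import json

REQUIRED_TASKS = {"musicmc", "moviemc", "bookmc", "lawmc"}

# lm_eval typically writes results_<timestamp>.json inside the chosen output folder.
paths = sorted(glob.glob("output/**/*.json", recursive=True))
assert paths, "no results JSON found under output/"

with open(paths[-1], encoding="utf-8") as f:
    report = json.load(f)

per_task = report.get("results", {})
missing = REQUIRED_TASKS - set(per_task)
assert not missing, f"partial submissions are not accepted; missing tasks: {sorted(missing)}"

for task, metrics in sorted(per_task.items()):
    # Print whatever accuracy-like metrics the harness recorded for the task.
    accs = {name: value for name, value in metrics.items() if name.startswith("acc")}
    print(task, accs)
```

Reading the file this way is fine; the anti-cheating policy above only requires that the attached copy itself is not edited.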