malhajar committed on
Commit
77ff79b
1 Parent(s): 35243bf

Update src/display/about.py

Files changed (1)
  1. src/display/about.py +4 -5
src/display/about.py CHANGED
@@ -40,20 +40,19 @@ LLM_BENCHMARKS_TEXT = f"""
  ## How it works
 
  ## Reproducibility
- To reproduce my results, here is the commands you can run:
 
  I use LM-Evaluation-Harness-Turkish, a version of the LM Evaluation Harness adapted for Turkish datasets, to ensure our leaderboard results are both reliable and replicable. Please see https://github.com/malhajar17/lm-evaluation-harness_turkish for more information
 
  ## How to Reproduce Results:
 
- ### 1) Set Up the repo: Clone the "lm-evaluation-harness_turkish" @ https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions.
- ### 2) Run Evaluations: To get the results as on the leaderboard (Some tests might show small variations), use the following command, adjusting for your model. For example, with the Trendyol model:
+ # 1) Set Up the repo: Clone the "lm-evaluation-harness_turkish" from https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions.
+ # 2) Run Evaluations: To get the results as on the leaderboard (Some tests might show small variations), use the following command, adjusting for your model. For example, with the Trendyol model:
  ```python
  lm_eval --model vllm --model_args pretrained=Trendyol/Trendyol-LLM-7b-chat-v1.0 --tasks truthfulqa_mc2_tr,truthfulqa_mc1_tr,mmlu_tr,winogrande_tr,gsm8k_tr,arc_challenge_tr,hellaswag_tr --output /workspace/Trendyol-LLM-7b-chat-v1.0
  ```
- ### 3) Report Results: I take the average of truthfulqa_mc1_tr and truthfulqa_mc2_tr scores and report it as truthfulqa. The results file generated is then uploaded to the OpenLLM Turkish Leaderboard.
+ # 3) Report Results: I take the average of truthfulqa_mc1_tr and truthfulqa_mc2_tr scores and report it as truthfulqa. The results file generated is then uploaded to the OpenLLM Turkish Leaderboard.
 
- Notes:
+ ## Notes:
 
  - I currently use "vllm" which might differ slightly as per the LM Evaluation Harness.
  - All the tests are using "acc" as metric, with a plan to migrate to "acc_norm" for "ARC" and "Hellaswag" soon.
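As a rough illustration of the reporting step described in the text above (averaging truthfulqa_mc1_tr and truthfulqa_mc2_tr and reporting the result as truthfulqa), here is a minimal Python sketch. The results file path and the metric key names ("acc" vs "acc,none") are assumptions that depend on where the `--output` run writes its JSON and on the harness version; the commit itself does not specify them.

```python
import json

# Illustrative path: wherever the lm_eval run above wrote its results JSON.
RESULTS_PATH = "/workspace/Trendyol-LLM-7b-chat-v1.0/results.json"

def get_acc(task_results: dict) -> float:
    # The accuracy key differs across harness versions ("acc" vs "acc,none").
    for key in ("acc", "acc,none"):
        if key in task_results:
            return task_results[key]
    raise KeyError(f"no accuracy metric found among {sorted(task_results)}")

with open(RESULTS_PATH) as f:
    results = json.load(f)["results"]

# Report truthfulqa as the average of the mc1 and mc2 variants.
truthfulqa = (get_acc(results["truthfulqa_mc1_tr"]) + get_acc(results["truthfulqa_mc2_tr"])) / 2
print(f"truthfulqa (avg of mc1/mc2): {truthfulqa:.4f}")
```

The remaining task scores (mmlu_tr, winogrande_tr, gsm8k_tr, arc_challenge_tr, hellaswag_tr) can be read from the same "results" dict in the same way before uploading to the leaderboard.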