Raincleared committed
Commit 48ed02e · 1 Parent(s): cab5d7d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +1 -11
README.md CHANGED
```diff
@@ -66,16 +66,6 @@ The 7B model is trained on 8 A100 GPUs. The learning rate (LR) is controlled by
 | 4 | \\(5e-1\\) | 16,000 | 33.55 |
 | 5 | \\(5e-1\\) | 16,500 | 34.60 |
 
-### Evaluation Benckmarks
-
-- **Code Generation**: We compute the average pass@1 scores on HumanEval (0-shot) and MBPP (3-shot).
-
-- **Commonsense Reasoning**: We report the average 0-shot perplexity (PPL) on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.
-
-- **Reading Comprehension**: We compute the average 0-shot PPL on BoolQ, 0-shot accuracy on LAMBADA and TyDi QA.
-
-- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and the average PPL on AGI-Eval (0-shot).
-
 ### Evaluation Results
 
 The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Our evaluation is based on the framework [UltraEval](https://github.com/OpenBMB/UltraEval). The evaluation details are listed as follows:
@@ -86,7 +76,7 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
 
 - **Reading Comprehension**: We compute the average 0-shot accuracies on BoolQ, 0-shot accuracy on LAMBADA and TyDi QA.
 
-- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot). Refer to Appendix~\ref{sec:eval-details} for more details.
+- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot).
 
 **Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
 
```
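The answer-selection protocol described in the **Notes** above can be illustrated with a short sketch. This is not the UltraEval implementation; it is a minimal, hypothetical example of perplexity-based (maximum-likelihood) multiple-choice scoring written against the Hugging Face `transformers` API, with the checkpoint path left as a placeholder.

```python
# Minimal sketch (assumed, not the UltraEval code) of perplexity-based answer
# selection: for benchmarks such as PIQA or BoolQ, the candidate option whose
# tokens receive the lowest loss (i.e. maximized likelihood, minimized PPL)
# is taken as the predicted answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/prosparse-llama-2-7b"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True).eval()

def pick_answer(prompt: str, options: list[str]) -> str:
    """Return the option whose tokens get the lowest average loss (lowest PPL)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    losses = []
    for option in options:
        ids = tokenizer(prompt + option, return_tensors="pt").input_ids
        labels = ids.clone()
        labels[:, :prompt_len] = -100  # ignore prompt tokens; score only the option
        with torch.no_grad():
            losses.append(model(ids, labels=labels).loss.item())
    return options[losses.index(min(losses))]

# Hypothetical usage:
# pick_answer("Question: Is the sky blue on a clear day? Answer:", [" yes", " no"])
```

The prompt tokens are masked with `-100` so that only the candidate answer's tokens contribute to the loss; `exp(loss)` is the corresponding perplexity, so picking the minimum loss is the "maximized perplexity"-based selection the notes refer to (i.e. choosing the most likely option).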