Raincleared committed
Commit: d040dc2
1 Parent(s): d9c623b

Update README.md

Files changed (1):
  1. README.md (+18 −17)
README.md CHANGED
@@ -6,6 +6,7 @@ tags:
  - MiniCPM
  - ModelBest
  - THUNLP
+ license: apache-2.0
  ---
  
  
@@ -25,7 +26,7 @@ In this work, we introduce a simple and effective sparsification method named "P
  
  ### Training Dataset
  
- We train the 1B model on about 473.02 billion tokens within 101,000 steps. These consist of 35,000 steps for standard ProSparse pre-training, 6,000 steps for decay, and 6,000 steps for SFT. Except for ProSparse, other training settings are highly consistent with the original [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). Refer to our [paper](https://arxiv.org/pdf/2402.13516.pdf) and [MiniCPM technical report](https://arxiv.org/pdf/2404.06395) for more details.
+ We train the 1B model on about 473.02 billion tokens within 101,000 steps. These consist of 35,000 steps for standard ProSparse pre-training, 60,000 steps for decay, and 6,000 steps for SFT. Except for ProSparse, other training settings are highly consistent with the original [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). Refer to our [paper](https://arxiv.org/pdf/2402.13516.pdf) and [MiniCPM technical report](https://arxiv.org/pdf/2404.06395) for more details.
  
  Intuitively, training the model with even more tokens or with data of a wider coverage and higher quality will obtain better task-specific performance.
  
@@ -46,8 +47,8 @@ The hyper-parameters for each stage (including the regularization factor \\(\lam
  | 2 | \\(5e-3\\) | 20,000 | 98.30 |
  | 3 | \\(5e-3\\) | 25,000 | 122.88 |
  | 4 | \\(5e-2\\) | 35,000 | 172.03 |
- | decay | \\(5e-2\\)(fixed) | 95,000 | 466.94 |
- | SFT | \\(1e-2\\)(fixed) | 101,000 | 473.02 |
+ | decay | \\(5e-2\\) (fixed) | 95,000 | 466.94 |
+ | SFT | \\(1e-2\\) (fixed) | 101,000 | 473.02 |
  
  ### Evaluation Results
  
@@ -63,19 +64,19 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
  
  **Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
  
- | Setting | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average<br>Performance | Average<br>Sparsity |
+ | Setting | Average<br>Sparsity | Average<br>Performance | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval |
  | :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: |
- | LLaMA2-7B | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | 37.96 | - |
- | ReluLLaMA-7B | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 | 66.98 |
- | **ProSparse-7B**\* | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 | 38.31 | 88.11 |
- | **ProSparse-7B** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 | **38.46** | **89.32** |
- | LLaMA2-13B | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 | 44.06 | - |
- | ReluLLaMA-13B | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 | 42.74 | 71.56 |
- | **ProSparse-13B**\* | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 | **45.07** | 87.97 |
- | **ProSparse-13B** | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 | 44.90 | **88.80** |
- | MiniCPM-1B | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 | 44.44 | - |
- | **ProSparse-1B**\* | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 | **44.72** | 86.25 |
- | **ProSparse-1B** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 | **44.72** | **87.89** |
+ | LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
+ | ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
+ | **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
+ | **ProSparse-7B** | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
+ | LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
+ | ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
+ | **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
+ | **ProSparse-13B** | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
+ | MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
+ | **ProSparse-1B**\* | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
+ | **ProSparse-1B** | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
  
  **Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. MiniCPM-1B is available at [1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). "ProSparse-7B\*", "ProSparse-13B\*", and "ProSparse-1B\*" denote the ProSparse versions without activation threshold shifting.
  
@@ -114,7 +115,7 @@ where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) de
  
  The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.
  
- | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time | Speedup<br>to Dense | `S3`<br/>Time | Speedup<br/>to Dense |
+ | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time \\((\downarrow)\\) | Speedup<br>to Dense | `S3`<br/>Time \\((\downarrow)\\) | Speedup<br/>to Dense |
  | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
  | Dense-7B | - | - | - | 3.67 | 1.00 | 90.55 | 1.00 | 82.92 | 1.00 |
  | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 3.10 | 67.12 | 1.35 | 63.00 | 1.32 |
@@ -163,4 +164,4 @@ Therefore, when using content generated by MiniCPM, users should take full respo
  
  #### Acknowledgments
  
- The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16).
+ The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16).
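
The substantive fix in this commit is the decay step count (6,000 → 60,000). A quick sanity check, not part of the model card itself, derives the per-stage step counts from the cumulative steps in the hyper-parameter table and confirms the corrected breakdown of 35,000 + 60,000 + 6,000 = 101,000 steps:

```python
# Cumulative schedule as given in the hyper-parameter table:
# (stage, cumulative steps, accumulated tokens in billions).
schedule = [
    ("pre-training (through stage 4)", 35_000, 172.03),
    ("decay", 95_000, 466.94),
    ("SFT", 101_000, 473.02),
]

# Per-stage step counts are differences of consecutive cumulative values.
prev = 0
per_stage = {}
for stage, cum_steps, cum_tokens in schedule:
    per_stage[stage] = cum_steps - prev
    prev = cum_steps

print(per_stage)
# {'pre-training (through stage 4)': 35000, 'decay': 60000, 'SFT': 6000}
```

The same differencing applied to the token column gives the tokens consumed by each stage (e.g. 473.02 − 466.94 = 6.08 billion tokens during SFT).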
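
As a further illustrative check (an inference from the numbers, not stated explicitly in the card), the "Speedup to Dense" columns in the acceleration table are consistent with dividing the Dense-7B step time by the sparse setting's time:

```python
# `S2` and `S3` times from the acceleration table (ReluLLaMA-7B row vs. Dense-7B row).
dense_s2, dense_s3 = 90.55, 82.92   # Dense-7B
relu_s2, relu_s3 = 67.12, 63.00     # ReluLLaMA-7B

# Speedup = dense time / sparse time, rounded to two decimals as in the table.
print(round(dense_s2 / relu_s2, 2))  # 1.35
print(round(dense_s3 / relu_s3, 2))  # 1.32
```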