Raincleared committed
Commit: d040dc2
1 Parent(s): d9c623b

Update README.md

Files changed (1):
  1. README.md (+18 −17)
README.md CHANGED
@@ -6,6 +6,7 @@ tags:
  - MiniCPM
  - ModelBest
  - THUNLP
+ license: apache-2.0
  ---
  
  
@@ -25,7 +26,7 @@ In this work, we introduce a simple and effective sparsification method named "P
  
  ### Training Dataset
  
- We train the 1B model on about 473.02 billion tokens within 101,000 steps. These consist of 35,000 steps for standard ProSparse pre-training, 6,000 steps for decay, and 6,000 steps for SFT. Except for ProSparse, other training settings are highly consistent with the original [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). Refer to our [paper](https://arxiv.org/pdf/2402.13516.pdf) and [MiniCPM technical report](https://arxiv.org/pdf/2404.06395) for more details.
+ We train the 1B model on about 473.02 billion tokens within 101,000 steps. These consist of 35,000 steps for standard ProSparse pre-training, 60,000 steps for decay, and 6,000 steps for SFT. Except for ProSparse, other training settings are highly consistent with the original [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). Refer to our [paper](https://arxiv.org/pdf/2402.13516.pdf) and [MiniCPM technical report](https://arxiv.org/pdf/2404.06395) for more details.
  
  Intuitively, training the model with even more tokens or with data of a wider coverage and higher quality will obtain better task-specific performance.
  
@@ -46,8 +47,8 @@ The hyper-parameters for each stage (including the regularization factor \\(\lam
  | 2 | \\(5e-3\\) | 20,000 | 98.30 |
  | 3 | \\(5e-3\\) | 25,000 | 122.88 |
  | 4 | \\(5e-2\\) | 35,000 | 172.03 |
- | decay | \\(5e-2\\)(fixed) | 95,000 | 466.94 |
- | SFT | \\(1e-2\\)(fixed) | 101,000 | 473.02 |
+ | decay | \\(5e-2\\) (fixed) | 95,000 | 466.94 |
+ | SFT | \\(1e-2\\) (fixed) | 101,000 | 473.02 |
  
  ### Evaluation Results
  
@@ -63,19 +64,19 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
  
  **Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
  
- | Setting | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average<br>Performance | Average<br>Sparsity |
+ | Setting | Average<br>Sparsity | Average<br>Performance | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval |
  | :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: |
- | LLaMA2-7B | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | 37.96 | - |
- | ReluLLaMA-7B | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 | 66.98 |
- | **ProSparse-7B**\* | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 | 38.31 | 88.11 |
- | **ProSparse-7B** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 | **38.46** | **89.32** |
- | LLaMA2-13B | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 | 44.06 | - |
- | ReluLLaMA-13B | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 | 42.74 | 71.56 |
- | **ProSparse-13B**\* | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 | **45.07** | 87.97 |
- | **ProSparse-13B** | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 | 44.90 | **88.80** |
- | MiniCPM-1B | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 | 44.44 | - |
- | **ProSparse-1B**\* | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 | **44.72** | 86.25 |
- | **ProSparse-1B** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 | **44.72** | **87.89** |
+ | LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
+ | ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
+ | **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
+ | **ProSparse-7B** | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
+ | LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
+ | ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
+ | **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
+ | **ProSparse-13B** | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
+ | MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
+ | **ProSparse-1B**\* | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
+ | **ProSparse-1B** | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
  
  **Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. MiniCPM-1B is available at [1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). "ProSparse-7B\*", "ProSparse-13B\*", and "ProSparse-1B\*" denote the ProSparse versions without activation threshold shifting.
  
@@ -114,7 +115,7 @@ where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) de
  
  The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.
  
- | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time | Speedup<br>to Dense | `S3`<br/>Time | Speedup<br/>to Dense |
+ | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time \\((\downarrow)\\) | Speedup<br>to Dense | `S3`<br/>Time \\((\downarrow)\\) | Speedup<br/>to Dense |
  | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
  | Dense-7B | - | - | - | 3.67 | 1.00 | 90.55 | 1.00 | 82.92 | 1.00 |
  | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 3.10 | 67.12 | 1.35 | 63.00 | 1.32 |
@@ -163,4 +164,4 @@ Therefore, when using content generated by MiniCPM, users should take full respo
  
  #### Acknowledgments
  
- The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16).
+ The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16).
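
The substantive fix in this commit is the decay step count (6,000 → 60,000). A quick sanity check, not part of the model card itself, derives the per-stage step counts from the cumulative steps in the hyper-parameter table and confirms the corrected breakdown of 35,000 + 60,000 + 6,000 = 101,000 steps:

```python
# Cumulative schedule as given in the hyper-parameter table:
# (stage, cumulative steps, accumulated tokens in billions).
schedule = [
    ("pre-training (through stage 4)", 35_000, 172.03),
    ("decay", 95_000, 466.94),
    ("SFT", 101_000, 473.02),
]

# Per-stage step counts are differences of consecutive cumulative values.
prev = 0
per_stage = {}
for stage, cum_steps, cum_tokens in schedule:
    per_stage[stage] = cum_steps - prev
    prev = cum_steps

print(per_stage)
# {'pre-training (through stage 4)': 35000, 'decay': 60000, 'SFT': 6000}
```

The same differencing applied to the token column gives the tokens consumed by each stage (e.g. 473.02 − 466.94 = 6.08 billion tokens during SFT).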
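
As a further illustrative check (an inference from the numbers, not stated explicitly in the card), the "Speedup to Dense" columns in the acceleration table are consistent with dividing the Dense-7B step time by the sparse setting's time:

```python
# `S2` and `S3` times from the acceleration table (ReluLLaMA-7B row vs. Dense-7B row).
dense_s2, dense_s3 = 90.55, 82.92   # Dense-7B
relu_s2, relu_s3 = 67.12, 63.00     # ReluLLaMA-7B

# Speedup = dense time / sparse time, rounded to two decimals as in the table.
print(round(dense_s2 / relu_s2, 2))  # 1.35
print(round(dense_s3 / relu_s3, 2))  # 1.32
```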