SparseLLM/prosparse-llama-2-7b

@@ -85,15 +85,15 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
 |     ReluLLaMA-7B      |        66.98        |       15.85        |          69.64           |          70.54           | 5.84  | 38.64 | 35.07 |    27.73    |  37.62  |
 |    Vanilla ReLU-7B    |        66.04        |       21.31        |          70.73           |          73.22           | 11.22 | 49.22 | 36.11 |    28.01    |  41.40  |
 |    Shifted ReLU-7B    |        69.59        |       20.50        |          70.09           |          73.17           | 13.87 | 48.54 | 35.20 |    27.94    |  41.33  |
-|    Fixed $L_1$-7B     |        91.46        |       18.85        |          66.01           |          55.39           | 2.27  | 32.28 | 31.40 |    26.48    |  33.24  |
-| **ProSparse-7B**$^*$  |        88.11        |       19.47        |          66.29           |          63.33           | 12.74 | 45.21 | 33.59 |    27.55    |  38.31  |
 |   **ProSparse-7B**    |        89.32        |       19.42        |          66.27           |          63.50           | 12.13 | 45.48 | 34.99 |    27.46    |  38.46  |
 |     Original-13B      |          -          |       20.19        |          72.58           |          71.55           | 22.21 | 54.69 | 37.89 |    29.33    |  44.06  |
 |     ReluLLaMA-13B     |        71.56        |       20.19        |          70.44           |          73.29           | 18.50 | 50.58 | 37.97 |    28.22    |  42.74  |
-| **ProSparse-13B**$^*$ |        87.97        |       29.03        |          69.75           |          67.54           | 25.40 | 54.78 | 40.20 |    28.76    |  45.07  |
 |   **ProSparse-13B**   |        88.80        |       28.42        |          69.76           |          66.91           | 26.31 | 54.35 | 39.90 |    28.67    |  44.90  |
-**Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. "ProSparse-7B$^*$" and "ProSparse-13B$^*$" denote the ProSparse versions without activation threshold shifting.
 ### Inference Acceleration Effects
@@ -101,10 +101,10 @@ First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of
 Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:
-- Step (2): a fused operator of ReLU and $\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)$;
-- Step (3): a sparse matrix-vector multiplication operator $\mathbf{x}_1 \mathbf{W}_2^T$.
-where $\mathbf{s}$, $\mathbf{x}$, $\mathbf{x}_1$, and $\odot$ denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. $\mathbf{W}_1$ and $\mathbf{W}_2$ are FFN weight matrices.
 The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](TODO) for more details.
@@ -112,14 +112,14 @@ The acceleration effects of LLMs with different sparsity are displayed as follow
 | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
 |     ReluLLaMA-7B      |        66.98        |        90.89         |         58.95         |        11.37        |      67.12       |        1.35         |       63.00       |         1.32         |
 |    Vanilla ReLU-7B    |        66.04        |        87.72         |         72.57         |        12.04        |      67.85       |        1.33         |       63.28       |         1.31         |
-|    Fixed $L_1$-7B     |        91.46        |        94.51         |         82.85         |        19.62        |      40.99       |        2.21         |       54.19       |         1.53         |
-| **ProSparse-7B**$^*$  |        88.11        |        93.46         |         75.24         |        16.30        |      46.66       |        1.94         |       55.56       |         1.49         |
 |   **ProSparse-7B**    |        89.32        |        92.34         |         78.75         |          -          |      45.38       |        2.00         |       55.05       |         1.51         |
 |     ReluLLaMA-13B     |        71.56        |        86.41         |         71.93         |        6.59         |      69.92       |        1.88         |       75.47       |         1.51         |
-| **ProSparse-13B**$^*$ |        87.97        |        91.02         |         77.93         |        8.67         |      55.29       |        2.38         |       67.50       |         1.68         |
 |   **ProSparse-13B**   |        88.80        |        91.11         |         78.28         |          -          |      53.78       |        2.44         |       66.73       |         1.70         |
-**Notes**: Fixed $L_1$ suffers from severe performance degradation. ProSparse with Activation Threshold Shifting is not supported by PowerInfer. "Time" means the average wall-clock time (us) cost by each step with our sparse GPU operators, and "Speedup" is the speedup ratio to the setting without operators. The average time for step (2) and (3) without sparse GPU operators is about **90.55 and 82.92 (us) for 7B, 131.36 and 113.68 (us) for 13B** respectively under all sparsity.
 ### License Disclaimer

 |     ReluLLaMA-7B      |        66.98        |       15.85        |          69.64           |          70.54           | 5.84  | 38.64 | 35.07 |    27.73    |  37.62  |
 |    Vanilla ReLU-7B    |        66.04        |       21.31        |          70.73           |          73.22           | 11.22 | 49.22 | 36.11 |    28.01    |  41.40  |
 |    Shifted ReLU-7B    |        69.59        |       20.50        |          70.09           |          73.17           | 13.87 | 48.54 | 35.20 |    27.94    |  41.33  |
+|    Fixed \\(L_1\\)-7B     |        91.46        |       18.85        |          66.01           |          55.39           | 2.27  | 32.28 | 31.40 |    26.48    |  33.24  |
+| **ProSparse-7B**\\(^*\\)  |        88.11        |       19.47        |          66.29           |          63.33           | 12.74 | 45.21 | 33.59 |    27.55    |  38.31  |
 |   **ProSparse-7B**    |        89.32        |       19.42        |          66.27           |          63.50           | 12.13 | 45.48 | 34.99 |    27.46    |  38.46  |
 |     Original-13B      |          -          |       20.19        |          72.58           |          71.55           | 22.21 | 54.69 | 37.89 |    29.33    |  44.06  |
 |     ReluLLaMA-13B     |        71.56        |       20.19        |          70.44           |          73.29           | 18.50 | 50.58 | 37.97 |    28.22    |  42.74  |
+| **ProSparse-13B**\\(^*\\) |        87.97        |       29.03        |          69.75           |          67.54           | 25.40 | 54.78 | 40.20 |    28.76    |  45.07  |
 |   **ProSparse-13B**   |        88.80        |       28.42        |          69.76           |          66.91           | 26.31 | 54.35 | 39.90 |    28.67    |  44.90  |
+**Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. "ProSparse-7B\\(^*\\)" and "ProSparse-13B\\(^*\\)" denote the ProSparse versions without activation threshold shifting.
 ### Inference Acceleration Effects
 Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:
+- Step (2): a fused operator of ReLU and \\(\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)\\);
+- Step (3): a sparse matrix-vector multiplication operator \\(\mathbf{x}_1 \mathbf{W}_2^T\\).
+where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.
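
As a rough illustration of why these two steps benefit from activation sparsity, the pure-Python sketch below compares a dense gated FFN with a variant that skips neurons whose gate is zero. It is only a sketch under stated assumptions: the function names, the gating matrix `W_s`, and the idea that step (1) produces the gate pre-activations are hypothetical and not taken from the operator repository, whose real implementation consists of GPU kernels.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(a, 0.0) for a in v]


def matvec(W, x):
    """Return W @ x for a row-major list-of-lists matrix W."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def gated_ffn_dense(x, W_s, W_1, W_2):
    """Dense reference: every intermediate neuron is computed."""
    gate = matvec(W_s, x)                    # assumed step (1): gate pre-activations
    s = relu(gate)                           # gating scores s
    x1 = [g * u for g, u in zip(s, matvec(W_1, x))]  # step (2): s * (x W_1^T)
    return matvec(W_2, x1)                   # step (3): x_1 W_2^T


def gated_ffn_sparse(x, W_s, W_1, W_2):
    """Sparse variant: rows of W_1 and columns of W_2 with a zero gate are skipped."""
    gate = matvec(W_s, x)                    # assumed step (1), still dense
    s = relu(gate)
    active = [i for i, g in enumerate(s) if g > 0.0]
    # Step (2): only the active rows of W_1 are touched.
    x1 = {i: s[i] * sum(w * a for w, a in zip(W_1[i], x)) for i in active}
    # Step (3): only the active columns of W_2 contribute to the output.
    return [sum(row[i] * x1[i] for i in active) for row in W_2]
```

Both functions return the same output; the sparse one does work proportional to the number of active neurons, which is the quantity that a higher activation sparsity reduces.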
 The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](TODO) for more details.
 | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
 |     ReluLLaMA-7B      |        66.98        |        90.89         |         58.95         |        11.37        |      67.12       |        1.35         |       63.00       |         1.32         |
 |    Vanilla ReLU-7B    |        66.04        |        87.72         |         72.57         |        12.04        |      67.85       |        1.33         |       63.28       |         1.31         |
+|    Fixed \\(L_1\\)-7B     |        91.46        |        94.51         |         82.85         |        19.62        |      40.99       |        2.21         |       54.19       |         1.53         |
+| **ProSparse-7B**\\(^*\\)  |        88.11        |        93.46         |         75.24         |        16.30        |      46.66       |        1.94         |       55.56       |         1.49         |
 |   **ProSparse-7B**    |        89.32        |        92.34         |         78.75         |          -          |      45.38       |        2.00         |       55.05       |         1.51         |
 |     ReluLLaMA-13B     |        71.56        |        86.41         |         71.93         |        6.59         |      69.92       |        1.88         |       75.47       |         1.51         |
+| **ProSparse-13B**\\(^*\\) |        87.97        |        91.02         |         77.93         |        8.67         |      55.29       |        2.38         |       67.50       |         1.68         |
 |   **ProSparse-13B**   |        88.80        |        91.11         |         78.28         |          -          |      53.78       |        2.44         |       66.73       |         1.70         |
+**Notes**: Fixed \\(L_1\\) suffers from severe performance degradation. ProSparse with Activation Threshold Shifting is not supported by PowerInfer. "Time" means the average wall-clock time (us) cost by each step with our sparse GPU operators, and "Speedup" is the speedup ratio to the setting without operators. The average time for step (2) and (3) without sparse GPU operators is about **90.55 and 82.92 (us) for 7B, 131.36 and 113.68 (us) for 13B** respectively under all sparsity.
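
The "Speedup" columns can be reproduced from the "Time" columns and the dense baselines quoted in the notes. A small sketch (the helper name `speedup` and the dictionary layout are illustrative; the numbers are copied from the table and notes above):

```python
# Average dense time in microseconds for steps (2) and (3), from the notes.
BASELINE_US = {"7B": (90.55, 82.92), "13B": (131.36, 113.68)}


def speedup(baseline_us, sparse_us):
    """Speedup ratio of the sparse operator over the dense baseline, rounded to 2 decimals."""
    return round(baseline_us / sparse_us, 2)


# ProSparse-7B: step (2) takes 45.38 us, step (3) takes 55.05 us.
step2_7b = speedup(BASELINE_US["7B"][0], 45.38)
step3_7b = speedup(BASELINE_US["7B"][1], 55.05)

# ProSparse-13B: step (2) takes 53.78 us, step (3) takes 66.73 us.
step2_13b = speedup(BASELINE_US["13B"][0], 53.78)
step3_13b = speedup(BASELINE_US["13B"][1], 66.73)

print(step2_7b, step3_7b, step2_13b, step3_13b)
```

These ratios agree with the tabulated values of 2.00, 1.51, 2.44, and 1.70 for the two ProSparse rows.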
 ### License Disclaimer