Raincleared committed on
Commit
80c4c87
1 Parent(s): a14d09d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +11 -11
README.md CHANGED
@@ -85,15 +85,15 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
85
  | ReluLLaMA-7B | 66.98 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 |
86
  | Vanilla ReLU-7B | 66.04 | 21.31 | 70.73 | 73.22 | 11.22 | 49.22 | 36.11 | 28.01 | 41.40 |
87
  | Shifted ReLU-7B | 69.59 | 20.50 | 70.09 | 73.17 | 13.87 | 48.54 | 35.20 | 27.94 | 41.33 |
88
- | Fixed $L_1$-7B | 91.46 | 18.85 | 66.01 | 55.39 | 2.27 | 32.28 | 31.40 | 26.48 | 33.24 |
89
- | **ProSparse-7B**$^*$ | 88.11 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 | 38.31 |
90
  | **ProSparse-7B** | 89.32 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 | 38.46 |
91
  | Original-13B | - | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 | 44.06 |
92
  | ReluLLaMA-13B | 71.56 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 | 42.74 |
93
- | **ProSparse-13B**$^*$ | 87.97 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 | 45.07 |
94
  | **ProSparse-13B** | 88.80 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 | 44.90 |
95
 
96
- **Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. "ProSparse-7B$^*$" and "ProSparse-13B$^*$" denote the ProSparse versions without activation threshold shifting.
97
 
98
  ### Inference Acceleration Effects
99
 
@@ -101,10 +101,10 @@ First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of
101
 
102
  Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:
103
 
104
- - Step (2): a fused operator of ReLU and $\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)$;
105
- - Step (3): a sparse matrix-vector multiplication operator $\mathbf{x}_1 \mathbf{W}_2^T$.
106
 
107
- where $\mathbf{s}$, $\mathbf{x}$, $\mathbf{x}_1$, and $\odot$ denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. $\mathbf{W}_1$ and $\mathbf{W}_2$ are FFN weight matrices.
108
 
109
  The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](TODO) for more details.
110
 
@@ -112,14 +112,14 @@ The acceleration effects of LLMs with different sparsity are displayed as follow
112
  | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
113
  | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 67.12 | 1.35 | 63.00 | 1.32 |
114
  | Vanilla ReLU-7B | 66.04 | 87.72 | 72.57 | 12.04 | 67.85 | 1.33 | 63.28 | 1.31 |
115
- | Fixed $L_1$-7B | 91.46 | 94.51 | 82.85 | 19.62 | 40.99 | 2.21 | 54.19 | 1.53 |
116
- | **ProSparse-7B**$^*$ | 88.11 | 93.46 | 75.24 | 16.30 | 46.66 | 1.94 | 55.56 | 1.49 |
117
  | **ProSparse-7B** | 89.32 | 92.34 | 78.75 | - | 45.38 | 2.00 | 55.05 | 1.51 |
118
  | ReluLLaMA-13B | 71.56 | 86.41 | 71.93 | 6.59 | 69.92 | 1.88 | 75.47 | 1.51 |
119
- | **ProSparse-13B**$^*$ | 87.97 | 91.02 | 77.93 | 8.67 | 55.29 | 2.38 | 67.50 | 1.68 |
120
  | **ProSparse-13B** | 88.80 | 91.11 | 78.28 | - | 53.78 | 2.44 | 66.73 | 1.70 |
121
 
122
- **Notes**: Fixed $L_1$ suffers from severe performance degradation. ProSparse with Activation Threshold Shifting is not supported by PowerInfer. "Time" means the average wall-clock time (us) cost by each step with our sparse GPU operators, and "Speedup" is the speedup ratio to the setting without operators. The average time for step (2) and (3) without sparse GPU operators is about **90.55 and 82.92 (us) for 7B, 131.36 and 113.68 (us) for 13B** respectively under all sparsity.
123
 
124
  ### License Disclaimer
125
 
 
85
  | ReluLLaMA-7B | 66.98 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 |
86
  | Vanilla ReLU-7B | 66.04 | 21.31 | 70.73 | 73.22 | 11.22 | 49.22 | 36.11 | 28.01 | 41.40 |
87
  | Shifted ReLU-7B | 69.59 | 20.50 | 70.09 | 73.17 | 13.87 | 48.54 | 35.20 | 27.94 | 41.33 |
88
+ | Fixed \\(L_1\\)-7B | 91.46 | 18.85 | 66.01 | 55.39 | 2.27 | 32.28 | 31.40 | 26.48 | 33.24 |
89
+ | **ProSparse-7B**\\(^*\\) | 88.11 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 | 38.31 |
90
  | **ProSparse-7B** | 89.32 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 | 38.46 |
91
  | Original-13B | - | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 | 44.06 |
92
  | ReluLLaMA-13B | 71.56 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 | 42.74 |
93
+ | **ProSparse-13B**\\(^*\\) | 87.97 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 | 45.07 |
94
  | **ProSparse-13B** | 88.80 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 | 44.90 |
95
 
96
+ **Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. "ProSparse-7B\\(^*\\)" and "ProSparse-13B\\(^*\\)" denote the ProSparse versions without activation threshold shifting.
97
 
98
  ### Inference Acceleration Effects
99
 
 
101
 
102
  Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:
103
 
104
+ - Step (2): a fused operator of ReLU and \\(\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)\\);
105
+ - Step (3): a sparse matrix-vector multiplication operator \\(\mathbf{x}_1 \mathbf{W}_2^T\\).
106
 
107
+ where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.
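The two steps above can be illustrated with a minimal pure-Python sketch. It is not the linked CUDA operators; it only shows why skipping inactive neurons leaves the result unchanged. The function names, the shapes, and the reading of step (2) as `ReLU(s) * (x W1^T)` are assumptions made for illustration.

```python
import math
import random

def dense_gated_ffn(s, x, W1, W2):
    """Reference path: computes every intermediate neuron.

    Assumed shapes: s has d_ff entries, x has d_model entries,
    W1 is d_ff x d_model, W2 is d_model x d_ff.
    """
    up = [sum(w * xi for w, xi in zip(row, x)) for row in W1]   # x W1^T
    x1 = [max(si, 0.0) * ui for si, ui in zip(s, up)]           # step (2)
    return [sum(w * v for w, v in zip(row, x1)) for row in W2]  # step (3)

def sparse_gated_ffn(s, x, W1, W2):
    """Sparse path: touches only neurons whose gating score is positive."""
    active = [i for i, si in enumerate(s) if si > 0.0]
    # Step (2): only the active rows of W1 are multiplied with x.
    x1 = {i: s[i] * sum(w * xi for w, xi in zip(W1[i], x)) for i in active}
    # Step (3): only the active columns of W2 contribute to the output.
    return [sum(row[i] * v for i, v in x1.items()) for row in W2]

random.seed(0)
d_model, d_ff = 4, 16
x = [random.gauss(0, 1) for _ in range(d_model)]
s = [random.gauss(-1, 1) for _ in range(d_ff)]  # shifted so most scores are negative
W1 = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
W2 = [[random.gauss(0, 1) for _ in range(d_ff)] for _ in range(d_model)]

dense_out = dense_gated_ffn(s, x, W1, W2)
sparse_out = sparse_gated_ffn(s, x, W1, W2)
outputs_match = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(dense_out, sparse_out))
```

The sparse path does work proportional to the number of active neurons, which is where higher activation sparsity would translate into lower step time.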
108
 
109
  The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](TODO) for more details.
110
 
 
112
  | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
113
  | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 67.12 | 1.35 | 63.00 | 1.32 |
114
  | Vanilla ReLU-7B | 66.04 | 87.72 | 72.57 | 12.04 | 67.85 | 1.33 | 63.28 | 1.31 |
115
+ | Fixed \\(L_1\\)-7B | 91.46 | 94.51 | 82.85 | 19.62 | 40.99 | 2.21 | 54.19 | 1.53 |
116
+ | **ProSparse-7B**\\(^*\\) | 88.11 | 93.46 | 75.24 | 16.30 | 46.66 | 1.94 | 55.56 | 1.49 |
117
  | **ProSparse-7B** | 89.32 | 92.34 | 78.75 | - | 45.38 | 2.00 | 55.05 | 1.51 |
118
  | ReluLLaMA-13B | 71.56 | 86.41 | 71.93 | 6.59 | 69.92 | 1.88 | 75.47 | 1.51 |
119
+ | **ProSparse-13B**\\(^*\\) | 87.97 | 91.02 | 77.93 | 8.67 | 55.29 | 2.38 | 67.50 | 1.68 |
120
  | **ProSparse-13B** | 88.80 | 91.11 | 78.28 | - | 53.78 | 2.44 | 66.73 | 1.70 |
121
 
122
+ **Notes**: Fixed \\(L_1\\) suffers from severe performance degradation. ProSparse with Activation Threshold Shifting is not supported by PowerInfer. "Time" means the average wall-clock time (us) cost by each step with our sparse GPU operators, and "Speedup" is the speedup ratio to the setting without operators. The average time for step (2) and (3) without sparse GPU operators is about **90.55 and 82.92 (us) for 7B, 131.36 and 113.68 (us) for 13B** respectively under all sparsity.
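As a quick check of how "Speedup" relates to "Time", the short sketch below divides the baseline time quoted in the notes by a per-step time from the table. The helper name and the mapping of table columns to step (2) and step (3) are assumptions, since the table header is not part of this hunk.

```python
# Baseline average times (us) without sparse GPU operators, as quoted in the notes.
BASELINE_US = {
    "7B": {"step2": 90.55, "step3": 82.92},
    "13B": {"step2": 131.36, "step3": 113.68},
}

def speedup(model_size, step, time_us):
    """Speedup ratio relative to the setting without operators, rounded to 2 decimals."""
    return round(BASELINE_US[model_size][step] / time_us, 2)

# Times assumed to be the step (2) and step (3) columns of the ProSparse rows.
prosparse_7b = (speedup("7B", "step2", 45.38), speedup("7B", "step3", 55.05))
prosparse_13b = (speedup("13B", "step2", 53.78), speedup("13B", "step3", 66.73))
print(prosparse_7b, prosparse_13b)
```

Under this column mapping, the computed ratios agree with the speedup values listed in the ProSparse-7B and ProSparse-13B rows.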
123
 
124
  ### License Disclaimer
125