Raincleared committed
Commit a14d09d
1 Parent(s): d885f49

Upload README.md with huggingface_hub

Files changed (1): README.md (+8 -8)
@@ -51,19 +51,19 @@ Intuitively, training the model with even more tokens or with data of a wider co
 The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](TODO) for more details):

 1. **Activation Function Substitution**: We substitute the activation function of FFNs with ReLU and apply continual training;
-2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and \\(`L_1`\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs with a regularization factor increasing progressively in multiple stages. Specifically, the regularization factor $`\lambda`$ is set to a small constant for the warmup stage, and then increases along a smooth sine curve for each of the subsequent incremental stages. Each stage is accompanied by certain steps of training. In this way, the model can have more time to adapt to the increasing regularization without radical activation shifts, thus alleviating performance degradation.
+2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and \\(L_1\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs with a regularization factor increasing progressively in multiple stages. Specifically, the regularization factor \\(\lambda\\) is set to a small constant for the warmup stage, and then increases along a smooth sine curve for each of the subsequent incremental stages. Each stage is accompanied by certain steps of training. In this way, the model can have more time to adapt to the increasing regularization without radical activation shifts, thus alleviating performance degradation.
 3. **Activation Threshold Shifting**: We finally replace ReLU with FATReLU ([Kurtz et al., 2020](https://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf)), a ReLU variant with a positive threshold. This can prune those non-zero weakly-contributed elements in activation outputs and further boost sparsity.

-The 7B model is trained on 8 A100 GPUs. The learning rate (LR) is controlled by a cosine scheduler with a peak LR of $`3e-5`$. The hyper-parameters for each stage (including the regularization factor $`\lambda_i`$, the accumulated training steps $`T_i`$, and the accumulated training tokens) are shown as follows:
+The 7B model is trained on 8 A100 GPUs. The learning rate (LR) is controlled by a cosine scheduler with a peak LR of \\(3e-5\\). The hyper-parameters for each stage (including the regularization factor \\(\lambda_i\\), the accumulated training steps \\(T_i\\), and the accumulated training tokens) are shown as follows:

-| Step Number $`i`$ | $`\lambda_i`$ | $`T_i`$ | Accumulated Tokens (B) |
+| Step Number \\(i\\) | \\(\lambda_i\\) | \\(T_i\\) | Accumulated Tokens (B) |
 | :-------------: | :---------: | :----: | :--------------------: |
 | 0 | 0 | 5,000 | 10.49 |
-| 1 | $`5e-3`$ | 6,000 | 12.58 |
-| 2 | $`5e-2`$ | 10,000 | 20.97 |
-| 3 | $`5e-2`$ | 12,000 | 25.17 |
-| 4 | $`5e-1`$ | 16,000 | 33.55 |
-| 5 | $`5e-1`$ | 16,500 | 34.60 |
+| 1 | \\(5e-3\\) | 6,000 | 12.58 |
+| 2 | \\(5e-2\\) | 10,000 | 20.97 |
+| 3 | \\(5e-2\\) | 12,000 | 25.17 |
+| 4 | \\(5e-1\\) | 16,000 | 33.55 |
+| 5 | \\(5e-1\\) | 16,500 | 34.60 |

 ### Evaluation Benchmarks
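
For concreteness, here is a minimal PyTorch sketch of the FATReLU activation named in step 3 of the recipe above: it behaves like ReLU but also zeroes every value at or below a positive threshold, pruning weakly contributing activations. The class name follows the Kurtz et al. paper; the default threshold value is purely illustrative and is not the threshold ProSparse actually uses.

```python
import torch
import torch.nn as nn

class FATReLU(nn.Module):
    """ReLU variant with a positive threshold: values at or below the
    threshold are zeroed out, further sparsifying the activations."""

    def __init__(self, threshold: float = 0.01):
        super().__init__()
        self.threshold = threshold  # illustrative value, not ProSparse's

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Keep activations strictly above the threshold, zero the rest.
        return torch.where(x > self.threshold, x, torch.zeros_like(x))

# Step 1 of the recipe swaps the FFN activation for a plain nn.ReLU();
# step 3 then replaces that ReLU with FATReLU to prune the remaining
# non-zero but weakly contributing activation values.
```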
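
The staged regularization schedule and the table above can also be read as code. The sketch below holds the factor constant through the warmup stage and then ramps it from the previous stage's value to the current stage's value along a smooth half-sine, using the T_i and λ_i values from the table; the exact curve and the way the L1 term enters the loss are assumptions based on the description above, not the authors' released code.

```python
import math

# Accumulated step boundaries T_i and regularization factors lambda_i,
# taken from the hyper-parameter table above.
STAGE_STEPS = [5_000, 6_000, 10_000, 12_000, 16_000, 16_500]
STAGE_LAMBDA = [0.0, 5e-3, 5e-2, 5e-2, 5e-1, 5e-1]

def reg_factor(step: int) -> float:
    """Regularization factor at a given step: constant in the warmup stage,
    then a smooth sine ramp from the previous stage's factor to the current
    stage's factor (an assumed reading of the schedule)."""
    if step <= STAGE_STEPS[0]:
        return STAGE_LAMBDA[0]
    for i in range(1, len(STAGE_STEPS)):
        if step <= STAGE_STEPS[i]:
            t0, t1 = STAGE_STEPS[i - 1], STAGE_STEPS[i]
            lam0, lam1 = STAGE_LAMBDA[i - 1], STAGE_LAMBDA[i]
            progress = (step - t0) / (t1 - t0)  # 0 -> 1 within the stage
            return lam0 + (lam1 - lam0) * math.sin(0.5 * math.pi * progress)
    return STAGE_LAMBDA[-1]

# The factor scales an L1 penalty on the FFN's sparse intermediate
# activations, added to the next-token prediction loss, e.g.:
#   loss = lm_loss + reg_factor(step) * intermediate_acts.abs().mean()
```

Under this sketch, for example, `reg_factor(5_500)` sits halfway through stage 1 and returns roughly 3.5e-3, on its way up to the stage's target of 5e-3.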