yixinsong committed
Commit e809a07
1 Parent(s): 439b36d

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -8,7 +8,7 @@ However, the widespread adoption of ReLU-based models in the LLM field remains l
 
  ## Model Architecture
 
- As ReGLU-based LLMs still have limited sparsity (for example, [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) reaches only about 67% sparsity), we add a ReLU component after the GLU to push the model's sparsity further. Our FFN network therefore works as follows:
+ To push the model's sparsity further, we add a ReLU component after the GLU component, called dReLU (double ReLU). Our FFN network therefore works as follows:
 
  ```Python
  class BambooMLP(nn.Module):
@@ -30,7 +30,7 @@ class BambooMLP(nn.Module):
 
  In this section, we introduce the details of training our model, including the types of data used and the hyperparameters.
 
- We initialized the model weights to Mistral's model weights and modified the FFN structure to the ReGLU+ReLU structure, then continued pre-training for 200B tokens, divided into two phases:
+ We initialized the model weights to Mistral's model weights and modified the FFN structure to the dReLU structure, then continued pre-training for 200B tokens, divided into two phases:
 
  **First phase**: For the proportion of the training corpus, we followed the data mix ratio and sources of the StableLM-3B model ([link](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo)), conducting further pre-training with 150B tokens.
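
The diff shows only the first line of `BambooMLP`, so the following is a minimal sketch of a dReLU feed-forward block as the prose describes it: ReLU applied to both the gate and the up projection of a Mistral-style MLP. The standalone constructor arguments, the layer sizes, and the sparsity check are illustrative assumptions, not the repository's exact code.

```Python
import torch
import torch.nn as nn


class BambooMLP(nn.Module):
    """Sketch of a dReLU FFN: ReLU on BOTH the gate and the up projection,
    unlike ReGLU, which activates only the gate branch."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Mistral-style bias-free projections (assumed layout).
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # dReLU: relu(gate(x)) * relu(up(x)), then project back down.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))


# Quick, illustrative sparsity check with random weights and inputs
# (4096/14336 are Mistral-7B's sizes, used here only as an example):
mlp = BambooMLP(hidden_size=4096, intermediate_size=14336)
x = torch.randn(2, 16, 4096)
inter = mlp.act_fn(mlp.gate_proj(x)) * mlp.act_fn(mlp.up_proj(x))
print(f"intermediate sparsity: {(inter == 0).float().mean().item():.1%}")
```

Because an intermediate unit is zeroed whenever either branch is negative, random Gaussian inputs already give roughly 75% zeros here, which matches the intuition that activating both branches pushes sparsity past the roughly 67% that a ReLU-gated GLU alone achieves.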
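For the weight-initialization step described above, converting from Mistral amounts to reusing each layer's existing projection matrices and changing only the activation pattern. A hypothetical sketch, assuming the Hugging Face `MistralForCausalLM` layout and the `BambooMLP` sketch above (the checkpoint id is illustrative):

```Python
from transformers import MistralForCausalLM

# Load the base Mistral checkpoint (model id is illustrative).
model = MistralForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

for layer in model.model.layers:
    old_mlp = layer.mlp
    new_mlp = BambooMLP(
        hidden_size=old_mlp.gate_proj.in_features,
        intermediate_size=old_mlp.gate_proj.out_features,
    )
    # Reuse Mistral's projection weights unchanged; only the activation
    # changes (SiLU-gated GLU -> dReLU), which the 200B-token continued
    # pre-training described above then adapts the weights to.
    new_mlp.gate_proj.weight = old_mlp.gate_proj.weight
    new_mlp.up_proj.weight = old_mlp.up_proj.weight
    new_mlp.down_proj.weight = old_mlp.down_proj.weight
    layer.mlp = new_mlp
```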