yixinsong committed on
Commit
bda84dd
1 Parent(s): c4f70ae

update README

Files changed (1)
  1. README.md +60 -30
README.md CHANGED
@@ -1,10 +1,17 @@
- Intro
- Sparse computing is increasingly recognized as an important direction to improve the computational efficiency of large language models (LLM). Among various approaches, a mixture of experts (MoE) methods (exemplified by models such as Mixtral) show particular promise. MoE works by selectively activating different model components (experts), thereby optimizing resource usage.
- Recent studies (Zhang et al., 2021; Liu et al., 2023; Mirzadeh et al., 2023) have shown that LLM inherently exhibits properties that favor sparse computation when using the ReLU activation function. This insight opens new avenues for model efficiency, similar to the selective activation of MoE. By dynamically selecting model parameters for calculations, we can significantly improve efficiency.
- However, the widespread adoption of ReLU-based models in the LLM field is still limited. To inspire more research for inference efficiency, we introduce our Mistral-level ReLU-based LLM model, XXX.
- Model Architecture
- As the ReGLU-based LLM has limited sparsity, for example, ReLULLaMA has just nearly 70% sparsity. To further push the sparsity, we add a relu component after GLU. So our FFN network works as follows:
- class XXXMLP(nn.Module):
      def __init__(self, config):
          super().__init__()
          self.config = config
@@ -17,37 +24,60 @@ class XXXMLP(nn.Module):

      def forward(self, x):
          return self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))
- Training Details
- In this subsection, we will introduce the details of training our model, including types of data used, and hyperparameters.
  We initialized the model from Mistral's weights, changed the FFN to the ReGLU+ReLU structure, and then continued pre-training for 200B tokens in two phases:
- First phase: For the proportion of training corpus, we followed the data mix ratio and sources of the StableLM-3B model, conducting a further pre-training with 150B tokens.(link)
  The following table shows the hyper-parameters we used in our training process.
- |-------------------------|----------------------|
- | GPUs | 64 80G-A100 |
- | Learning Rate Control | Cosine |
- | Peak Learning Rate | 5e-5 |
- | Batch Size | 4M |
- | Weight Decay | 0.1 |
- Second phase: We further adjusted the training corpus ratio, incorporating more domain-specific datasets(Math、Coding), and continued training for 50B tokens.
- |-------------------------|----------------------|
- | GPUs | 64 80G-A100 |
- | Learning Rate Control | Cosine |
- | Peak Learning Rate | 5e-6 |
- | Batch Size | 4M |
- | Weight Decay | 0.01 |
- Performance Evaluation Results
  Our evaluation is based on the lm-evaluation-harness and opencompass frameworks. The evaluation details are listed as follows:
  - Hugging Face LLM Leaderboard tasks.
  - Commonsense: We report the average of PIQA, SIQA, ARC (easy and challenge), and CommonsenseQA.
  - Other Popular Benchmarks: We report the average accuracies on Big Bench Hard (BBH) (3-shot), HumanEval, MBPP, and MATH.

- | MMLU | Winogrande | TruthfulQA | Hellaswag | GSM8K | Arc-C | | | | | | |
- |---------|------------|------------|-----------|--------|--------|--------|---|---|---|---|---|
- | Ours | 0.6389 | 0.7593 | 0.4406 | 0.8217 | 0.5315 | 0.6195 | | | | | |
- | Mistral | 0.6265 | 0.7924 | 0.4262 | 0.8332 | 0.4018 | 0.6143 | | | | | |

- Speed Evaluation Results
- We utilize PowerInfer, a state-of-the-art acceleration framework leveraging activation sparsity. Here we show the inference speed compared with llama.cpp/transformers.
+ ## Introduction
+
+ Sparse computing is increasingly recognized as an important direction for improving the computational efficiency of large language models (LLMs). Among various approaches, mixture-of-experts (MoE) methods (exemplified by models such as [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)) show particular promise. MoE works by selectively activating different model components (experts), thereby optimizing resource usage.
+
+ Recent studies ([Zhang et al., 2021](https://arxiv.org/abs/2110.01786); [Liu et al., 2023](https://openreview.net/pdf?id=wIPIhHd00i); [Mirzadeh et al., 2023](https://arxiv.org/abs/2310.04564)) reveal that LLMs inherently exhibit properties conducive to sparse computation when they use the ReLU activation function. This insight opens up new avenues for model efficiency, akin to MoE's selective activation: by dynamically choosing which model parameters take part in a computation, we can substantially boost efficiency.
+
+ However, adoption of ReLU-based models in the LLM field remains limited. Here we introduce Bamboo, a new 7B ReLU-based LLM with nearly 85% sparsity and performance on par with [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1).
+
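To make the idea of dynamically choosing parameters concrete, here is a minimal pure-Python sketch. It is not Bamboo's or PowerInfer's actual implementation: the names `down_w` and `act`, the toy sizes, and the way the zeros are generated are all assumptions made for illustration. It shows that when most activations are exactly zero after ReLU, a projection only needs the weight columns that line up with non-zero activations, and still produces the same result as the dense product.

```python
import random


def dense_matvec(weight, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a vector, touching every weight."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def sparse_matvec(weight, vec):
    """Same product, but skip every column whose activation is exactly zero.

    Returns the output vector and the number of columns actually used.
    """
    active = [j for j, v in enumerate(vec) if v != 0.0]
    out = [sum(row[j] * vec[j] for j in active) for row in weight]
    return out, len(active)


rng = random.Random(0)
INTERMEDIATE, HIDDEN = 64, 8  # toy sizes (assumption), not Bamboo's real dimensions

# Hypothetical post-ReLU activations: a negative mean makes most entries zero.
act = [max(0.0, rng.gauss(-1.0, 1.0)) for _ in range(INTERMEDIATE)]
down_w = [[rng.gauss(0.0, 1.0) for _ in range(INTERMEDIATE)] for _ in range(HIDDEN)]

dense_out = dense_matvec(down_w, act)
sparse_out, n_active = sparse_matvec(down_w, act)
print(f"columns used: {n_active} of {INTERMEDIATE}")
```

The saving here is only in the number of multiply-adds; turning that into wall-clock speedups needs a runtime built around activation sparsity.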
+ ## Model Architecture
+
+ ReGLU-based LLMs have limited sparsity; for example, [ReLULLaMA](https://huggingface.co/SparseLLM/ReluLLaMA-7B) reaches only about 67% sparsity. To push the model's sparsity further, we add a ReLU component after the GLU, so our FFN works as follows:
+
+ ```Python
+ class BambooMLP(nn.Module):
      def __init__(self, config):
          super().__init__()
          self.config = config
@@ -17,37 +24,60 @@ class XXXMLP(nn.Module):

      def forward(self, x):
          return self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))
+ ```
+
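The effect of the extra ReLU on sparsity can be illustrated with a small standard-library sketch. It assumes `act_fn` is ReLU and uses random Gaussian weights with toy dimensions, so it says nothing about the trained model's actual sparsity; `ffn_intermediate`, `relu_on_up`, and the sizes are hypothetical names and values chosen for this example. It compares the fraction of exact zeros entering `down_proj` for plain ReGLU, `relu(gate(x)) * up(x)`, against the `forward()` shown above, `relu(gate(x)) * relu(up(x))`.

```python
import random


def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def matvec(weight, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def ffn_intermediate(x, gate_w, up_w, relu_on_up):
    """Return the vector that would be fed to down_proj.

    relu_on_up=False mimics plain ReGLU: relu(gate(x)) * up(x).
    relu_on_up=True mimics the forward() above, assuming act_fn is ReLU.
    """
    gate = relu(matvec(gate_w, x))
    up = matvec(up_w, x)
    if relu_on_up:
        up = relu(up)
    return [g * u for g, u in zip(gate, up)]


def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


rng = random.Random(0)
HIDDEN, INTERMEDIATE = 16, 4000  # toy sizes (assumption), not Bamboo's real dimensions
x = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
gate_w = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(INTERMEDIATE)]
up_w = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(INTERMEDIATE)]

reglu_sparsity = sparsity(ffn_intermediate(x, gate_w, up_w, relu_on_up=False))
dual_relu_sparsity = sparsity(ffn_intermediate(x, gate_w, up_w, relu_on_up=True))
print(f"ReGLU: {reglu_sparsity:.2f}  ReGLU+ReLU: {dual_relu_sparsity:.2f}")
```

With independent, symmetric random weights, about half of the gate outputs are negative, so ReGLU zeroes roughly half of the intermediate entries, and the second ReLU raises that to roughly three quarters. The figures quoted in the README come from trained models and are not reproduced by this toy setup.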
+ ## Training Details
+
+ In this section, we introduce the details of training our model, including the types of data used and the hyperparameters.
+
  We initialized the model from Mistral's weights, changed the FFN to the ReGLU+ReLU structure, and then continued pre-training for 200B tokens in two phases:
+
+ **First phase**: For the proportions of the training corpus, we followed the data mix ratio and sources of the StableLM-3B model and continued pre-training for 150B tokens. (link)
+
  The following table shows the hyper-parameters we used in our training process.
+
+ | Hyper-parameter | Value |
+ | --------------------- | ----------- |
+ | GPUs | 64 × A100 (80 GB) |
+ | Learning Rate Control | Cosine |
+ | Peak Learning Rate | 5e-5 |
+ | Batch Size | 4M |
+ | Weight Decay | 0.1 |
+
+ **Second phase**: We further adjusted the training corpus ratio, incorporating more domain-specific datasets (math, coding), and continued training for 50B tokens.
+
+ | Hyper-parameter | Value |
+ | --------------------- | ----------- |
+ | GPUs | 64 × A100 (80 GB) |
+ | Learning Rate Control | Cosine |
+ | Peak Learning Rate | 5e-6 |
+ | Batch Size | 4M |
+ | Weight Decay | 0.01 |
+
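As a rough illustration of what these settings imply, the sketch below converts each phase's token budget into optimizer steps and evaluates a cosine learning-rate curve. Several things here are assumptions not stated in the README: that "4M" means 4,000,000 tokens per step, that there is no warmup, and that the schedule decays to a minimum learning rate of zero within each phase. The function names `total_steps` and `cosine_lr` are hypothetical.

```python
import math

BATCH_TOKENS = 4_000_000  # assumption: "4M" is read as 4 million tokens per step

PHASES = {
    "first": {"tokens": 150_000_000_000, "peak_lr": 5e-5},
    "second": {"tokens": 50_000_000_000, "peak_lr": 5e-6},
}


def total_steps(tokens, batch_tokens=BATCH_TOKENS):
    """Number of optimizer steps needed to consume `tokens` tokens."""
    return tokens // batch_tokens


def cosine_lr(step, steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at the final step (no warmup assumed)."""
    progress = min(max(step / steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for name, phase in PHASES.items():
    steps = total_steps(phase["tokens"])
    halfway = cosine_lr(steps // 2, steps, phase["peak_lr"])
    print(f"{name} phase: {steps} steps, lr at halfway = {halfway:.2e}")
```

Under these assumptions the first phase is 37,500 steps and the second 12,500; a different reading of the batch size or a warmup period would change those numbers.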
+ ## Performance Evaluation Results
+
  Our evaluation is based on the lm-evaluation-harness and opencompass frameworks. The evaluation details are listed as follows:
+
  - Hugging Face LLM Leaderboard tasks.
  - Commonsense: We report the average of PIQA, SIQA, ARC (easy and challenge), and CommonsenseQA.
  - Other Popular Benchmarks: We report the average accuracies on Big Bench Hard (BBH) (3-shot), HumanEval, MBPP, and MATH.

+ | | MMLU | Winogrande | TruthfulQA | Hellaswag | GSM8K | Arc-C | HumanEval | BBH | Average |
+ | ------- | ------ | ---------- | ---------- | --------- | ------ | ------ | --------- | ---- | ------- |
+ | Ours | 0.6389 | 0.7593 | 0.4406 | 0.8217 | 0.5315 | 0.6195 | 0.256 | | |
+ | Mistral | 0.6265 | 0.7924 | 0.4262 | 0.8332 | 0.4018 | 0.6143 | 0.2621 | | |
+
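For readers who want to compare the two rows directly, the following sketch computes per-benchmark differences from the seven filled-in columns of the table above. The BBH and Average columns are empty in the source, so they are left out; the mean printed here covers only those seven benchmarks and is not the table's Average column. The names `SCORES` and `deltas` are hypothetical.

```python
# (ours, mistral) pairs copied from the filled-in columns of the table above.
SCORES = {
    "MMLU": (0.6389, 0.6265),
    "Winogrande": (0.7593, 0.7924),
    "TruthfulQA": (0.4406, 0.4262),
    "Hellaswag": (0.8217, 0.8332),
    "GSM8K": (0.5315, 0.4018),
    "Arc-C": (0.6195, 0.6143),
    "HumanEval": (0.256, 0.2621),
}


def deltas(scores):
    """Per-benchmark difference (ours minus Mistral), rounded to 4 decimals."""
    return {name: round(ours - mistral, 4) for name, (ours, mistral) in scores.items()}


diffs = deltas(SCORES)
ours_higher = [name for name, d in diffs.items() if d > 0]
mean_ours = sum(ours for ours, _ in SCORES.values()) / len(SCORES)
mean_mistral = sum(mistral for _, mistral in SCORES.values()) / len(SCORES)

for name, d in diffs.items():
    print(f"{name:<11} {d:+.4f}")
print(f"higher on {len(ours_higher)} of {len(SCORES)} reported benchmarks")
print(f"mean over reported columns: ours {mean_ours:.4f}, Mistral {mean_mistral:.4f}")
```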
+ ## Speed Evaluation Results
+
+ We use [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework that leverages activation sparsity. Here we show the inference speed compared with llama.cpp/transformers.
+
+ ## Limitations & Disclaimer

+ - Bamboo has been trained on only 200B tokens, so it may still show performance gaps on certain tasks.
+ - Bamboo has been trained only on English-language datasets, so its capabilities in other languages are limited.

+ - The model may produce unexpected outputs due to its size and probabilistic generation paradigm.

+ ## License

+ The code is licensed under Apache-2.0, while the model weights are fully open for academic research and also allow **free** commercial usage.