---
license: other
license_name: falcon-mamba-license
license_link: https://falconllm.tii.ae/falcon-mamba-7b-terms-and-conditions.html
base_model: tiiuae/falcon-mamba-7b
language:
- en
datasets:
- tiiuae/falcon-refinedweb
---

<img src="https://huggingface.co/datasets/tiiuae/documentation-images/resolve/main/falcon_mamba/thumbnail.png" alt="drawing" width="800"/>

**GGUF quantization of [`falcon-mamba-7b`](https://huggingface.co/tiiuae/falcon-mamba-7b) in `BF16` format**

# Table of Contents

0. [TL;DR](#tldr)
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Training Details](#training-details)
4. [Evaluation](#evaluation)

# TL;DR

Falcon-Mamba-7B is a 7B-parameter, causal decoder-only model based on the Mamba architecture, developed by [TII](https://www.tii.ae) and trained mainly on RefinedWeb data. This repository provides the model weights converted to GGUF in `BF16` precision, ready to be used with `llama.cpp`.

# Model Details

## Model Description

- **Developed by:** [https://www.tii.ae](https://www.tii.ae)
- **Model type:** Causal decoder-only
- **Architecture:** Mamba
- **Language(s) (NLP):** Mainly English
- **License:** TII Falcon-Mamba License 2.0

<br>

# Usage

Refer to the documentation of [`llama.cpp`](https://github.com/ggerganov/llama.cpp) to understand how to run this model locally on your machine.

Download the GGUF weights with the command below:

```bash
huggingface-cli download tiiuae/falcon-mamba-7b-BF16-GGUF --include falcon-mamba-BF16.gguf --local-dir ./
```

Once downloaded, you can quickly chat with it:

```bash
./llama-cli -m falcon-mamba-BF16.gguf -p "Hello how are you?"
```

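If you prefer an HTTP endpoint over the interactive CLI, `llama.cpp` also ships a server binary. The snippet below is a minimal sketch; it assumes `llama-server` was built alongside `llama-cli` and that port 8080 is free on your machine:

```bash
# Start the llama.cpp server with the downloaded GGUF file (assumes ./llama-server was built)
./llama-server -m falcon-mamba-BF16.gguf --port 8080 &

# Query the server's native completion endpoint with curl
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello how are you?", "n_predict": 128}'
```

Since this is a base (non-instruct) model, plain text completion as above is generally a better fit than a chat-style endpoint.
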
# Training Details

## Training Data

Falcon-Mamba was trained on ~5,500 GT (gigatokens) of data, mainly coming from [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large-scale, filtered and deduplicated web-only dataset.
Similar to the other models of the [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite, Falcon-Mamba was trained with a multi-stage strategy that increases the context length from 2,048 to 8,192 tokens.
Moreover, inspired by the concept of curriculum learning, we carefully selected data mixtures throughout the training stages, considering both data diversity and complexity.
Note that the context length is not relevant at inference time, as the Mamba architecture has no hard limit on long-range dependencies.
In the last training stage, a small portion of high-quality curated data was used to further enhance performance.

Overall, the data sources included RefinedWeb-English, high-quality technical data, code data, and math data extracted from public sources.
In particular, we used samples from [Fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) during our last training stage.

The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.

## Training Procedure

Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.

### Training Hyperparameters

| **Hyperparameter** | **Value** | **Comment** |
|--------------------|------------|-------------------------------------------|
| Precision | `bfloat16` | |
| Optimizer | AdamW | |
| Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
| Weight decay | 1e-1 | |
| Batch size | 2048 | |

The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size ramp-up from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training.
In the stable phase we used a maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), and then decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT.
Also, we applied *BatchScaling* during the ramp-up, rescaling the learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.

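As a worked example of the batch scaling rule (the intermediate numbers below are derived from the values above rather than quoted from a training report): keeping \\(T_{\mathrm{noise}}\\) constant implies \\(\eta \propto \sqrt{b}\\), i.e.

\\[
\eta(b) = \eta_{\mathrm{max}}\sqrt{\frac{b}{b_{\mathrm{max}}}}, \qquad
\eta(128) = 6.4 \times 10^{-4}\cdot\sqrt{\tfrac{128}{2048}} = 1.6 \times 10^{-4}, \qquad
\eta_{\mathrm{min}} = \frac{6.4 \times 10^{-4}}{256} = 2.5 \times 10^{-6}.
\\]
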
### Speeds, Sizes, Times

The model training took roughly two months.

<br>

# Evaluation

## Benchmarks

We evaluated our model on all benchmarks of the new version of the leaderboard using the `lm-evaluation-harness` package, and then normalized the evaluation results with Hugging Face score normalization.

| `model name` | `IFEval` | `BBH` | `MATH Lvl 5` | `GPQA` | `MUSR` | `MMLU-PRO` | `Average` |
|:--------------------------|:------:|:-----:|:---------:|:-----:|:-----:|:--------:|:-------:|
| ***Pure SSM models*** | | | | | | | |
| `FalconMamba-7B` | 33.36 | 19.88 | 3.63 | 8.05 | 10.86 | 14.47 | **15.04** |
| `TRI-ML/mamba-7b-rw`<sup>*</sup> | 22.46 | 6.71 | 0.45 | 1.12 | 5.51 | 1.69 | 6.25 |
| ***Hybrid SSM-attention models*** | | | | | | | |
| `recurrentgemma-9b` | 30.76 | 14.80 | 4.83 | 4.70 | 6.60 | 17.88 | 13.20 |
| `Zyphra/Zamba-7B-v1`<sup>*</sup> | 24.06 | 21.12 | 3.32 | 3.03 | 7.74 | 16.02 | 12.55 |
| ***Transformer models*** | | | | | | | |
| `Falcon2-11B` | 32.61 | 21.94 | 2.34 | 2.80 | 7.53 | 15.44 | 13.78 |
| `Meta-Llama-3-8B` | 14.55 | 24.50 | 3.25 | 7.38 | 6.24 | 24.55 | 13.41 |
| `Meta-Llama-3.1-8B` | 12.70 | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 13.78 |
| `Mistral-7B-v0.1` | 23.86 | 22.02 | 2.49 | 5.59 | 10.68 | 22.36 | 14.50 |
| `Mistral-Nemo-Base-2407 (12B)` | 16.83 | 29.37 | 4.98 | 5.82 | 6.52 | 27.46 | 15.08 |
| `gemma-7B` | 26.59 | 21.12 | 6.42 | 4.92 | 10.98 | 21.64 | **15.28** |

We also evaluated our model on the benchmarks of the first version of the leaderboard using `lighteval`.

| `model name` | `ARC` | `HellaSwag` | `MMLU` | `Winogrande` | `TruthfulQA` | `GSM8K` | `Average` |
|:-----------------------------|:------:|:---------:|:-----:|:----------:|:----------:|:-----:|:----------------:|
| ***Pure SSM models*** | | | | | | | |
| `FalconMamba-7B`<sup>*</sup> | 62.03 | 80.82 | 62.11 | 73.64 | 53.42 | 52.54 | **64.09** |
| `TRI-ML/mamba-7b-rw`<sup>*</sup> | 51.25 | 80.85 | 33.41 | 71.11 | 32.08 | 4.70 | 45.52 |
| ***Hybrid SSM-attention models*** | | | | | | | |
| `recurrentgemma-9b`<sup>**</sup> | 52.00 | 80.40 | 60.50 | 73.60 | 38.60 | 42.60 | 57.95 |
| `Zyphra/Zamba-7B-v1`<sup>*</sup> | 56.14 | 82.23 | 58.11 | 79.87 | 52.88 | 30.78 | 60.00 |
| ***Transformer models*** | | | | | | | |
| `Falcon2-11B` | 59.73 | 82.91 | 58.37 | 78.30 | 52.56 | 53.83 | **64.28** |
| `Meta-Llama-3-8B` | 60.24 | 82.23 | 66.70 | 78.45 | 42.93 | 45.19 | 62.62 |
| `Meta-Llama-3.1-8B` | 58.53 | 82.13 | 66.43 | 74.35 | 44.29 | 47.92 | 62.28 |
| `Mistral-7B-v0.1` | 59.98 | 83.31 | 64.16 | 78.37 | 42.15 | 37.83 | 60.97 |
| `gemma-7B` | 61.09 | 82.20 | 64.56 | 79.01 | 44.79 | 50.87 | 63.75 |

Most evaluation results were taken directly from both leaderboards. For the models marked with a single *star* we ran the tasks internally, while for the models marked with two *stars* the results were taken from the corresponding paper or model card.

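For readers who want to reproduce numbers like those in the first table, a minimal sketch of an `lm-evaluation-harness` invocation is shown below. The `leaderboard` task group name and the exact flags are assumptions based on recent harness releases, the run targets the original (non-GGUF) checkpoint, and the raw scores still need the Hugging Face score normalization applied afterwards:

```bash
# Install the evaluation harness (a recent version with the Open LLM Leaderboard v2 task group is assumed)
pip install lm-eval

# Run the leaderboard task group on the original checkpoint; adjust the batch size to your hardware
lm_eval --model hf \
  --model_args pretrained=tiiuae/falcon-mamba-7b,dtype=bfloat16 \
  --tasks leaderboard \
  --batch_size 8 \
  --output_path ./results
```
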
<br>

# Technical Specifications

## Model Architecture and Objective

Falcon-Mamba-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predicting the next token).

The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.org/abs/2312.00752)).

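To give context for the `d_state` entry in the table below, the selective state-space recurrence at the heart of a Mamba block (this is the generic formulation from Gu et al., 2023, not a Falcon-Mamba-specific detail) processes the sequence as

\\[
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t ,
\\]

where the per-channel hidden state \\(h_t\\) has the fixed dimension `d_state` (16 here), so the memory needed at inference time does not grow with the context length.
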
| **Hyperparameter** | **Value** | **Comment** |
|--------------------|-----------|----------------------------------------|
| Layers | 64 | Number of layers |
| `d_model` | 4096 | Hidden dimension |
| `d_state` | 16 | The SSM state dimension |
| Vocabulary | 65024 | Vocabulary size |
| Sequence length | 8192 | During the last training stages |

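These values can also be cross-checked against the metadata embedded in the GGUF file itself. A hedged sketch using the `gguf-dump` script from the `gguf` Python package maintained in the llama.cpp repository (the exact metadata key names, e.g. `block_count` and `embedding_length`, are assumptions about how llama.cpp labels them for this architecture):

```bash
# Install the GGUF tooling from the llama.cpp project
pip install gguf

# Print the key/value metadata of the downloaded file and filter for
# architecture-related entries (key names may differ between versions)
gguf-dump falcon-mamba-BF16.gguf | grep -iE "block_count|embedding_length|vocab"
```
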
## Compute Infrastructure

### Hardware

Falcon-Mamba-7B was trained on AWS SageMaker, using on average 256 H100 80GB GPUs in 32 p5 instances.

### Software

Falcon-Mamba-7B was trained with an internal distributed training codebase, Gigatron, which uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels.

<br>

# Citation

*Paper coming soon* 😊.

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/tiiuae__falcon-mamba-7b-details).

| Metric | Value |
|-------------------|----:|
| Avg. | 15.04 |
| IFEval (0-shot) | 33.36 |
| BBH (3-shot) | 19.88 |
| MATH Lvl 5 (4-shot) | 3.63 |
| GPQA (0-shot) | 8.05 |
| MuSR (0-shot) | 10.86 |
| MMLU-PRO (5-shot) | 14.47 |