Correct blogpost link
README.md
CHANGED
@@ -54,7 +54,7 @@ for seq in sequences:
|
|
54 |
|
55 |
```
|
56 |
|
57 |
-
For fast inference with Falcon, check-out [Text Generation Inference](https://github.com/huggingface/text-generation-inference)! Read more in this [
|
58 |
|
59 |
You will need **at least 85-100GB of memory** to swiftly run inference with Falcon-40B.
|
60 |
|
@@ -153,11 +153,11 @@ Falcon-40B is a causal decoder-only model trained on a causal language modeling
|
|
153 |
|
154 |
The architecture is broadly adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), with the following differences:
|
155 |
|
156 |
-
* **
|
157 |
* **Attention:** multiquery ([Shazeer et al., 2019](https://arxiv.org/abs/1911.02150)) and FlashAttention ([Dao et al., 2022](https://arxiv.org/abs/2205.14135));
|
158 |
* **Decoder-block:** parallel attention/MLP with a single layer norm.
|
159 |
|
160 |
-
For multiquery, we are using an internal variant
|
161 |
|
162 |
| **Hyperparameter** | **Value** | **Comment** |
|
163 |
|--------------------|-----------|----------------------------------------|
|
@@ -175,7 +175,7 @@ Falcon-40B-Instruct was trained on AWS SageMaker, on 64 A100 40GB GPUs in P4d in
|
|
175 |
|
176 |
#### Software
|
177 |
|
178 |
-
Falcon-40B-Instruct was trained a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.)
|
179 |
|
180 |
|
181 |
## Citation
|
@@ -184,7 +184,7 @@ Falcon-40B-Instruct was trained a custom distributed training codebase, Gigatron
|
|
184 |
```
|
185 |
@article{falcon40b,
|
186 |
title={{Falcon-40B}: an open large language model with state-of-the-art performance},
|
187 |
-
author={Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme},
|
188 |
year={2023}
|
189 |
}
|
190 |
```
|
|
|
54 |
|
55 |
```
|
56 |
|
57 |
+
For fast inference with Falcon, check-out [Text Generation Inference](https://github.com/huggingface/text-generation-inference)! Read more in this [blog post](https://huggingface.co/blog/falcon).
|
58 |
|
59 |
You will need **at least 85-100GB of memory** to swiftly run inference with Falcon-40B.
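
As a rough sanity check on that figure, the weight memory can be estimated from the parameter count alone. This is a back-of-the-envelope sketch: the nominal count of 40 billion parameters and the helper name `weight_memory_gb` are assumptions for illustration, not values taken from the model files.

```python
# Rough weight-memory estimate for a 40B-parameter model.
# Assumption: "40B" is taken at face value as 40e9 parameters.
PARAMS = 40e9
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(PARAMS, dtype):.0f} GB")
```

In bfloat16 the weights alone come to about 80 GB, so the stated 85-100GB is consistent with the weights plus some headroom for activations, the key/value cache, and framework overhead.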
|
60 |
|
|
|
153 |
|
154 |
The architecture is broadly adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), with the following differences:
|
155 |
|
156 |
+
* **Positional embeddings:** rotary ([Su et al., 2021](https://arxiv.org/abs/2104.09864));
|
157 |
* **Attention:** multiquery ([Shazeer et al., 2019](https://arxiv.org/abs/1911.02150)) and FlashAttention ([Dao et al., 2022](https://arxiv.org/abs/2205.14135));
|
158 |
* **Decoder-block:** parallel attention/MLP with a single layer norm.
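
The decoder-block item can be illustrated with a minimal NumPy sketch. It only shows the data flow of a parallel attention/MLP block with one shared layer norm, next to a GPT-style sequential pre-norm block for comparison. The function names are hypothetical, `attn` and `mlp` are placeholder callables, and the layer norm omits its learned scale and bias; this is not the Falcon implementation.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the last axis (learned scale and bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def parallel_block(x, attn, mlp):
    """Parallel form: y = x + attn(LN(x)) + mlp(LN(x)), one shared norm."""
    h = layer_norm(x)
    return x + attn(h) + mlp(h)


def sequential_block(x, attn, mlp):
    """Sequential pre-norm form: the MLP sees the attention output."""
    x = x + attn(layer_norm(x))
    return x + mlp(layer_norm(x))
```

In the parallel form the attention and MLP branches read the same normalized input, so they do not depend on each other and can be computed concurrently.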
|
159 |
|
160 |
+
For multiquery, we are using an internal variant that uses independent keys and values per tensor parallel degree.
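
To make the multiquery idea concrete, here is a minimal NumPy sketch of causal attention in which several query heads share one key/value head. With a single key/value head it is plain multiquery attention. Using one key/value head per tensor-parallel rank is one reading of the "internal variant" described above and is an assumption, as are the function name and tensor shapes.

```python
import numpy as np


def shared_kv_attention(q, k, v):
    """Causal attention where groups of query heads share a K/V head.

    q: (n_heads, seq, d_head); k, v: (n_kv, seq, d_head).
    n_kv == 1 is plain multiquery; n_kv > 1 gives each group of
    query heads its own keys and values.
    """
    n_heads, seq, d_head = q.shape
    n_kv = k.shape[0]
    assert n_heads % n_kv == 0, "query heads must split evenly over K/V heads"
    group = n_heads // n_kv
    # Repeat each K/V head so it lines up with the query heads in its group.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # 2 K/V heads, e.g. tensor-parallel degree 2
v = rng.standard_normal((2, 4, 16))
out = shared_kv_attention(q, k, v)
print(out.shape)
```

Because keys and values are stored once per group rather than once per query head, the key/value cache at inference time shrinks by the group size (a factor of 4 in this toy setup).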
|
161 |
|
162 |
| **Hyperparameter** | **Value** | **Comment** |
|
163 |
|--------------------|-----------|----------------------------------------|
|
|
|
175 |
|
176 |
#### Software
|
177 |
|
178 |
+
Falcon-40B-Instruct was trained in a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.)
|
179 |
|
180 |
|
181 |
## Citation
|
|
|
184 |
```
|
185 |
@article{falcon40b,
|
186 |
title={{Falcon-40B}: an open large language model with state-of-the-art performance},
|
187 |
+
author={Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz, and Cappelli, Alessandro and Cojocaru, Ruxandra, and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme},
|
188 |
year={2023}
|
189 |
}
|
190 |
```
|