Text Generation
Transformers
Safetensors
English
llama
Llama-3-6B
6B
text-generation-inference
Inference Endpoints
prince-canuma committed on
Commit fc7485b
1 Parent(s): 0776d55

Update README.md

Files changed (1)
  1. README.md +130 -92
README.md CHANGED
@@ -5,10 +5,14 @@ license: llama3
  library_name: transformers
  datasets:
  - prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
  ---

  # Model Summary
- <img src="llama-3-6B icon.jpeg" width="500" alt="Llama-3-6B"/>

  Introducing the world's first Llama-3 base model with 6B parameters. This model is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was created from Meta-Llama-3-8B using a technique called [downcycling](https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=9hcOol4KHIgWThgt).
  The model was continually pretrained on 1 billion tokens of English-only text from FineWeb, achieving impressive results on the evaluation set:
@@ -24,16 +28,15 @@ This is the model card of a 🤗 transformers model that has been pushed on the
  - **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma)
  - **Sponsored by:** General
  - **Model type:** Llama
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** MIT
  - **Pretrained from model:** prince-canuma/Llama-3-6B-v0

- ### Model Sources [optional]

  <!-- Provide the basic links for the model. -->

  - **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3
- - **Video [optional]:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr

  ## Uses
 
@@ -83,25 +86,22 @@ Python 2 and Python 3 are two different versions of the Python language. Python

  ### Downcycling

- Downcycling is a technique that allows you to create new LLMs of diverse sizes from checkpoints of large pretrained models.
- You take a reference model (i.e., Llama-3-8B) and copy the weights of 24 of its 32 layers, alongside the embedding and prediction heads. Then you initialize a smaller target model with 24 layers and load those pretrained weights.
- This new model will most likely still produce legible outputs, but for it to perform well you need to continue the pretraining.

  ### Training Data

  For continued pretraining, I extracted 1B tokens from the [Hugging Face FineWeb CC-Main-2024-10](https://huggingface.co/datasets/HuggingFaceFW/fineweb#breakdown-by-dumpcrawl) slice.

- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-

  #### Training hyperparameters

@@ -120,81 +120,6 @@ The following hyperparameters were used during training:
  - lr_scheduler_warmup_steps: 100
  - num_epochs: 2

- ### Training results
-
- | Training Loss | Epoch | Step  | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 7.1562        | 0.0   | 1     | 7.1806          |
- | 2.7339        | 0.25  | 5867  | 2.6266          |
- | 2.6905        | 0.5   | 11734 | 2.5872          |
- | 2.6134        | 0.75  | 17601 | 2.5549          |
- | 2.532         | 1.0   | 23468 | 2.5235          |
- | 2.5319        | 1.25  | 29335 | 2.5067          |
- | 2.3336        | 1.5   | 35202 | 2.4968          |
- | 2.3486        | 1.75  | 41069 | 2.4942          |
-
- ### Framework versions
-
- - PEFT 0.10.0
- - Transformers 4.40.0.dev0
- - Pytorch 2.2.0+cu121
- - Datasets 2.15.0
- - Tokenizers 0.15.0
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- ```bibtex
- @misc{prince2024downcycling,
-   title={Efficient LLM Downcycling: Generating Diverse Model Sizes from Pretrained Giants},
-   author={Prince Canuma},
-   year={2024},
- }
- ```
-
  [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
  <details><summary>See axolotl config</summary>

@@ -273,4 +198,117 @@ special_tokens:

  ```

- </details><br>

  library_name: transformers
  datasets:
  - prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
+ - HuggingFaceFW/fineweb
+ tags:
+ - Llama-3-6B
+ - 6B
  ---

  # Model Summary
+ <img src="images/llama-3-6B icon.jpeg" width="500" alt="Llama-3-6B"/>

  Introducing the world's first Llama-3 base model with 6B parameters. This model is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was created from Meta-Llama-3-8B using a technique called [downcycling](https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=9hcOol4KHIgWThgt).
  The model was continually pretrained on 1 billion tokens of English-only text from FineWeb, achieving impressive results on the evaluation set:
 
  - **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma)
  - **Sponsored by:** General
  - **Model type:** Llama
+ - **License:** [Llama-3](https://llama.meta.com/llama3/license)
  - **Pretrained from model:** prince-canuma/Llama-3-6B-v0

+ ### Model Sources

  <!-- Provide the basic links for the model. -->

  - **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3
+ - **Video:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr

  ## Uses
 
 

  ### Downcycling

+ <img src="images/downcycling.jpeg" width="500" alt="Llama-3-8B-vs-6B-v0"/>
+ Fig 1. Downcycling workflow, as also described in [arxiv.org/abs/2404.08634](https://arxiv.org/abs/2404.08634).

+ Downcycling is a technique that allows you to create new LLMs of diverse sizes from checkpoints of large pretrained models.
+ You take a reference model (i.e., Llama-3-8B) and copy the weights of 24 of its 32 layers, alongside the embedding and prediction heads.
+ Then you initialize a smaller target model with 24 layers and load those pretrained weights.

+ This new model will most likely still produce legible outputs, but for it to perform well you need to continue the pretraining.
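+
+ For illustration, a minimal sketch of this layer-copying step with 🤗 transformers could look like the snippet below (the card does not state which 24 layers were kept, so copying the first 24 is an assumption, and the output path is hypothetical):
+
+ ```python
+ # Hedged sketch of downcycling: copy 24 of Llama-3-8B's 32 layers, plus the
+ # embeddings and LM head, into a freshly initialized 24-layer model.
+ import torch
+ from transformers import AutoConfig, AutoModelForCausalLM
+
+ ref_id = "meta-llama/Meta-Llama-3-8B"
+ reference = AutoModelForCausalLM.from_pretrained(ref_id, torch_dtype=torch.bfloat16)
+
+ config = AutoConfig.from_pretrained(ref_id)
+ config.num_hidden_layers = 24  # smaller target model
+ target = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)
+
+ kept = {}
+ for name, tensor in reference.state_dict().items():
+     if name.startswith("model.layers."):
+         if int(name.split(".")[2]) < 24:  # assumption: keep the first 24 layers
+             kept[name] = tensor
+     else:
+         kept[name] = tensor  # embeddings, final norm, lm_head
+
+ target.load_state_dict(kept, strict=False)
+ target.save_pretrained("Llama-3-6B-v0-init")  # hypothetical output path
+ ```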

+ <img src="images/Llama-3-8B-vs-6B-v0.png" width="500" alt="Llama-3-8B-vs-6B-v0"/>
+ Fig 2. Downcycled model vs. reference model, without continued pretraining.

  ### Training Data

  For continued pretraining, I extracted 1B tokens from the [Hugging Face FineWeb CC-Main-2024-10](https://huggingface.co/datasets/HuggingFaceFW/fineweb#breakdown-by-dumpcrawl) slice.
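+
+ The extraction script itself is not included in this card. As a rough sketch under that caveat, streaming the dump and keeping documents until the 1B-token budget is reached could look like this (the config and field names are the standard FineWeb ones; the tokenizer choice is an assumption):
+
+ ```python
+ # Hedged sketch: stream FineWeb's CC-MAIN-2024-10 dump and stop once roughly
+ # 1B Llama-3 tokens have been collected. In practice you would write shards
+ # to disk rather than keep everything in memory.
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("prince-canuma/Llama-3-6B-v0")
+ stream = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
+                       split="train", streaming=True)
+
+ budget, total, texts = 1_000_000_000, 0, []
+ for sample in stream:
+     total += len(tokenizer(sample["text"]).input_ids)
+     texts.append(sample["text"])
+     if total >= budget:
+         break
+ print(f"collected {len(texts)} documents, ~{total / 1e9:.2f}B tokens")
+ ```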

  #### Training hyperparameters

 
  - lr_scheduler_warmup_steps: 100
  - num_epochs: 2

  [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
  <details><summary>See axolotl config</summary>

  ```

+ </details><br>
+
+ ### Training results
+
+ There were 3 distinct experiments. In these experiments, QLoRA was used instead of full fine-tuning due to budget constraints; a rough sketch of the setup follows the list below.
+ - v0: This was a test run for 1K steps to check whether the model would improve with the QLoRA parameters.
+ - v1: Here the QLoRA parameters were tweaked (rank and alpha).
+ - v2: This was the main experiment, run for 2 epochs on 1B tokens from FineWeb.
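+
+ As a hedged illustration of what such a QLoRA setup involves (a 4-bit quantized base model with LoRA adapters trained on top), the sketch below uses 🤗 peft; the rank/alpha values and target modules are placeholders, and the exact settings per experiment are in the axolotl config above:
+
+ ```python
+ # Hedged QLoRA sketch: quantize the base model to 4-bit and train only LoRA
+ # adapters on top of it. Requires bitsandbytes and a CUDA GPU.
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model
+
+ bnb = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ base = AutoModelForCausalLM.from_pretrained(
+     "prince-canuma/Llama-3-6B-v0", quantization_config=bnb
+ )
+
+ lora = LoraConfig(
+     r=128,           # rank (tweaked between v0 and v1; value here is illustrative)
+     lora_alpha=128,  # alpha (likewise illustrative, not confirmed)
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(base, lora)
+ model.print_trainable_parameters()  # only the adapter weights are trainable
+ ```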
+
+ All details can be found on my Wandb dashboard: https://wandb.ai/prince-canuma/llama-3-6b?nw=nwuserprincecanuma
+
+ <img src="images/Training Loss.png" width="500" alt="Training loss"/>
+ Fig 3. Experiment training loss charts on wandb.
+
+ Overall metrics:
+
+ | Training Loss | Epoch | Step  | Validation Loss |
+ |:-------------:|:-----:|:-----:|:---------------:|
+ | 7.1562        | 0.0   | 1     | 7.1806          |
+ | 2.7339        | 0.25  | 5867  | 2.6266          |
+ | 2.6905        | 0.5   | 11734 | 2.5872          |
+ | 2.6134        | 0.75  | 17601 | 2.5549          |
+ | 2.532         | 1.0   | 23468 | 2.5235          |
+ | 2.5319        | 1.25  | 29335 | 2.5067          |
+ | 2.3336        | 1.5   | 35202 | 2.4968          |
+ | 2.3486        | 1.75  | 41069 | 2.4942          |
+
+ ### Framework versions
+
+ - PEFT 0.10.0
+ - Transformers 4.40.0.dev0
+ - Pytorch 2.2.0+cu121
+ - Datasets 2.15.0
+ - Tokenizers 0.15.0
+
+ ### Hardware
+
+ - 4x RTX 6000 using JarvisLabs
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ #### Benchmarks
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ - **Hellaswag**: a dataset for studying grounded commonsense inference.
+ - **ARC**: a multiple-choice question-answering dataset with questions from science exams from grade 3 to grade 9.
+ - **MMLU**: a test with 57 tasks to measure a text model's multitask accuracy.
+ - **TruthfulQA**: a test to measure a model's propensity to reproduce falsehoods commonly found online.
+ - **Winogrande**: for commonsense reasoning.
+ - **GSM8k**: diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
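+
+ The card does not state which evaluation harness produced the scores below. Assuming the commonly used EleutherAI lm-evaluation-harness, a run over these six benchmarks could look roughly like this (the model id and task names are assumptions, not confirmed settings):
+
+ ```python
+ # Hedged sketch: evaluate the model on the six benchmarks above with
+ # lm-evaluation-harness (pip install lm-eval).
+ import lm_eval
+
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=prince-canuma/Llama-3-6B,dtype=bfloat16",
+     tasks=["hellaswag", "arc_challenge", "mmlu", "truthfulqa_mc2",
+            "winogrande", "gsm8k"],
+     batch_size=8,
+ )
+ for task, metrics in results["results"].items():
+     print(task, metrics)
+ ```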
+
+ ### Results
+
+ <img src="images/comparison_model_scores_histogram.png" width="500" alt="Llama-3-8B-vs-6B-v0"/>
+ Fig 4. Performance comparison of Llama-3-8B, Llama-3-6B and Llama-3-6B (w/ continued pretraining).
+
+ Pretraining for 2 epochs on 1B tokens had a positive effect across the board. The new base model now performs competitively with its reference model (Llama-3-8B) whilst being 1.3x smaller.
+
+ <img src="images/Comparision_of_Model_Scores.png" width="500" alt="All-vs-Llama-3-6B-v0"/>
+ Fig 5. Performance comparison of Llama-3-8B, Llama-2-13B, Yi-1.5-6B and Llama-3-6B.
+
+ Llama-3-6B is competitive with models in its category, and with models up to 2x its size, across 6 diverse benchmarks.
+
+ #### Summary
+
+ ## Citation
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{prince2024downcycling,
+   title={Efficient LLM Downcycling: Generating Diverse Model Sizes from Pretrained Giants},
+   author={Prince Canuma},
+   year={2024},
+ }
+ ```
+
+ ## References
+
+ ```bibtex
+ @misc{komatsuzaki2023sparse,
+   title={Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints},
+   author={Aran Komatsuzaki and Joan Puigcerver and James Lee-Thorp and Carlos Riquelme Ruiz and Basil Mustafa and Joshua Ainslie and Yi Tay and Mostafa Dehghani and Neil Houlsby},
+   year={2023},
+   eprint={2212.05055},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG}
+ }
+ ```
+
+ ```bibtex
+ @misc{sanyal2024pretraining,
+   title={Pre-training Small Base LMs with Fewer Tokens},
+   author={Sunny Sanyal and Sujay Sanghavi and Alexandros G. Dimakis},
+   year={2024},
+   eprint={2404.08634},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```