Text Generation
Transformers
Safetensors
English
llama
Llama-3-6B
6B
text-generation-inference
Inference Endpoints
prince-canuma committed on
Commit fc7485b
1 Parent(s): 0776d55

Update README.md

Files changed (1)
  1. README.md +130 -92
README.md CHANGED
@@ -5,10 +5,14 @@ license: llama3
  library_name: transformers
  datasets:
  - prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
  ---

  # Model Summary
- <img src="llama-3-6B icon.jpeg" width="500" alt="Llama-3-6B"/>

  Introducing the world's first Llama-3 base model with 6B parameters. This model is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was created from Meta-Llama-3-8B using a technique called [downcycling](https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=9hcOol4KHIgWThgt).
  The model was continually pretrained on 1 billion tokens of English-only text from FineWeb, achieving impressive results on the evaluation set:
@@ -24,16 +28,15 @@ This is the model card of a 🤗 transformers model that has been pushed on the
  - **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma)
  - **Sponsored by:** General
  - **Model type:** Llama
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** MIT
  - **Pretrained from model:** prince-canuma/Llama-3-6B-v0

- ### Model Sources [optional]

  <!-- Provide the basic links for the model. -->

  - **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3
- - **Video [optional]:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr

  ## Uses
 
@@ -83,25 +86,22 @@ Python 2 and Python 3 are two different versions of the Python language. Python

  ### Downcycling

- Downcycling is a technique that allows you to create new LLMs of diverse sizes from checkpoints of large pretrained models.
- You take a reference model (i.e., Llama-3-8B) and copy the weights of 24 of its 32 layers, alongside the embedding and prediction heads. Then you initialize a smaller target model with 24 layers and load those pretrained weights.
- This new model will most likely still produce legible outputs, but for it to perform well you need to continue the pretraining.

  ### Training Data

  For continued pretraining, I extracted 1B tokens from the [Hugging Face FineWeb CC-Main-2024-10](https://huggingface.co/datasets/HuggingFaceFW/fineweb#breakdown-by-dumpcrawl) slice.

- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-

  #### Training hyperparameters

@@ -120,81 +120,6 @@ The following hyperparameters were used during training:
  - lr_scheduler_warmup_steps: 100
  - num_epochs: 2

- ### Training results
-
- | Training Loss | Epoch | Step  | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 7.1562        | 0.0   | 1     | 7.1806          |
- | 2.7339        | 0.25  | 5867  | 2.6266          |
- | 2.6905        | 0.5   | 11734 | 2.5872          |
- | 2.6134        | 0.75  | 17601 | 2.5549          |
- | 2.532         | 1.0   | 23468 | 2.5235          |
- | 2.5319        | 1.25  | 29335 | 2.5067          |
- | 2.3336        | 1.5   | 35202 | 2.4968          |
- | 2.3486        | 1.75  | 41069 | 2.4942          |
-
- ### Framework versions
-
- - PEFT 0.10.0
- - Transformers 4.40.0.dev0
- - Pytorch 2.2.0+cu121
- - Datasets 2.15.0
- - Tokenizers 0.15.0
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- ```bibtex
- @misc{prince2024downcycling,
-   title={Efficient LLM Downcycling: Generating Diverse Model Sizes from Pretrained Giants},
-   author={Prince Canuma},
-   year={2024},
- }
- ```
-
  [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
  <details><summary>See axolotl config</summary>

@@ -273,4 +198,117 @@ special_tokens:

  ```

- </details><br>

  library_name: transformers
  datasets:
  - prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
+ - HuggingFaceFW/fineweb
+ tags:
+ - Llama-3-6B
+ - 6B
  ---

  # Model Summary
+ <img src="images/llama-3-6B icon.jpeg" width="500" alt="Llama-3-6B"/>

  Introducing the world's first Llama-3 base model with 6B parameters. This model is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was created from Meta-Llama-3-8B using a technique called [downcycling](https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=9hcOol4KHIgWThgt).
  The model was continually pretrained on 1 billion tokens of English-only text from FineWeb, achieving impressive results on the evaluation set:
 
  - **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma)
  - **Sponsored by:** General
  - **Model type:** Llama
+ - **License:** [Llama-3](https://llama.meta.com/llama3/license)
  - **Pretrained from model:** prince-canuma/Llama-3-6B-v0

+ ### Model Sources

  <!-- Provide the basic links for the model. -->

  - **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3
+ - **Video:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr

  ## Uses
 
 

  ### Downcycling

+ <img src="images/downcycling.jpeg" width="500" alt="Llama-3-8B-vs-6B-v0"/>
+ Fig 1. Downcycling workflow, as also described in [arxiv.org/abs/2404.08634](https://arxiv.org/abs/2404.08634).

+ Downcycling is a technique that allows you to create new LLMs of diverse sizes from checkpoints of large pretrained models.
+ You take a reference model (i.e., Llama-3-8B) and copy the weights of 24 of its 32 layers, alongside the embedding and prediction heads.
+ Then you initialize a smaller target model with 24 layers and load those pretrained weights.

+ This new model will most likely still produce legible outputs, but for it to perform well you need to continue the pretraining.
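+
+ For illustration, a minimal sketch of this layer-copying step with 🤗 transformers could look like the snippet below (the card does not state which 24 layers were kept, so copying the first 24 is an assumption, and the output path is hypothetical):
+
+ ```python
+ # Hedged sketch of downcycling: copy 24 of Llama-3-8B's 32 layers, plus the
+ # embeddings and LM head, into a freshly initialized 24-layer model.
+ import torch
+ from transformers import AutoConfig, AutoModelForCausalLM
+
+ ref_id = "meta-llama/Meta-Llama-3-8B"
+ reference = AutoModelForCausalLM.from_pretrained(ref_id, torch_dtype=torch.bfloat16)
+
+ config = AutoConfig.from_pretrained(ref_id)
+ config.num_hidden_layers = 24  # smaller target model
+ target = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)
+
+ kept = {}
+ for name, tensor in reference.state_dict().items():
+     if name.startswith("model.layers."):
+         if int(name.split(".")[2]) < 24:  # assumption: keep the first 24 layers
+             kept[name] = tensor
+     else:
+         kept[name] = tensor  # embeddings, final norm, lm_head
+
+ target.load_state_dict(kept, strict=False)
+ target.save_pretrained("Llama-3-6B-v0-init")  # hypothetical output path
+ ```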

+ <img src="images/Llama-3-8B-vs-6B-v0.png" width="500" alt="Llama-3-8B-vs-6B-v0"/>
+ Fig 2. Downcycled model vs. reference model, without continued pretraining.

  ### Training Data

  For continued pretraining, I extracted 1B tokens from the [Hugging Face FineWeb CC-Main-2024-10](https://huggingface.co/datasets/HuggingFaceFW/fineweb#breakdown-by-dumpcrawl) slice.
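+
+ The extraction script itself is not included in this card. As a rough sketch under that caveat, streaming the dump and keeping documents until the 1B-token budget is reached could look like this (the config and field names are the standard FineWeb ones; the tokenizer choice is an assumption):
+
+ ```python
+ # Hedged sketch: stream FineWeb's CC-MAIN-2024-10 dump and stop once roughly
+ # 1B Llama-3 tokens have been collected. In practice you would write shards
+ # to disk rather than keep everything in memory.
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("prince-canuma/Llama-3-6B-v0")
+ stream = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
+                       split="train", streaming=True)
+
+ budget, total, texts = 1_000_000_000, 0, []
+ for sample in stream:
+     total += len(tokenizer(sample["text"]).input_ids)
+     texts.append(sample["text"])
+     if total >= budget:
+         break
+ print(f"collected {len(texts)} documents, ~{total / 1e9:.2f}B tokens")
+ ```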

  #### Training hyperparameters

 
  - lr_scheduler_warmup_steps: 100
  - num_epochs: 2

  [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
  <details><summary>See axolotl config</summary>

  ```

+ </details><br>
+
+ ### Training results
+
+ There were 3 distinct experiments. In these experiments, QLoRA was used instead of full fine-tuning due to budget constraints; a rough sketch of the setup follows the list below.
+ - v0: This was a test run for 1K steps to check whether the model would improve with the QLoRA parameters.
+ - v1: Here the QLoRA parameters were tweaked (rank and alpha).
+ - v2: This was the main experiment, run for 2 epochs on 1B tokens from FineWeb.
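+
+ As a hedged illustration of what such a QLoRA setup involves (a 4-bit quantized base model with LoRA adapters trained on top), the sketch below uses 🤗 peft; the rank/alpha values and target modules are placeholders, and the exact settings per experiment are in the axolotl config above:
+
+ ```python
+ # Hedged QLoRA sketch: quantize the base model to 4-bit and train only LoRA
+ # adapters on top of it. Requires bitsandbytes and a CUDA GPU.
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model
+
+ bnb = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ base = AutoModelForCausalLM.from_pretrained(
+     "prince-canuma/Llama-3-6B-v0", quantization_config=bnb
+ )
+
+ lora = LoraConfig(
+     r=128,           # rank (tweaked between v0 and v1; value here is illustrative)
+     lora_alpha=128,  # alpha (likewise illustrative, not confirmed)
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(base, lora)
+ model.print_trainable_parameters()  # only the adapter weights are trainable
+ ```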
+
+ All details can be found on my Wandb dashboard: https://wandb.ai/prince-canuma/llama-3-6b?nw=nwuserprincecanuma
+
+ <img src="images/Training Loss.png" width="500" alt="Training loss"/>
+ Fig 3. Experiment training loss charts on wandb.
+
+ Overall metrics:
+
+ | Training Loss | Epoch | Step  | Validation Loss |
+ |:-------------:|:-----:|:-----:|:---------------:|
+ | 7.1562        | 0.0   | 1     | 7.1806          |
+ | 2.7339        | 0.25  | 5867  | 2.6266          |
+ | 2.6905        | 0.5   | 11734 | 2.5872          |
+ | 2.6134        | 0.75  | 17601 | 2.5549          |
+ | 2.532         | 1.0   | 23468 | 2.5235          |
+ | 2.5319        | 1.25  | 29335 | 2.5067          |
+ | 2.3336        | 1.5   | 35202 | 2.4968          |
+ | 2.3486        | 1.75  | 41069 | 2.4942          |
+
+ ### Framework versions
+
+ - PEFT 0.10.0
+ - Transformers 4.40.0.dev0
+ - Pytorch 2.2.0+cu121
+ - Datasets 2.15.0
+ - Tokenizers 0.15.0
+
+ ### Hardware
+
+ - 4x RTX 6000 using JarvisLabs
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ #### Benchmarks
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ - **Hellaswag**: a dataset for studying grounded commonsense inference.
+ - **ARC**: a multiple-choice question-answering dataset with questions from science exams from grade 3 to grade 9.
+ - **MMLU**: a test with 57 tasks to measure a text model's multitask accuracy.
+ - **TruthfulQA**: a test to measure a model's propensity to reproduce falsehoods commonly found online.
+ - **Winogrande**: for commonsense reasoning.
+ - **GSM8k**: diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
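+
+ The card does not state which evaluation harness produced the scores below. Assuming the commonly used EleutherAI lm-evaluation-harness, a run over these six benchmarks could look roughly like this (the model id and task names are assumptions, not confirmed settings):
+
+ ```python
+ # Hedged sketch: evaluate the model on the six benchmarks above with
+ # lm-evaluation-harness (pip install lm-eval).
+ import lm_eval
+
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=prince-canuma/Llama-3-6B,dtype=bfloat16",
+     tasks=["hellaswag", "arc_challenge", "mmlu", "truthfulqa_mc2",
+            "winogrande", "gsm8k"],
+     batch_size=8,
+ )
+ for task, metrics in results["results"].items():
+     print(task, metrics)
+ ```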
+
+ ### Results
+
+ <img src="images/comparison_model_scores_histogram.png" width="500" alt="Llama-3-8B-vs-6B-v0"/>
+ Fig 4. Performance comparison of Llama-3-8B, Llama-3-6B and Llama-3-6B (w/ continued pretraining).
+
+ Pretraining for 2 epochs on 1B tokens had a positive effect across the board. The new base model now performs competitively with its reference model (Llama-3-8B) whilst being 1.3x smaller.
+
+ <img src="images/Comparision_of_Model_Scores.png" width="500" alt="All-vs-Llama-3-6B-v0"/>
+ Fig 5. Performance comparison of Llama-3-8B, Llama-2-13B, Yi-1.5-6B and Llama-3-6B.
+
+ Llama-3-6B is competitive with models in its category, and with models up to 2x its size, across 6 diverse benchmarks.
+
+ #### Summary
+
+ ## Citation
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{prince2024downcycling,
+   title={Efficient LLM Downcycling: Generating Diverse Model Sizes from Pretrained Giants},
+   author={Prince Canuma},
+   year={2024},
+ }
+ ```
+
+ ## References
+
+ ```bibtex
+ @misc{komatsuzaki2023sparse,
+   title={Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints},
+   author={Aran Komatsuzaki and Joan Puigcerver and James Lee-Thorp and Carlos Riquelme Ruiz and Basil Mustafa and Joshua Ainslie and Yi Tay and Mostafa Dehghani and Neil Houlsby},
+   year={2023},
+   eprint={2212.05055},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG}
+ }
+ ```
+
+ ```bibtex
+ @misc{sanyal2024pretraining,
+   title={Pre-training Small Base LMs with Fewer Tokens},
+   author={Sunny Sanyal and Sujay Sanghavi and Alexandros G. Dimakis},
+   year={2024},
+   eprint={2404.08634},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```