---
language:
- en
license: llama3
library_name: transformers
datasets:
- prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
---
|
|
|
# Model Summary |
|
<img src="llama-3-6B icon.jpeg" width="500" alt="Llama-3-6B"/> |
|
|
|
Introducing the world's first Llama-3 base model with 6B parameters. This model is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was created from Meta-Llama-3-8B using a technique called [downcycling](https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=9hcOol4KHIgWThgt).

The model was continually pretrained on 1 billion tokens of English-only text from FineWeb, reaching the following result on the evaluation set:
|
- Loss: 2.4942 |
|
|
|
|
|
|
## Model Description |
|
|
|
|
|
|
- **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma) |
|
- **Sponsored by:** General |
|
- **Model type:** Llama |
|
- **Language(s) (NLP):** English
|
- **License:** MIT |
|
- **Pretrained from model:** prince-canuma/Llama-3-6B-v0 |
|
|
|
### Model Sources
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3 |
|
- **Video:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
You can use this model as a base for instruct and chat versions covering use cases such as coding assistants, RAG, function calling, and more.
|
|
|
### Limitations |
|
|
|
This model inherits some of the base model's limitations and adds some from its creation process, such as:

- Limited scope for coding and math: According to benchmarks, this model needs more pretraining or fine-tuning on code and math data to excel at reasoning tasks.
- Language limitations: This model was continually pretrained on English-only data. If you plan to use it for multilingual use cases, I recommend fine-tuning or continued pretraining.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the model and tokenizer from the Hub
model_name = "prince-canuma/Llama-3-6B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the prompt into PyTorch tensors
inputs = tokenizer(["Who created Python?"], return_tensors="pt")

# Stream generated tokens to stdout as they are produced
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=200)
```
|
|
|
Output: |
|
```text
<|begin_of_text|>Who created Python? What is Python used for? What is the difference between Python 2 and Python 3? What is the difference between Python and Python 3?

Python is a programming language that was created by Guido van Rossum in 1991. It is a widely used language for web development, data science, and machine learning. Python is also used for creating software applications and games.

Python is a powerful language that is easy to learn and use. It has a large library of built-in functions and packages that make it easy to write code. Python is also a very popular language for web development, with many popular web frameworks such as Django and Flask being written in Python.

Python is also used for data science and machine learning. It has a large library of packages for data analysis, machine learning, and artificial intelligence. Python is also used for creating software applications and games.

Python 2 and Python 3 are two different versions of the Python language. Python 2 was the original version of the
```
|
|
|
|
|
## Training Details |
|
|
|
### Downcycling |
|
|
|
Downcycling is a technique for creating new LLMs of diverse sizes from checkpoints of large pretrained models.

You take a reference model (e.g., Llama-3-8B) and copy the weights of 24 of its 32 layers, along with the embedding and prediction heads. Then you initialize a smaller target model with 24 layers and load those pretrained weights into it.

The new model will most likely still produce legible output, but for it to perform well you need to continue pretraining.
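
The layer-copying step can be sketched in plain Python. This is a rough illustration, not the exact script used for this model. It assumes Hugging Face Llama-style parameter names (`model.layers.<i>.…`, `model.embed_tokens.weight`, `model.norm.weight`, `lm_head.weight`) and uses strings as stand-ins for tensors. Keeping the first 24 layers is purely an assumption, since the card does not say which 24 layers were kept, and `downcycle_state_dict` is a hypothetical helper name.

```python
import re

# Matches Llama-style decoder layer parameters, e.g. "model.layers.7.mlp.up_proj.weight".
LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.(.+)$")


def downcycle_state_dict(state_dict, keep_layers):
    """Return a new state dict with only `keep_layers`, renumbered from 0.

    Entries that are not decoder layers (embeddings, final norm,
    prediction head) are copied unchanged.
    """
    new_index = {old: new for new, old in enumerate(keep_layers)}
    target = {}
    for name, tensor in state_dict.items():
        match = LAYER_KEY.match(name)
        if match is None:
            target[name] = tensor  # embeddings, final norm, lm_head
            continue
        old = int(match.group(1))
        if old in new_index:
            target[f"model.layers.{new_index[old]}.{match.group(2)}"] = tensor
    return target


# Toy stand-in for a 32-layer reference checkpoint (strings instead of tensors).
reference = {
    "model.embed_tokens.weight": "emb",
    "model.norm.weight": "norm",
    "lm_head.weight": "head",
}
for i in range(32):
    reference[f"model.layers.{i}.self_attn.q_proj.weight"] = f"q{i}"

# ASSUMPTION: keep the first 24 layers; the actual selection is not documented here.
target = downcycle_state_dict(reference, keep_layers=list(range(24)))
print(len(target))  # 24 layer entries + 3 non-layer entries = 27
```

With real checkpoints, the resulting dictionary would be loaded into a model built from a config with 24 hidden layers, and pretraining would then continue from there.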
|
|
|
|
|
|
|
|
|
### Training Data |
|
|
|
For continued pretraining, I extracted 1B tokens from [Hugging Face's FineWeb CC-MAIN-2024-10](https://huggingface.co/datasets/HuggingFaceFW/fineweb#breakdown-by-dumpcrawl) slice.
|
|
|
### Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
#### Preprocessing
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 0.0002 |
|
- train_batch_size: 2 |
|
- eval_batch_size: 2 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- num_devices: 4 |
|
- gradient_accumulation_steps: 8 |
|
- total_train_batch_size: 64 |
|
- total_eval_batch_size: 8 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: cosine |
|
- lr_scheduler_warmup_steps: 100 |
|
- num_epochs: 2 |
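
How the totals in this list follow from the per-device values, and what a cosine schedule with 100 warmup steps looks like, can be sketched as below. This is an illustration under assumptions, not the trainer's own code: the total step count is assumed to be twice the step count listed at epoch 1.0 in the training results table, and `lr_at` is a hypothetical helper implementing linear warmup followed by a half-cosine decay to zero.

```python
import math

# Values taken from the hyperparameter list.
learning_rate = 2e-4
train_batch_size = 2  # per device
eval_batch_size = 2   # per device
num_devices = 4
gradient_accumulation_steps = 8
warmup_steps = 100

# Effective batch sizes across devices and accumulation steps.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)  # 64 8

# ASSUMPTION: 2 epochs x 23468 steps per epoch.
total_steps = 2 * 23468


def lr_at(step):
    """Linear warmup to the peak rate, then half-cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, total_steps // 2, total_steps):
    print(step, lr_at(step))
```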
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 7.1562        | 0.0   | 1     | 7.1806          |
| 2.7339        | 0.25  | 5867  | 2.6266          |
| 2.6905        | 0.5   | 11734 | 2.5872          |
| 2.6134        | 0.75  | 17601 | 2.5549          |
| 2.532         | 1.0   | 23468 | 2.5235          |
| 2.5319        | 1.25  | 29335 | 2.5067          |
| 2.3336        | 1.5   | 35202 | 2.4968          |
| 2.3486        | 1.75  | 41069 | 2.4942          |
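
To make the loss values easier to interpret, they can be converted to perplexity. The snippet below is a small sketch that assumes the reported validation loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly here.

```python
import math

# Validation losses from the first and last rows of the results table.
initial_val_loss = 7.1806
final_val_loss = 2.4942

# ASSUMPTION: loss is mean per-token cross-entropy in nats, so perplexity = exp(loss).
initial_perplexity = math.exp(initial_val_loss)
final_perplexity = math.exp(final_val_loss)

print(f"initial perplexity: {initial_perplexity:.1f}")
print(f"final perplexity:   {final_perplexity:.1f}")  # roughly 12.1
```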
|
|
|
|
|
### Framework versions |
|
|
|
- PEFT 0.10.0 |
|
- Transformers 4.40.0.dev0 |
|
- Pytorch 2.2.0+cu121 |
|
- Datasets 2.15.0 |
|
- Tokenizers 0.15.0 |
|
|
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
[More Information Needed] |
|
|
|
### Results |
|
|
|
[More Information Needed] |
|
|
|
#### Summary |
|
|
|
|
|
## Model Examination
|
|
|
<!-- Relevant interpretability work for the model goes here --> |
|
|
|
[More Information Needed] |
|
|
|
|
|
## Citation
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@misc{prince2024downcycling,
  title={Efficient LLM Downcycling: Generating Diverse Model Sizes from Pretrained Giants},
  author={Prince Canuma},
  year={2024},
}
```
|
|
|
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl) |
|
<details><summary>See axolotl config</summary> |
|
|
|
axolotl version: `0.4.0` |
|
```yaml
base_model: prince-canuma/Llama-3-6B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
    type: completion
    split: train
dataset_prepared_path: last_run_prepared
val_set_size: 0.001
output_dir: ./llama-3-6b
save_safetensors: true
adapter: qlora
lora_model_dir:

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

lora_r: 128
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: llama-3-6b
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 2
num_epochs: 2
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 2e-4

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 4
eval_table_size:
save_steps: 4000
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|reserved_special_token_0|>"
```
|
|
|
</details><br> |