---
base_model: Alignment-Lab-AI/Neural-network-medium-untuned-theta
tags:
- axolotl
- Alignment-Lab-AI
- Meta-Llama-3
model-index:
- name: Buzz-8b-Large-0.5
results: []
license: apache-2.0
datasets:
- H-D-T/Buzz
language:
- en
---
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6436279eaaef013d1af225c9/fWaQucBWfabfnMsAFN8hv.png)
# Buzz-8b-Large: Advancing Efficiency through Iterative Fine-Tuning
## Introduction
[Alignment Lab AI](https://AlignmentLab.ai) is pleased to introduce our latest research effort: **Buzz-8b-Large**, a state-of-the-art language model developed in collaboration with [Hive Digital Technologies](https://hivedt.com/).
The Buzz model, dataset, and code are being released as a toolkit that demonstrates how existing pretrained language models can be reused and optimized to continually raise the performance achievable from a given FLOPs budget. Alongside this model, we release:
- [The Buzz Dataset](https://huggingface.co/datasets/H-D-T/Buzz)
- [Buzz-2.5b-Small] (coming soon)
- [Buzz-5b-Medium] (coming soon)
- [Buzz-8B-Large](https://huggingface.co/tempbuzz/Lab-AI/Buzz-8B-Large)

The codebase for refining, filtering, and augmenting the data, as well as for pruning and training your own variants, will additionally be released in the coming days.
## Performance
Buzz-8b-Large achieves remarkably low train and validation loss, with loss on unseen data reaching roughly **0.5** by the end of training. This performance showcases the effectiveness of our iterative fine-tuning approach, which maximizes the reuse of pretrained weights. Even the smallest variant, Buzz-Small, maintains a steady train loss of approximately **0.4-0.6** on entirely new data and held-out sets.
[ benchmark scores table here]
## Iterative Fine-Tuning Methodology
Our research builds upon the concepts introduced in several key papers, including:
- [Simple and Scalable Strategies to Continually Pre-train Large Language Models](https://arxiv.org/abs/2403.08763)
- [NEFTune: Noisy Embeddings Improve Instruction Finetuning](https://arxiv.org/abs/2310.05914)
- [An Optimistic Acceleration of AMSGrad for Nonconvex Optimization](https://arxiv.org/abs/1903.01435)
- [Improving Generalization Performance by Switching from Adam to SGD](https://arxiv.org/abs/1712.07628)
- [Orca: Progressive Learning from Complex Explanation Traces of GPT-4](https://arxiv.org/abs/2306.02707v1)
By combining high quality data, iterative fine-tuning with carefully selected "grounding" distributions from previous epochs, we have developed a cost-effective approach that pushes the boundaries of model reuse and optimization.
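As one concrete illustration of the techniques cited above, NEFTune perturbs the input embeddings with uniform noise whose magnitude is scaled by `alpha / sqrt(L * d)` during fine-tuning. The sketch below shows that idea in PyTorch; the function name and the `alpha` default are illustrative and not taken from our training code.

```python
import torch

def neftune_embed(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune-style uniform noise to a batch of input embeddings.

    embeddings: (batch, seq_len, dim) tensor from the model's embedding layer.
    Per the NEFTune paper, each element gets uniform noise in
    [-mag, mag] with mag = alpha / sqrt(seq_len * dim).
    """
    seq_len, dim = embeddings.shape[1], embeddings.shape[2]
    mag = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-mag, mag)
    return embeddings + noise
```

In practice this is applied only during training (e.g. by wrapping the embedding layer's forward pass); at inference time the embeddings are left untouched.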
## Notably, the models have not yet appeared to plateau under the application of these techniques
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6436279eaaef013d1af225c9/wyHyDIJnNmbomonZKQAD0.png)
## Chat Template and Inference
To use the Buzz-8b-Large model for chat-based tasks, you can utilize the provided chat template. Here's an example of how to format the chat template and perform inference using the Hugging Face Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
model_name = "H-D-T/Buzz-8b-Large-v0.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set the device to run the model on ("cuda" for GPU, "cpu" for CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Define and tokenize the input prompt
prompt = "Hello, how are you today?"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

# Generate the model's response (max_new_tokens bounds the reply
# length independently of the prompt length)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
)

# Decode the generated response
response = tokenizer.decode(output[0], skip_special_tokens=True)
print("Input:", prompt)
print("Response:", response)
```
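For multi-turn chat, the base Llama-3 models use a header-delimited prompt format. Assuming Buzz inherits that template (the card does not pin down the exact template, so treat this as illustrative; with a recent Transformers release, `tokenizer.apply_chat_template` renders it for you), the prompt string can be built like this:

```python
def format_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts into a Llama-3-style prompt.

    Assumes the Llama-3 header/eot special-token format; if the tokenizer
    ships its own chat template, prefer tokenizer.apply_chat_template.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open an assistant turn to cue the model to respond
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you today?"},
])
```

The resulting string can be tokenized and passed to `model.generate` exactly as in the example above.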
## Conclusion
We intend to focus on *updating* and improving the performance of these models, and surrounding open sourced infrastructure. Our next effort will focus on context and implementing the research currently being conducted by [Wing-Lian](https://github.com/winglian), the lead developer of the [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) training framework that underpins these experiments. We encourage the community to explore Wing-Lian's work, such as the [Llama-3-8b-64k-PoSE](https://huggingface.co/winglian/Llama-3-8b-64k-PoSE) and [llama-3-8b-256k-PoSE](https://huggingface.co/winglian/llama-3-8b-256k-PoSE) models, which showcase the potential for further advancements in language modeling.
Buzz is intended as a proof of concept and a toolkit to demonstrate, and enable the community in, the pursuit of efficient and effective locally run, personally owned language models. Through collaboration with [Hive Digital Technologies](https://hivedigitaltechnologies.com/), who enabled us to perform this research, we have demonstrated the immense potential for model reuse and optimization. The Buzz models and dataset are open sourced with [////////].
## Credits
To the many researchers who have open sourced their knowledge and tools to allow us to pursue this.
To [Hive Digital Technologies](https://hivedigitaltechnologies.com/) for providing compute, advice, and meaningful research insight.
To [Meta](https://llama.meta.com) for developing the Llama models, and maintaining a philosophy of supporting open research and open source.
To Wing et al. with the [Open Access AI Collective](https://github.com/OpenAccess-AI-Collective) for developing [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), assisting with research, and generally being geniuses.
To [Thomas Capelle](https://wandb.ai/capecape) et al. for their work on [LLM_Surgery](https://wandb.ai/llm_surgery).
And to the many, many others who are too numerous to name.
# Citations
```
@misc{ibrahim2024simple,
title={Simple and Scalable Strategies to Continually Pre-train Large Language Models},
author={Adam Ibrahim and Benjamin Thérien and Kshitij Gupta and Mats L. Richter and Quentin Anthony and Timothée Lesort and Eugene Belilovsky and Irina Rish},
year={2024},
eprint={2403.08763},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{jain2023neftune,
title={NEFTune: Noisy Embeddings Improve Instruction Finetuning},
author={Neel Jain and Ping-yeh Chiang and Yuxin Wen and John Kirchenbauer and Hong-Min Chu and Gowthami Somepalli and Brian R. Bartoldson and Bhavya Kailkhura and Avi Schwarzschild and Aniruddha Saha and Micah Goldblum and Jonas Geiping and Tom Goldstein},
year={2023},
eprint={2310.05914},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{wang2020optimistic,
title={An Optimistic Acceleration of AMSGrad for Nonconvex Optimization},
author={Jun-Kun Wang and Xiaoyun Li and Belhal Karimi and Ping Li},
year={2020},
eprint={1903.01435},
archivePrefix={arXiv},
primaryClass={stat.ML}
}
@misc{keskar2017improving,
title={Improving Generalization Performance by Switching from Adam to SGD},
author={Nitish Shirish Keskar and Richard Socher},
year={2017},
eprint={1712.07628},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{mukherjee2023orca,
title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4},
author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
year={2023},
eprint={2306.02707},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```